Sample records for specific sequence features

  1. Identifying Group-Specific Sequences for Microbial Communities Using Long k-mer Sequence Signatures

    PubMed Central

    Wang, Ying; Fu, Lei; Ren, Jie; Yu, Zhaoxia; Chen, Ting; Sun, Fengzhu

    2018-01-01

    Comparing metagenomic samples is crucial for understanding microbial communities. For different groups of microbial communities, such as human gut metagenomic samples from patients with a certain disease and healthy controls, identifying group-specific sequences offers essential information for potential biomarker discovery. A sequence that is present, or rich, in one group, but absent, or scarce, in another group is considered “group-specific” in our study. Our main purpose is to discover group-specific sequence regions between control and case groups as disease-associated markers. We developed a long k-mer (k ≥ 30 bps)-based computational pipeline to detect group-specific sequences at strain resolution free from reference sequences, sequence alignments, and metagenome-wide de novo assembly. We called our method MetaGO: Group-specific oligonucleotide analysis for metagenomic samples. An open-source pipeline on Apache Spark was developed with parallel computing. We applied MetaGO to one simulated and three real metagenomic datasets to evaluate the discriminative capability of identified group-specific markers. In the simulated dataset, 99.11% of group-specific logical 40-mers covered 98.89% disease-specific regions from the disease-associated strain. In addition, 97.90% of group-specific numerical 40-mers covered 99.61 and 96.39% of differentially abundant genome and regions between two groups, respectively. For a large-scale metagenomic liver cirrhosis (LC)-associated dataset, we identified 37,647 group-specific 40-mer features. Any one of the features can predict disease status of the training samples with the average of sensitivity and specificity higher than 0.8. The random forests classification using the top 10 group-specific features yielded a higher AUC (from ∼0.8 to ∼0.9) than that of previous studies. All group-specific 40-mers were present in LC patients, but not healthy controls. All the assembled 11 LC-specific sequences can be mapped to two strains of Veillonella parvula: UTDB1-3 and DSM2008. The experiments on the other two real datasets related to Inflammatory Bowel Disease and Type 2 Diabetes in Women consistently demonstrated that MetaGO achieved better prediction accuracy with fewer features compared to previous studies. The experiments showed that MetaGO is a powerful tool for identifying group-specific k-mers, which would be clinically applicable for disease prediction. MetaGO is available at https://github.com/VVsmileyx/MetaGO. PMID:29774017

  2. Visualizing bacterial tRNA identity determinants and antideterminants using function logos and inverse function logos

    PubMed Central

    Freyhult, Eva; Moulton, Vincent; Ardell, David H.

    2006-01-01

    Sequence logos are stacked bar graphs that generalize the notion of consensus sequence. They employ entropy statistics very effectively to display variation in a structural alignment of sequences of a common function, while emphasizing its over-represented features. Yet sequence logos cannot display features that distinguish functional subclasses within a structurally related superfamily nor do they display under-represented features. We introduce two extensions to address these needs: function logos and inverse logos. Function logos display subfunctions that are over-represented among sequences carrying a specific feature. Inverse logos generalize both sequence logos and function logos by displaying under-represented, rather than over-represented, features or functions in structural alignments. To make inverse logos, a compositional inverse is applied to the feature or function frequency distributions before logo construction, where a compositional inverse is a mathematical transform that makes common features or functions rare and vice versa. We applied these methods to a database of structurally aligned bacterial tDNAs to create highly condensed, birds-eye views of potentially all so-called identity determinants and antideterminants that confer specific amino acid charging or initiator function on tRNAs in bacteria. We recovered both known and a few potentially novel identity elements. Function logos and inverse logos are useful tools for exploratory bioinformatic analysis of structure–function relationships in sequence families and superfamilies. PMID:16473848

  3. Phylogeny of isolates of Prunus necrotic ringspot virus from the Ilarvirus Ringtest and identification of group-specific features.

    PubMed

    Hammond, R W

    2003-06-01

    Isolates of Prunus necrotic ringspot virus (PNRSV) were examined to establish the level of naturally occurring sequence variation in the coat protein (CP) gene and to identify group-specific genome features that may prove valuable for the generation of diagnostic reagents. Phylogenetic analysis of a 452 bp sequence of 68 virus isolates, 20 obtained from the European Union Ilarvirus Ringtest held in October 1998, confirmed the clustering of the isolates into three distinct groups. Although no correlation was found between the sequence and host or geographic origin, there was a general trend for severe isolates to cluster into one group. Group-specific features have been identified for discrimination between virus strains.

  4. Evaluating, Comparing, and Interpreting Protein Domain Hierarchies

    PubMed Central

    2014-01-01

    Abstract Arranging protein domain sequences hierarchically into evolutionarily divergent subgroups is important for investigating evolutionary history, for speeding up web-based similarity searches, for identifying sequence determinants of protein function, and for genome annotation. However, whether or not a particular hierarchy is optimal is often unclear, and independently constructed hierarchies for the same domain can often differ significantly. This article describes methods for statistically evaluating specific aspects of a hierarchy, for probing the criteria underlying its construction and for direct comparisons between hierarchies. Information theoretical notions are used to quantify the contributions of specific hierarchical features to the underlying statistical model. Such features include subhierarchies, sequence subgroups, individual sequences, and subgroup-associated signature patterns. Underlying properties are graphically displayed in plots of each specific feature's contributions, in heat maps of pattern residue conservation, in “contrast alignments,” and through cross-mapping of subgroups between hierarchies. Together, these approaches provide a deeper understanding of protein domain functional divergence, reveal uncertainties caused by inconsistent patterns of sequence conservation, and help resolve conflicts between competing hierarchies. PMID:24559108

  5. Domain-specific learning of grammatical structure in musical and phonological sequences.

    PubMed

    Bly, Benjamin Martin; Carrión, Ricardo E; Rasch, Björn

    2009-01-01

    Artificial grammar learning depends on acquisition of abstract structural representations rather than domain-specific representational constraints, or so many studies tell us. Using an artificial grammar task, we compared learning performance in two stimulus domains in which respondents have differing tacit prior knowledge. We found that despite grammatically identical sequence structures, learning was better for harmonically related chord sequences than for letter name sequences or harmonically unrelated chord sequences. We also found transfer effects within the musical and letter name tasks, but not across the domains. We conclude that knowledge acquired in implicit learning depends not only on abstract features of structured stimuli, but that the learning of regularities is in some respects domain-specific and strongly linked to particular features of the stimulus domain.

  6. Discriminative prediction of mammalian enhancers from DNA sequence

    PubMed Central

    Lee, Dongwon; Karchin, Rachel; Beer, Michael A.

    2011-01-01

    Accurately predicting regulatory sequences and enhancers in entire genomes is an important but difficult problem, especially in large vertebrate genomes. With the advent of ChIP-seq technology, experimental detection of genome-wide EP300/CREBBP bound regions provides a powerful platform to develop predictive tools for regulatory sequences and to study their sequence properties. Here, we develop a support vector machine (SVM) framework which can accurately identify EP300-bound enhancers using only genomic sequence and an unbiased set of general sequence features. Moreover, we find that the predictive sequence features identified by the SVM classifier reveal biologically relevant sequence elements enriched in the enhancers, but we also identify other features that are significantly depleted in enhancers. The predictive sequence features are evolutionarily conserved and spatially clustered, providing further support of their functional significance. Although our SVM is trained on experimental data, we also predict novel enhancers and show that these putative enhancers are significantly enriched in both ChIP-seq signal and DNase I hypersensitivity signal in the mouse brain and are located near relevant genes. Finally, we present results of comparisons between other EP300/CREBBP data sets using our SVM and uncover sequence elements enriched and/or depleted in the different classes of enhancers. Many of these sequence features play a role in specifying tissue-specific or developmental-stage-specific enhancer activity, but our results indicate that some features operate in a general or tissue-independent manner. In addition to providing a high confidence list of enhancer targets for subsequent experimental investigation, these results contribute to our understanding of the general sequence structure of vertebrate enhancers. PMID:21875935

  7. TFBSshape: a motif database for DNA shape features of transcription factor binding sites.

    PubMed

    Yang, Lin; Zhou, Tianyin; Dror, Iris; Mathelier, Anthony; Wasserman, Wyeth W; Gordân, Raluca; Rohs, Remo

    2014-01-01

    Transcription factor binding sites (TFBSs) are most commonly characterized by the nucleotide preferences at each position of the DNA target. Whereas these sequence motifs are quite accurate descriptions of DNA binding specificities of transcription factors (TFs), proteins recognize DNA as a three-dimensional object. DNA structural features refine the description of TF binding specificities and provide mechanistic insights into protein-DNA recognition. Existing motif databases contain extensive nucleotide sequences identified in binding experiments based on their selection by a TF. To utilize DNA shape information when analysing the DNA binding specificities of TFs, we developed a new tool, the TFBSshape database (available at http://rohslab.cmb.usc.edu/TFBSshape/), for calculating DNA structural features from nucleotide sequences provided by motif databases. The TFBSshape database can be used to generate heat maps and quantitative data for DNA structural features (i.e., minor groove width, roll, propeller twist and helix twist) for 739 TF datasets from 23 different species derived from the motif databases JASPAR and UniPROBE. As demonstrated for the basic helix-loop-helix and homeodomain TF families, our TFBSshape database can be used to compare, qualitatively and quantitatively, the DNA binding specificities of closely related TFs and, thus, uncover differential DNA binding specificities that are not apparent from nucleotide sequence alone.

  8. TFBSshape: a motif database for DNA shape features of transcription factor binding sites

    PubMed Central

    Yang, Lin; Zhou, Tianyin; Dror, Iris; Mathelier, Anthony; Wasserman, Wyeth W.; Gordân, Raluca; Rohs, Remo

    2014-01-01

    Transcription factor binding sites (TFBSs) are most commonly characterized by the nucleotide preferences at each position of the DNA target. Whereas these sequence motifs are quite accurate descriptions of DNA binding specificities of transcription factors (TFs), proteins recognize DNA as a three-dimensional object. DNA structural features refine the description of TF binding specificities and provide mechanistic insights into protein–DNA recognition. Existing motif databases contain extensive nucleotide sequences identified in binding experiments based on their selection by a TF. To utilize DNA shape information when analysing the DNA binding specificities of TFs, we developed a new tool, the TFBSshape database (available at http://rohslab.cmb.usc.edu/TFBSshape/), for calculating DNA structural features from nucleotide sequences provided by motif databases. The TFBSshape database can be used to generate heat maps and quantitative data for DNA structural features (i.e., minor groove width, roll, propeller twist and helix twist) for 739 TF datasets from 23 different species derived from the motif databases JASPAR and UniPROBE. As demonstrated for the basic helix-loop-helix and homeodomain TF families, our TFBSshape database can be used to compare, qualitatively and quantitatively, the DNA binding specificities of closely related TFs and, thus, uncover differential DNA binding specificities that are not apparent from nucleotide sequence alone. PMID:24214955

  9. iFeature: a python package and web server for features extraction and selection from protein and peptide sequences.

    PubMed

    Chen, Zhen; Zhao, Pei; Li, Fuyi; Leier, André; Marquez-Lago, Tatiana T; Wang, Yanan; Webb, Geoffrey I; Smith, A Ian; Daly, Roger J; Chou, Kuo-Chen; Song, Jiangning

    2018-03-08

    Structural and physiochemical descriptors extracted from sequence data have been widely used to represent sequences and predict structural, functional, expression and interaction profiles of proteins and peptides as well as DNAs/RNAs. Here, we present iFeature, a versatile Python-based toolkit for generating various numerical feature representation schemes for both protein and peptide sequences. iFeature is capable of calculating and extracting a comprehensive spectrum of 18 major sequence encoding schemes that encompass 53 different types of feature descriptors. It also allows users to extract specific amino acid properties from the AAindex database. Furthermore, iFeature integrates 12 different types of commonly used feature clustering, selection, and dimensionality reduction algorithms, greatly facilitating training, analysis, and benchmarking of machine-learning models. The functionality of iFeature is made freely available via an online web server and a stand-alone toolkit. http://iFeature.erc.monash.edu/; https://github.com/Superzchen/iFeature/. jiangning.song@monash.edu; kcchou@gordonlifescience.org; roger.daly@monash.edu. Supplementary data are available at Bioinformatics online.

  10. A Feature-Based Approach to Modeling Protein–DNA Interactions

    PubMed Central

    Segal, Eran

    2008-01-01

    Transcription factor (TF) binding to its DNA target site is a fundamental regulatory interaction. The most common model used to represent TF binding specificities is a position specific scoring matrix (PSSM), which assumes independence between binding positions. However, in many cases, this simplifying assumption does not hold. Here, we present feature motif models (FMMs), a novel probabilistic method for modeling TF–DNA interactions, based on log-linear models. Our approach uses sequence features to represent TF binding specificities, where each feature may span multiple positions. We develop the mathematical formulation of our model and devise an algorithm for learning its structural features from binding site data. We also developed a discriminative motif finder, which discovers de novo FMMs that are enriched in target sets of sequences compared to background sets. We evaluate our approach on synthetic data and on the widely used TF chromatin immunoprecipitation (ChIP) dataset of Harbison et al. We then apply our algorithm to high-throughput TF ChIP data from mouse and human, reveal sequence features that are present in the binding specificities of mouse and human TFs, and show that FMMs explain TF binding significantly better than PSSMs. Our FMM learning and motif finder software are available at http://genie.weizmann.ac.il/. PMID:18725950

  11. Learning Behavior Characterization with Multi-Feature, Hierarchical Activity Sequences

    ERIC Educational Resources Information Center

    Ye, Cheng; Segedy, James R.; Kinnebrew, John S.; Biswas, Gautam

    2015-01-01

    This paper discusses Multi-Feature Hierarchical Sequential Pattern Mining, MFH-SPAM, a novel algorithm that efficiently extracts patterns from students' learning activity sequences. This algorithm extends an existing sequential pattern mining algorithm by dynamically selecting the level of specificity for hierarchically-defined features…

  12. Analysis and Prediction of Exon Skipping Events from RNA-Seq with Sequence Information Using Rotation Forest.

    PubMed

    Du, Xiuquan; Hu, Changlin; Yao, Yu; Sun, Shiwei; Zhang, Yanping

    2017-12-12

    In bioinformatics, exon skipping (ES) event prediction is an essential part of alternative splicing (AS) event analysis. Although many methods have been developed to predict ES events, a solution has yet to be found. In this study, given the limitations of machine learning algorithms with RNA-Seq data or genome sequences, a new feature, called RS (RNA-seq and sequence) features, was constructed. These features include RNA-Seq features derived from the RNA-Seq data and sequence features derived from genome sequences. We propose a novel Rotation Forest classifier to predict ES events with the RS features (RotaF-RSES). To validate the efficacy of RotaF-RSES, a dataset from two human tissues was used, and RotaF-RSES achieved an accuracy of 98.4%, a specificity of 99.2%, a sensitivity of 94.1%, and an area under the curve (AUC) of 98.6%. When compared to the other available methods, the results indicate that RotaF-RSES is efficient and can predict ES events with RS features.

  13. Gene calling and bacterial genome annotation with BG7.

    PubMed

    Tobes, Raquel; Pareja-Tobes, Pablo; Manrique, Marina; Pareja-Tobes, Eduardo; Kovach, Evdokim; Alekhin, Alexey; Pareja, Eduardo

    2015-01-01

    New massive sequencing technologies are providing many bacterial genome sequences from diverse taxa but a refined annotation of these genomes is crucial for obtaining scientific findings and new knowledge. Thus, bacterial genome annotation has emerged as a key point to investigate in bacteria. Any efficient tool designed specifically to annotate bacterial genomes sequenced with massively parallel technologies has to consider the specific features of bacterial genomes (absence of introns and scarcity of nonprotein-coding sequence) and of next-generation sequencing (NGS) technologies (presence of errors and not perfectly assembled genomes). These features make it convenient to focus on coding regions and, hence, on protein sequences that are the elements directly related with biological functions. In this chapter we describe how to annotate bacterial genomes with BG7, an open-source tool based on a protein-centered gene calling/annotation paradigm. BG7 is specifically designed for the annotation of bacterial genomes sequenced with NGS. This tool is sequence error tolerant maintaining their capabilities for the annotation of highly fragmented genomes or for annotating mixed sequences coming from several genomes (as those obtained through metagenomics samples). BG7 has been designed with scalability as a requirement, with a computing infrastructure completely based on cloud computing (Amazon Web Services).

  14. repDNA: a Python package to generate various modes of feature vectors for DNA sequences by incorporating user-defined physicochemical properties and sequence-order effects.

    PubMed

    Liu, Bin; Liu, Fule; Fang, Longyun; Wang, Xiaolong; Chou, Kuo-Chen

    2015-04-15

    In order to develop powerful computational predictors for identifying the biological features or attributes of DNAs, one of the most challenging problems is to find a suitable approach to effectively represent the DNA sequences. To facilitate the studies of DNAs and nucleotides, we developed a Python package called representations of DNAs (repDNA) for generating the widely used features reflecting the physicochemical properties and sequence-order effects of DNAs and nucleotides. There are three feature groups composed of 15 features. The first group calculates three nucleic acid composition features describing the local sequence information by means of kmers; the second group calculates six autocorrelation features describing the level of correlation between two oligonucleotides along a DNA sequence in terms of their specific physicochemical properties; the third group calculates six pseudo nucleotide composition features, which can be used to represent a DNA sequence with a discrete model or vector yet still keep considerable sequence-order information via the physicochemical properties of its constituent oligonucleotides. In addition, these features can be easily calculated based on both the built-in and user-defined properties via using repDNA. The repDNA Python package is freely accessible to the public at http://bioinformatics.hitsz.edu.cn/repDNA/. bliu@insun.hit.edu.cn or kcchou@gordonlifescience.org Supplementary data are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  15. A common class of transcripts with 5'-intron depletion, distinct early coding sequence features, and N1-methyladenosine modification.

    PubMed

    Cenik, Can; Chua, Hon Nian; Singh, Guramrit; Akef, Abdalla; Snyder, Michael P; Palazzo, Alexander F; Moore, Melissa J; Roth, Frederick P

    2017-03-01

    Introns are found in 5' untranslated regions (5'UTRs) for 35% of all human transcripts. These 5'UTR introns are not randomly distributed: Genes that encode secreted, membrane-bound and mitochondrial proteins are less likely to have them. Curiously, transcripts lacking 5'UTR introns tend to harbor specific RNA sequence elements in their early coding regions. To model and understand the connection between coding-region sequence and 5'UTR intron status, we developed a classifier that can predict 5'UTR intron status with >80% accuracy using only sequence features in the early coding region. Thus, the classifier identifies transcripts with 5 ' proximal- i ntron- m inus-like-coding regions ("5IM" transcripts). Unexpectedly, we found that the early coding sequence features defining 5IM transcripts are widespread, appearing in 21% of all human RefSeq transcripts. The 5IM class of transcripts is enriched for non-AUG start codons, more extensive secondary structure both preceding the start codon and near the 5' cap, greater dependence on eIF4E for translation, and association with ER-proximal ribosomes. 5IM transcripts are bound by the exon junction complex (EJC) at noncanonical 5' proximal positions. Finally, N 1 -methyladenosines are specifically enriched in the early coding regions of 5IM transcripts. Taken together, our analyses point to the existence of a distinct 5IM class comprising ∼20% of human transcripts. This class is defined by depletion of 5' proximal introns, presence of specific RNA sequence features associated with low translation efficiency, N 1 -methyladenosines in the early coding region, and enrichment for noncanonical binding by the EJC. © 2017 Cenik et al.; Published by Cold Spring Harbor Laboratory Press for the RNA Society.

  16. kmer-SVM: a web server for identifying predictive regulatory sequence features in genomic data sets

    PubMed Central

    Fletez-Brant, Christopher; Lee, Dongwon; McCallion, Andrew S.; Beer, Michael A.

    2013-01-01

    Massively parallel sequencing technologies have made the generation of genomic data sets a routine component of many biological investigations. For example, Chromatin immunoprecipitation followed by sequence assays detect genomic regions bound (directly or indirectly) by specific factors, and DNase-seq identifies regions of open chromatin. A major bottleneck in the interpretation of these data is the identification of the underlying DNA sequence code that defines, and ultimately facilitates prediction of, these transcription factor (TF) bound or open chromatin regions. We have recently developed a novel computational methodology, which uses a support vector machine (SVM) with kmer sequence features (kmer-SVM) to identify predictive combinations of short transcription factor-binding sites, which determine the tissue specificity of these genomic assays (Lee, Karchin and Beer, Discriminative prediction of mammalian enhancers from DNA sequence. Genome Res. 2011; 21:2167–80). This regulatory information can (i) give confidence in genomic experiments by recovering previously known binding sites, and (ii) reveal novel sequence features for subsequent experimental testing of cooperative mechanisms. Here, we describe the development and implementation of a web server to allow the broader research community to independently apply our kmer-SVM to analyze and interpret their genomic datasets. We analyze five recently published data sets and demonstrate how this tool identifies accessory factors and repressive sequence elements. kmer-SVM is available at http://kmersvm.beerlab.org. PMID:23771147

  17. kmer-SVM: a web server for identifying predictive regulatory sequence features in genomic data sets.

    PubMed

    Fletez-Brant, Christopher; Lee, Dongwon; McCallion, Andrew S; Beer, Michael A

    2013-07-01

    Massively parallel sequencing technologies have made the generation of genomic data sets a routine component of many biological investigations. For example, Chromatin immunoprecipitation followed by sequence assays detect genomic regions bound (directly or indirectly) by specific factors, and DNase-seq identifies regions of open chromatin. A major bottleneck in the interpretation of these data is the identification of the underlying DNA sequence code that defines, and ultimately facilitates prediction of, these transcription factor (TF) bound or open chromatin regions. We have recently developed a novel computational methodology, which uses a support vector machine (SVM) with kmer sequence features (kmer-SVM) to identify predictive combinations of short transcription factor-binding sites, which determine the tissue specificity of these genomic assays (Lee, Karchin and Beer, Discriminative prediction of mammalian enhancers from DNA sequence. Genome Res. 2011; 21:2167-80). This regulatory information can (i) give confidence in genomic experiments by recovering previously known binding sites, and (ii) reveal novel sequence features for subsequent experimental testing of cooperative mechanisms. Here, we describe the development and implementation of a web server to allow the broader research community to independently apply our kmer-SVM to analyze and interpret their genomic datasets. We analyze five recently published data sets and demonstrate how this tool identifies accessory factors and repressive sequence elements. kmer-SVM is available at http://kmersvm.beerlab.org.

  18. Cascade detection for the extraction of localized sequence features; specificity results for HIV-1 protease and structure-function results for the Schellman loop.

    PubMed

    Newell, Nicholas E

    2011-12-15

    The extraction of the set of features most relevant to function from classified biological sequence sets is still a challenging problem. A central issue is the determination of expected counts for higher order features so that artifact features may be screened. Cascade detection (CD), a new algorithm for the extraction of localized features from sequence sets, is introduced. CD is a natural extension of the proportional modeling techniques used in contingency table analysis into the domain of feature detection. The algorithm is successfully tested on synthetic data and then applied to feature detection problems from two different domains to demonstrate its broad utility. An analysis of HIV-1 protease specificity reveals patterns of strong first-order features that group hydrophobic residues by side chain geometry and exhibit substantial symmetry about the cleavage site. Higher order results suggest that favorable cooperativity is weak by comparison and broadly distributed, but indicate possible synergies between negative charge and hydrophobicity in the substrate. Structure-function results for the Schellman loop, a helix-capping motif in proteins, contain strong first-order features and also show statistically significant cooperativities that provide new insights into the design of the motif. These include a new 'hydrophobic staple' and multiple amphipathic and electrostatic pair features. CD should prove useful not only for sequence analysis, but also for the detection of multifactor synergies in cross-classified data from clinical studies or other sources. Windows XP/7 application and data files available at: https://sites.google.com/site/cascadedetect/home. nacnewell@comcast.net Supplementary information is available at Bioinformatics online.

  19. Analysis of B Cell Repertoire Dynamics Following Hepatitis B Vaccination in Humans, and Enrichment of Vaccine-specific Antibody Sequences.

    PubMed

    Galson, Jacob D; Trück, Johannes; Fowler, Anna; Clutterbuck, Elizabeth A; Münz, Márton; Cerundolo, Vincenzo; Reinhard, Claudia; van der Most, Robbert; Pollard, Andrew J; Lunter, Gerton; Kelly, Dominic F

    2015-12-01

    Generating a diverse B cell immunoglobulin repertoire is essential for protection against infection. The repertoire in humans can now be comprehensively measured by high-throughput sequencing. Using hepatitis B vaccination as a model, we determined how the total immunoglobulin sequence repertoire changes following antigen exposure in humans, and compared this to sequences from vaccine-specific sorted cells. Clonal sequence expansions were seen 7 days after vaccination, which correlated with vaccine-specific plasma cell numbers. These expansions caused an increase in mutation, and a decrease in diversity and complementarity-determining region 3 sequence length in the repertoire. We also saw an increase in sequence convergence between participants 14 and 21 days after vaccination, coinciding with an increase of vaccine-specific memory cells. These features allowed development of a model for in silico enrichment of vaccine-specific sequences from the total repertoire. Identifying antigen-specific sequences from total repertoire data could aid our understanding B cell driven immunity, and be used for disease diagnostics and vaccine evaluation.

  20. Applications of alignment-free methods in epigenomics.

    PubMed

    Pinello, Luca; Lo Bosco, Giosuè; Yuan, Guo-Cheng

    2014-05-01

    Epigenetic mechanisms play an important role in the regulation of cell type-specific gene activities, yet how epigenetic patterns are established and maintained remains poorly understood. Recent studies have supported a role of DNA sequences in recruitment of epigenetic regulators. Alignment-free methods have been applied to identify distinct sequence features that are associated with epigenetic patterns and to predict epigenomic profiles. Here, we review recent advances in such applications, including the methods to map DNA sequence to feature space, sequence comparison and prediction models. Computational studies using these methods have provided important insights into the epigenetic regulatory mechanisms.

  1. Putative Monofunctional Type I Polyketide Synthase Units: A Dinoflagellate-Specific Feature?

    PubMed Central

    Eichholz, Karsten; Beszteri, Bánk; John, Uwe

    2012-01-01

    Marine dinoflagellates (alveolata) are microalgae of which some cause harmful algal blooms and produce a broad variety of most likely polyketide synthesis derived phycotoxins. Recently, novel polyketide synthesase (PKS) transcripts have been described from the Florida red tide dinoflagellate Karenia brevis (gymnodiniales) which are evolutionarily related to Type I PKS but were apparently expressed as monofunctional proteins, a feature typical of Type II PKS. Here, we investigated expression units of PKS I-like sequences in Alexandrium ostenfeldii (gonyaulacales) and Heterocapsa triquetra (peridiniales) at the transcript and protein level. The five full length transcripts we obtained were all characterized by polyadenylation, a 3′ UTR and the dinoflagellate specific spliced leader sequence at the 5′end. Each of the five transcripts encoded a single ketoacylsynthase (KS) domain showing high similarity to K. brevis KS sequences. The monofunctional structure was also confirmed using dinoflagellate specific KS antibodies in Western Blots. In a maximum likelihood phylogenetic analysis of KS domains from diverse PKSs, dinoflagellate KSs formed a clade placed well within the protist Type I PKS clade between apicomplexa, haptophytes and chlorophytes. These findings indicate that the atypical PKS I structure, i.e., expression as putative monofunctional units, might be a dinoflagellate specific feature. In addition, the sequenced transcripts harbored a previously unknown, apparently dinoflagellate specific conserved N-terminal domain. We discuss the implications of this novel region with regard to the putative monofunctional organization of Type I PKS in dinoflagellates. PMID:23139807

  2. Prediction of enhancer-promoter interactions via natural language processing.

    PubMed

    Zeng, Wanwen; Wu, Mengmeng; Jiang, Rui

    2018-05-09

    Precise identification of three-dimensional genome organization, especially enhancer-promoter interactions (EPIs), is important to deciphering gene regulation, cell differentiation and disease mechanisms. Currently, it is a challenging task to distinguish true interactions from other nearby non-interacting ones since the power of traditional experimental methods is limited due to low resolution or low throughput. We propose a novel computational framework EP2vec to assay three-dimensional genomic interactions. We first extract sequence embedding features, defined as fixed-length vector representations learned from variable-length sequences using an unsupervised deep learning method in natural language processing. Then, we train a classifier to predict EPIs using the learned representations in supervised way. Experimental results demonstrate that EP2vec obtains F1 scores ranging from 0.841~ 0.933 on different datasets, which outperforms existing methods. We prove the robustness of sequence embedding features by carrying out sensitivity analysis. Besides, we identify motifs that represent cell line-specific information through analysis of the learned sequence embedding features by adopting attention mechanism. Last, we show that even superior performance with F1 scores 0.889~ 0.940 can be achieved by combining sequence embedding features and experimental features. EP2vec sheds light on feature extraction for DNA sequences of arbitrary lengths and provides a powerful approach for EPIs identification.

  3. PrimerMapper: high throughput primer design and graphical assembly for PCR and SNP detection

    PubMed Central

    O’Halloran, Damien M.

    2016-01-01

    Primer design represents a widely employed gambit in diverse molecular applications including PCR, sequencing, and probe hybridization. Variations of PCR, including primer walking, allele-specific PCR, and nested PCR provide specialized validation and detection protocols for molecular analyses that often require screening large numbers of DNA fragments. In these cases, automated sequence retrieval and processing become important features, and furthermore, a graphic that provides the user with a visual guide to the distribution of designed primers across targets is most helpful in quickly ascertaining primer coverage. To this end, I describe here, PrimerMapper, which provides a comprehensive graphical user interface that designs robust primers from any number of inputted sequences while providing the user with both, graphical maps of primer distribution for each inputted sequence, and also a global assembled map of all inputted sequences with designed primers. PrimerMapper also enables the visualization of graphical maps within a browser and allows the user to draw new primers directly onto the webpage. Other features of PrimerMapper include allele-specific design features for SNP genotyping, a remote BLAST window to NCBI databases, and remote sequence retrieval from GenBank and dbSNP. PrimerMapper is hosted at GitHub and freely available without restriction. PMID:26853558

  4. The memory is in the details: relations between memory for the specific features of events and long-term recall during infancy.

    PubMed

    Bauer, Patricia J; Lukowski, Angela F

    2010-09-01

    The second year of life is marked by pronounced changes in the length of time over which events are remembered. We tested whether the age-related differences are related to differences in memory for the specific features of events. In our study, 16- and 20-month-olds were tested for immediate and long-term recall of individual actions and temporal order of actions of three-step sequences in an elicited imitation paradigm as well as for forced-choice recognition of the specific feature of the props used to produce the sequences. Memory for the props was related to long-term recall of the events only for the 20-month-olds. It accounted for unique variance above and beyond the variance explained by immediate recall of the individual actions and the temporal order of actions of the sequences. The different pattern of relations in the older and younger infants seemingly reflects a developmental difference in the determinants of long-term recall over the second year of life. Copyright 2010 Elsevier Inc. All rights reserved.

  5. Enhanced Regulatory Sequence Prediction Using Gapped k-mer Features

    PubMed Central

    Mohammad-Noori, Morteza; Beer, Michael A.

    2014-01-01

    Abstract Oligomers of length k, or k-mers, are convenient and widely used features for modeling the properties and functions of DNA and protein sequences. However, k-mers suffer from the inherent limitation that if the parameter k is increased to resolve longer features, the probability of observing any specific k-mer becomes very small, and k-mer counts approach a binary variable, with most k-mers absent and a few present once. Thus, any statistical learning approach using k-mers as features becomes susceptible to noisy training set k-mer frequencies once k becomes large. To address this problem, we introduce alternative feature sets using gapped k-mers, a new classifier, gkm-SVM, and a general method for robust estimation of k-mer frequencies. To make the method applicable to large-scale genome wide applications, we develop an efficient tree data structure for computing the kernel matrix. We show that compared to our original kmer-SVM and alternative approaches, our gkm-SVM predicts functional genomic regulatory elements and tissue specific enhancers with significantly improved accuracy, increasing the precision by up to a factor of two. We then show that gkm-SVM consistently outperforms kmer-SVM on human ENCODE ChIP-seq datasets, and further demonstrate the general utility of our method using a Naïve-Bayes classifier. Although developed for regulatory sequence analysis, these methods can be applied to any sequence classification problem. PMID:25033408

  6. Enhanced regulatory sequence prediction using gapped k-mer features.

    PubMed

    Ghandi, Mahmoud; Lee, Dongwon; Mohammad-Noori, Morteza; Beer, Michael A

    2014-07-01

    Oligomers of length k, or k-mers, are convenient and widely used features for modeling the properties and functions of DNA and protein sequences. However, k-mers suffer from the inherent limitation that if the parameter k is increased to resolve longer features, the probability of observing any specific k-mer becomes very small, and k-mer counts approach a binary variable, with most k-mers absent and a few present once. Thus, any statistical learning approach using k-mers as features becomes susceptible to noisy training set k-mer frequencies once k becomes large. To address this problem, we introduce alternative feature sets using gapped k-mers, a new classifier, gkm-SVM, and a general method for robust estimation of k-mer frequencies. To make the method applicable to large-scale genome wide applications, we develop an efficient tree data structure for computing the kernel matrix. We show that compared to our original kmer-SVM and alternative approaches, our gkm-SVM predicts functional genomic regulatory elements and tissue specific enhancers with significantly improved accuracy, increasing the precision by up to a factor of two. We then show that gkm-SVM consistently outperforms kmer-SVM on human ENCODE ChIP-seq datasets, and further demonstrate the general utility of our method using a Naïve-Bayes classifier. Although developed for regulatory sequence analysis, these methods can be applied to any sequence classification problem.

  7. Atropos: specific, sensitive, and speedy trimming of sequencing reads.

    PubMed

    Didion, John P; Martin, Marcel; Collins, Francis S

    2017-01-01

    A key step in the transformation of raw sequencing reads into biological insights is the trimming of adapter sequences and low-quality bases. Read trimming has been shown to increase the quality and reliability while decreasing the computational requirements of downstream analyses. Many read trimming software tools are available; however, no tool simultaneously provides the accuracy, computational efficiency, and feature set required to handle the types and volumes of data generated in modern sequencing-based experiments. Here we introduce Atropos and show that it trims reads with high sensitivity and specificity while maintaining leading-edge speed. Compared to other state-of-the-art read trimming tools, Atropos achieves significant increases in trimming accuracy while remaining competitive in execution times. Furthermore, Atropos maintains high accuracy even when trimming data with elevated rates of sequencing errors. The accuracy, high performance, and broad feature set offered by Atropos makes it an appropriate choice for the pre-processing of Illumina, ABI SOLiD, and other current-generation short-read sequencing datasets. Atropos is open source and free software written in Python (3.3+) and available at https://github.com/jdidion/atropos.

  8. Atropos: specific, sensitive, and speedy trimming of sequencing reads

    PubMed Central

    Collins, Francis S.

    2017-01-01

    A key step in the transformation of raw sequencing reads into biological insights is the trimming of adapter sequences and low-quality bases. Read trimming has been shown to increase the quality and reliability while decreasing the computational requirements of downstream analyses. Many read trimming software tools are available; however, no tool simultaneously provides the accuracy, computational efficiency, and feature set required to handle the types and volumes of data generated in modern sequencing-based experiments. Here we introduce Atropos and show that it trims reads with high sensitivity and specificity while maintaining leading-edge speed. Compared to other state-of-the-art read trimming tools, Atropos achieves significant increases in trimming accuracy while remaining competitive in execution times. Furthermore, Atropos maintains high accuracy even when trimming data with elevated rates of sequencing errors. The accuracy, high performance, and broad feature set offered by Atropos makes it an appropriate choice for the pre-processing of Illumina, ABI SOLiD, and other current-generation short-read sequencing datasets. Atropos is open source and free software written in Python (3.3+) and available at https://github.com/jdidion/atropos. PMID:28875074

  9. Chromosomal location and gene paucity of the male specific region on papaya Y chromosome.

    PubMed

    Yu, Qingyi; Hou, Shaobin; Hobza, Roman; Feltus, F Alex; Wang, Xiue; Jin, Weiwei; Skelton, Rachel L; Blas, Andrea; Lemke, Cornelia; Saw, Jimmy H; Moore, Paul H; Alam, Maqsudul; Jiang, Jiming; Paterson, Andrew H; Vyskot, Boris; Ming, Ray

    2007-08-01

    Sex chromosomes in flowering plants evolved recently and many of them remain homomorphic, including those in papaya. We investigated the chromosomal location of papaya's small male specific region of the hermaphrodite Y (Yh) chromosome (MSY) and its genomic features. We conducted chromosome fluorescence in situ hybridization mapping of Yh-specific bacterial artificial chromosomes (BACs) and placed the MSY near the centromere of the papaya Y chromosome. Then we sequenced five MSY BACs to examine the genomic features of this specialized region, which resulted in the largest collection of contiguous genomic DNA sequences of a Y chromosome in flowering plants. Extreme gene paucity was observed in the papaya MSY with no functional gene identified in 715 kb MSY sequences. A high density of retroelements and local sequence duplications were detected in the MSY that is suppressed for recombination. Location of the papaya MSY near the centromere might have provided recombination suppression and fostered paucity of genes in the male specific region of the Y chromosome. Our findings provide critical information for deciphering the sex chromosomes in papaya and reference information for comparative studies of other sex chromosomes in animals and plants.

  10. [A family of short retroposons (Squaml) from squamate reptiles (Reptilia: Squamata): structure, evolution and correlation with phylogeny].

    PubMed

    Kosushkin, S A; Borodulina, O R; Solov'eva, E N; Grechko, V V

    2008-01-01

    We have isolated and characterised sequences of a SINE family specific for squamate reptiles from a genome of lacertid lizard that we called Squam1. Copies are 360-390 bp in length and share a significant similarity with tRNA gene sequence on its 5'-end. This family was also detected by us in DNA of representatives of varanids, iguanids (anolis), gekkonids, and snakes. No signs of it were found in DNA of mammals, birds, amphibians, and crocodiles. Detailed analysis of primary structure of the retroposons obtained by us from genomic libraries or GenBank sequences was carried out. Most taxa possess 2-3 subfamilies of the SINE in their genomes with specific diagnostic features in their primary structure. Individual variability of copies in different families is about 85% and is just slightly lower on the genera level. Comparison of consensus sequences on family level reveals a high degree of structural similarity with a number of specific apomorphic features which makes it a useful marker of phylogeny for this group of reptiles. Snakes do not show specific affinity to varanids when compared to other lizards, as it was suggested earlier.

  11. Horseradish peroxidase-labeled oligonucleotides and fluorescent tyramides for rapid detection of chromosome-specific repeat sequences.

    PubMed

    van Gijlswijk, R P; Wiegant, J; Vervenne, R; Lasan, R; Tanke, H J; Raap, A K

    1996-01-01

    We present a sensitive and rapid fluorescence in situ hybridization (FISH) strategy for detecting chromosome-specific repeat sequences. It uses horseradish peroxidase (HRP)-labeled oligonucleotide sequences in combination with fluorescent tyramide-based detection. After in situ hybridization, the HRP conjugated to the oligonucleotide probe is used to deposit fluorescently labeled tyramide molecules at the site of hybridization. The method features full chemical synthesis of probes, strong FISH signals, and short processing periods, as well as multicolor capabilities.

  12. An RNAi-Enhanced Logic Circuit for Cancer Specific Detection and Destruction

    DTIC Science & Technology

    2013-02-01

    monomeric protein secreted by Corynebacterium diphtheriae, and pro-apoptotic members of Bcl-2 family: mBax (Mus musculus), hBax ( Homo sapiens ), and its...Gata3 mStaple. Intron- feature sequences – donor site, branch point, poly- pyrimidine tract, and acceptor site – were selected based on previously...sequences found in literature our intron features were chosen according SplicePort [4], an online analyzer that detects the likelihood of splicing to

  13. The PRC2-binding long non-coding RNAs in human and mouse genomes are associated with predictive sequence features

    NASA Astrophysics Data System (ADS)

    Tu, Shiqi; Yuan, Guo-Cheng; Shao, Zhen

    2017-01-01

    Recently, long non-coding RNAs (lncRNAs) have emerged as an important class of molecules involved in many cellular processes. One of their primary functions is to shape epigenetic landscape through interactions with chromatin modifying proteins. However, mechanisms contributing to the specificity of such interactions remain poorly understood. Here we took the human and mouse lncRNAs that were experimentally determined to have physical interactions with Polycomb repressive complex 2 (PRC2), and systematically investigated the sequence features of these lncRNAs by developing a new computational pipeline for sequences composition analysis, in which each sequence is considered as a series of transitions between adjacent nucleotides. Through that, PRC2-binding lncRNAs were found to be associated with a set of distinctive and evolutionarily conserved sequence features, which can be utilized to distinguish them from the others with considerable accuracy. We further identified fragments of PRC2-binding lncRNAs that are enriched with these sequence features, and found they show strong PRC2-binding signals and are more highly conserved across species than the other parts, implying their functional importance.

  14. Efficient Feature Selection and Classification of Protein Sequence Data in Bioinformatics

    PubMed Central

    Faye, Ibrahima; Samir, Brahim Belhaouari; Md Said, Abas

    2014-01-01

    Bioinformatics has been an emerging area of research for the last three decades. The ultimate aims of bioinformatics were to store and manage the biological data, and develop and analyze computational tools to enhance their understanding. The size of data accumulated under various sequencing projects is increasing exponentially, which presents difficulties for the experimental methods. To reduce the gap between newly sequenced protein and proteins with known functions, many computational techniques involving classification and clustering algorithms were proposed in the past. The classification of protein sequences into existing superfamilies is helpful in predicting the structure and function of large amount of newly discovered proteins. The existing classification results are unsatisfactory due to a huge size of features obtained through various feature encoding methods. In this work, a statistical metric-based feature selection technique has been proposed in order to reduce the size of the extracted feature vector. The proposed method of protein classification shows significant improvement in terms of performance measure metrics: accuracy, sensitivity, specificity, recall, F-measure, and so forth. PMID:25045727

  15. SeqDepot: streamlined database of biological sequences and precomputed features.

    PubMed

    Ulrich, Luke E; Zhulin, Igor B

    2014-01-15

    Assembling and/or producing integrated knowledge of sequence features continues to be an onerous and redundant task despite a large number of existing resources. We have developed SeqDepot-a novel database that focuses solely on two primary goals: (i) assimilating known primary sequences with predicted feature data and (ii) providing the most simple and straightforward means to procure and readily use this information. Access to >28.5 million sequences and 300 million features is provided through a well-documented and flexible RESTful interface that supports fetching specific data subsets, bulk queries, visualization and searching by MD5 digests or external database identifiers. We have also developed an HTML5/JavaScript web application exemplifying how to interact with SeqDepot and Perl/Python scripts for use with local processing pipelines. Freely available on the web at http://seqdepot.net/. RESTaccess via http://seqdepot.net/api/v1. Database files and scripts maybe downloaded from http://seqdepot.net/download.

  16. Sequence-specific bias correction for RNA-seq data using recurrent neural networks.

    PubMed

    Zhang, Yao-Zhong; Yamaguchi, Rui; Imoto, Seiya; Miyano, Satoru

    2017-01-25

    The recent success of deep learning techniques in machine learning and artificial intelligence has stimulated a great deal of interest among bioinformaticians, who now wish to bring the power of deep learning to bare on a host of bioinformatical problems. Deep learning is ideally suited for biological problems that require automatic or hierarchical feature representation for biological data when prior knowledge is limited. In this work, we address the sequence-specific bias correction problem for RNA-seq data redusing Recurrent Neural Networks (RNNs) to model nucleotide sequences without pre-determining sequence structures. The sequence-specific bias of a read is then calculated based on the sequence probabilities estimated by RNNs, and used in the estimation of gene abundance. We explore the application of two popular RNN recurrent units for this task and demonstrate that RNN-based approaches provide a flexible way to model nucleotide sequences without knowledge of predetermined sequence structures. Our experiments show that training a RNN-based nucleotide sequence model is efficient and RNN-based bias correction methods compare well with the-state-of-the-art sequence-specific bias correction method on the commonly used MAQC-III data set. RNNs provides an alternative and flexible way to calculate sequence-specific bias without explicitly pre-determining sequence structures.

  17. ΔN-P63α and TA-P63α exhibit intrinsic differences in transactivation specificities that depend on distinct features of DNA target sites

    PubMed Central

    Foggetti, Giorgia; Raimondi, Ivan; Campomenosi, Paola; Menichini, Paola

    2014-01-01

    TP63 is a member of the TP53 gene family that encodes for up to ten different TA and ΔN isoforms through alternative promoter usage and alternative splicing. Besides being a master regulator of gene expression for squamous epithelial proliferation, differentiation and maintenance, P63, through differential expression of its isoforms, plays important roles in tumorigenesis. All P63 isoforms share an immunoglobulin-like folded DNA binding domain responsible for binding to sequence-specific response elements (REs), whose overall consensus sequence is similar to that of the canonical p53 RE. Using a defined assay in yeast, where P63 isoforms and RE sequences are the only variables, and gene expression assays in human cell lines, we demonstrated that human TA- and ΔN-P63α proteins exhibited differences in transactivation specificity not observed with the corresponding P73 or P53 protein isoforms. These differences 1) were dependent on specific features of the RE sequence, 2) could be related to intrinsic differences in their oligomeric state and cooperative DNA binding, and 3) appeared to be conserved in evolution. Since genotoxic stress can change relative ratio of TA- and ΔN-P63α protein levels, the different transactivation specificity of each P63 isoform could potentially influence cellular responses to specific stresses. PMID:24926492

  18. Visualization of protein sequence features using JavaScript and SVG with pViz.js.

    PubMed

    Mukhyala, Kiran; Masselot, Alexandre

    2014-12-01

    pViz.js is a visualization library for displaying protein sequence features in a Web browser. By simply providing a sequence and the locations of its features, this lightweight, yet versatile, JavaScript library renders an interactive view of the protein features. Interactive exploration of protein sequence features over the Web is a common need in Bioinformatics. Although many Web sites have developed viewers to display these features, their implementations are usually focused on data from a specific source or use case. Some of these viewers can be adapted to fit other use cases but are not designed to be reusable. pViz makes it easy to display features as boxes aligned to a protein sequence with zooming functionality but also includes predefined renderings for secondary structure and post-translational modifications. The library is designed to further customize this view. We demonstrate such applications of pViz using two examples: a proteomic data visualization tool with an embedded viewer for displaying features on protein structure, and a tool to visualize the results of the variant_effect_predictor tool from Ensembl. pViz.js is a JavaScript library, available on github at https://github.com/Genentech/pviz. This site includes examples and functional applications, installation instructions and usage documentation. A Readme file, which explains how to use pViz with examples, is available as Supplementary Material A. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  19. Quick and clean cloning.

    PubMed

    Thieme, Frank; Marillonnet, Sylvestre

    2014-01-01

    Identification of unknown sequences that flank known sequences of interest requires PCR amplification of DNA fragments that contain the junction between the known and unknown flanking sequences. Since amplified products often contain a mixture of specific and nonspecific products, the quick and clean (QC) cloning procedure was developed to clone specific products only. QC cloning is a ligation-independent cloning procedure that relies on the exonuclease activity of T4 DNA polymerase to generate single-stranded extensions at the ends of the vector and insert. A specific feature of QC cloning is the use of vectors that contain a sequence called catching sequence that allows cloning specific products only. QC cloning is performed by a one-pot incubation of insert and vector in the presence of T4 DNA polymerase at room temperature for 10 min followed by direct transformation of the incubation mix in chemo-competent Escherichia coli cells.

  20. Gene discovery in Eimeria tenella by immunoscreening cDNA expression libraries of sporozoites and schizonts with chicken intestinal antibodies.

    PubMed

    Réfega, Susana; Girard-Misguich, Fabienne; Bourdieu, Christiane; Péry, Pierre; Labbé, Marie

    2003-04-02

    Specific antibodies were produced ex vivo from intestinal culture of Eimeria tenella infected chickens. The specificity of these intestinal antibodies was tested against different parasite stages. These antibodies were used to immunoscreen first generation schizont and sporozoite cDNA libraries permitting the identification of new E. tenella antigens. We obtained a total of 119 cDNA clones which were subjected to sequence analysis. The sequences coding for the proteins inducing local immune responses were compared with nucleotide or protein databases and with expressed sequence tags (ESTs) databases. We identified new Eimeria genes coding for heat shock proteins, a ribosomal protein, a pyruvate kinase and a pyridoxine kinase. Specific features of other sequences are discussed.

  1. Robust k-mer frequency estimation using gapped k-mers

    PubMed Central

    Ghandi, Mahmoud; Mohammad-Noori, Morteza

    2013-01-01

    Oligomers of fixed length, k, commonly known as k-mers, are often used as fundamental elements in the description of DNA sequence features of diverse biological function, or as intermediate elements in the constuction of more complex descriptors of sequence features such as position weight matrices. k-mers are very useful as general sequence features because they constitute a complete and unbiased feature set, and do not require parameterization based on incomplete knowledge of biological mechanisms. However, a fundamental limitation in the use of k-mers as sequence features is that as k is increased, larger spatial correlations in DNA sequence elements can be described, but the frequency of observing any specific k-mer becomes very small, and rapidly approaches a sparse matrix of binary counts. Thus any statistical learning approach using k-mers will be susceptible to noisy estimation of k-mer frequencies once k becomes large. Because all molecular DNA interactions have limited spatial extent, gapped k-mers often carry the relevant biological signal. Here we use gapped k-mer counts to more robustly estimate the ungapped k-mer frequencies, by deriving an equation for the minimum norm estimate of k-mer frequencies given an observed set of gapped k-mer frequencies. We demonstrate that this approach provides a more accurate estimate of the k-mer frequencies in real biological sequences using a sample of CTCF binding sites in the human genome. PMID:23861010

  2. Robust k-mer frequency estimation using gapped k-mers.

    PubMed

    Ghandi, Mahmoud; Mohammad-Noori, Morteza; Beer, Michael A

    2014-08-01

    Oligomers of fixed length, k, commonly known as k-mers, are often used as fundamental elements in the description of DNA sequence features of diverse biological function, or as intermediate elements in the constuction of more complex descriptors of sequence features such as position weight matrices. k-mers are very useful as general sequence features because they constitute a complete and unbiased feature set, and do not require parameterization based on incomplete knowledge of biological mechanisms. However, a fundamental limitation in the use of k-mers as sequence features is that as k is increased, larger spatial correlations in DNA sequence elements can be described, but the frequency of observing any specific k-mer becomes very small, and rapidly approaches a sparse matrix of binary counts. Thus any statistical learning approach using k-mers will be susceptible to noisy estimation of k-mer frequencies once k becomes large. Because all molecular DNA interactions have limited spatial extent, gapped k-mers often carry the relevant biological signal. Here we use gapped k-mer counts to more robustly estimate the ungapped k-mer frequencies, by deriving an equation for the minimum norm estimate of k-mer frequencies given an observed set of gapped k-mer frequencies. We demonstrate that this approach provides a more accurate estimate of the k-mer frequencies in real biological sequences using a sample of CTCF binding sites in the human genome.

  3. The Genome sequences of four non-human/non-clinical Salmonella enterica serovar Kentucky ST198 isolates recovered between 1972 and 1973

    USDA-ARS?s Scientific Manuscript database

    Salmonella Kentucky is a polyphyletic member of S. enterica subclade A1 with multiple sequence types that often colonize the same hosts but in different frequencies on different continents. To evaluate the genomic features involved in S. Kentucky host specificity we sequenced the genomes of four iso...

  4. Prospecting Biotechnologically-Relevant Monooxygenases from Cold Sediment Metagenomes: An In Silico Approach

    DOE PAGES

    Musumeci, Matias A.; Lozada, Mariana; Rial, Daniela V.; ...

    2017-04-09

    The goal of this work was to identify sequences encoding monooxygenase biocatalysts with novel features by in silico mining an assembled metagenomic dataset of polar and subpolar marine sediments. The targeted enzyme sequences were Baeyer-Villiger and bacterial cytochrome P450 monooxygenases (CYP153). These enzymes have wide-ranging applications, from the synthesis of steroids, antibiotics, mycotoxins and pheromones to the synthesis of monomers for polymerization and anticancer precursors, due to their extraordinary enantio-, regio-, and chemo- selectivity that are valuable features for organic synthesis. Phylogenetic analyses were used to select the most divergent sequences affiliated to these enzyme families among the 264 putativemore » monooxygenases recovered from the ~14 million protein-coding sequences in the assembled metagenome dataset. Three-dimensional structure modeling and docking analysis suggested features useful in biotechnological applications in five metagenomic sequences, such as wide substrate range, novel substrate specificity or regioselectivity. Further analysis revealed structural features associated with psychrophilic enzymes, such as broader substrate accessibility, larger catalytic pockets or low domain interactions, suggesting that they could be applied in biooxidations at room or low temperatures, saving costs inherent to energy consumption. As a result, this work allowed the identification of putative enzyme candidates with promising features from metagenomes, providing a suitable starting point for further developments.« less

  5. Prospecting Biotechnologically-Relevant Monooxygenases from Cold Sediment Metagenomes: An In Silico Approach

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Musumeci, Matias A.; Lozada, Mariana; Rial, Daniela V.

    The goal of this work was to identify sequences encoding monooxygenase biocatalysts with novel features by in silico mining an assembled metagenomic dataset of polar and subpolar marine sediments. The targeted enzyme sequences were Baeyer-Villiger and bacterial cytochrome P450 monooxygenases (CYP153). These enzymes have wide-ranging applications, from the synthesis of steroids, antibiotics, mycotoxins and pheromones to the synthesis of monomers for polymerization and anticancer precursors, due to their extraordinary enantio-, regio-, and chemo- selectivity that are valuable features for organic synthesis. Phylogenetic analyses were used to select the most divergent sequences affiliated to these enzyme families among the 264 putativemore » monooxygenases recovered from the ~14 million protein-coding sequences in the assembled metagenome dataset. Three-dimensional structure modeling and docking analysis suggested features useful in biotechnological applications in five metagenomic sequences, such as wide substrate range, novel substrate specificity or regioselectivity. Further analysis revealed structural features associated with psychrophilic enzymes, such as broader substrate accessibility, larger catalytic pockets or low domain interactions, suggesting that they could be applied in biooxidations at room or low temperatures, saving costs inherent to energy consumption. As a result, this work allowed the identification of putative enzyme candidates with promising features from metagenomes, providing a suitable starting point for further developments.« less

  6. Prospecting Biotechnologically-Relevant Monooxygenases from Cold Sediment Metagenomes: An In Silico Approach.

    PubMed

    Musumeci, Matías A; Lozada, Mariana; Rial, Daniela V; Mac Cormack, Walter P; Jansson, Janet K; Sjöling, Sara; Carroll, JoLynn; Dionisi, Hebe M

    2017-04-09

    The goal of this work was to identify sequences encoding monooxygenase biocatalysts with novel features by in silico mining an assembled metagenomic dataset of polar and subpolar marine sediments. The targeted enzyme sequences were Baeyer-Villiger and bacterial cytochrome P450 monooxygenases (CYP153). These enzymes have wide-ranging applications, from the synthesis of steroids, antibiotics, mycotoxins and pheromones to the synthesis of monomers for polymerization and anticancer precursors, due to their extraordinary enantio-, regio-, and chemo- selectivity that are valuable features for organic synthesis. Phylogenetic analyses were used to select the most divergent sequences affiliated to these enzyme families among the 264 putative monooxygenases recovered from the ~14 million protein-coding sequences in the assembled metagenome dataset. Three-dimensional structure modeling and docking analysis suggested features useful in biotechnological applications in five metagenomic sequences, such as wide substrate range, novel substrate specificity or regioselectivity. Further analysis revealed structural features associated with psychrophilic enzymes, such as broader substrate accessibility, larger catalytic pockets or low domain interactions, suggesting that they could be applied in biooxidations at room or low temperatures, saving costs inherent to energy consumption. This work allowed the identification of putative enzyme candidates with promising features from metagenomes, providing a suitable starting point for further developments.

  7. Prospecting Biotechnologically-Relevant Monooxygenases from Cold Sediment Metagenomes: An In Silico Approach

    PubMed Central

    Musumeci, Matías A.; Lozada, Mariana; Rial, Daniela V.; Mac Cormack, Walter P.; Jansson, Janet K.; Sjöling, Sara; Carroll, JoLynn; Dionisi, Hebe M.

    2017-01-01

    The goal of this work was to identify sequences encoding monooxygenase biocatalysts with novel features by in silico mining an assembled metagenomic dataset of polar and subpolar marine sediments. The targeted enzyme sequences were Baeyer–Villiger and bacterial cytochrome P450 monooxygenases (CYP153). These enzymes have wide-ranging applications, from the synthesis of steroids, antibiotics, mycotoxins and pheromones to the synthesis of monomers for polymerization and anticancer precursors, due to their extraordinary enantio-, regio-, and chemo- selectivity that are valuable features for organic synthesis. Phylogenetic analyses were used to select the most divergent sequences affiliated to these enzyme families among the 264 putative monooxygenases recovered from the ~14 million protein-coding sequences in the assembled metagenome dataset. Three-dimensional structure modeling and docking analysis suggested features useful in biotechnological applications in five metagenomic sequences, such as wide substrate range, novel substrate specificity or regioselectivity. Further analysis revealed structural features associated with psychrophilic enzymes, such as broader substrate accessibility, larger catalytic pockets or low domain interactions, suggesting that they could be applied in biooxidations at room or low temperatures, saving costs inherent to energy consumption. This work allowed the identification of putative enzyme candidates with promising features from metagenomes, providing a suitable starting point for further developments. PMID:28397770

  8. An RNAi-enhanced Logic Circuit for Cancer Specific Detection and Destruction

    DTIC Science & Technology

    2010-07-01

    Bcl-2 family: mBax (Mus musculus), hBax ( Homo sapiens ), and its mutant hBax-S184A [4]. A plasmid containing the tested gene was transfected into HEK...the far-red fluorescent protein mKate to express the Gata3 mStaple. Intron- feature sequences – donor site, branch point, poly- pyrimidine tract, and...intron-exon junction. Among the donor and acceptor sequences found in literature our intron features were chosen according SplicePort [5], an

  9. Effective Feature Selection for Classification of Promoter Sequences.

    PubMed

    K, Kouser; P G, Lavanya; Rangarajan, Lalitha; K, Acharya Kshitish

    2016-01-01

    Exploring novel computational methods in making sense of biological data has not only been a necessity, but also productive. A part of this trend is the search for more efficient in silico methods/tools for analysis of promoters, which are parts of DNA sequences that are involved in regulation of expression of genes into other functional molecules. Promoter regions vary greatly in their function based on the sequence of nucleotides and the arrangement of protein-binding short-regions called motifs. In fact, the regulatory nature of the promoters seems to be largely driven by the selective presence and/or the arrangement of these motifs. Here, we explore computational classification of promoter sequences based on the pattern of motif distributions, as such classification can pave a new way of functional analysis of promoters and to discover the functionally crucial motifs. We make use of Position Specific Motif Matrix (PSMM) features for exploring the possibility of accurately classifying promoter sequences using some of the popular classification techniques. The classification results on the complete feature set are low, perhaps due to the huge number of features. We propose two ways of reducing features. Our test results show improvement in the classification output after the reduction of features. The results also show that decision trees outperform SVM (Support Vector Machine), KNN (K Nearest Neighbor) and ensemble classifier LibD3C, particularly with reduced features. The proposed feature selection methods outperform some of the popular feature transformation methods such as PCA and SVD. Also, the methods proposed are as accurate as MRMR (feature selection method) but much faster than MRMR. Such methods could be useful to categorize new promoters and explore regulatory mechanisms of gene expressions in complex eukaryotic species.

  10. Anonymization of electronic medical records for validating genome-wide association studies

    PubMed Central

    Loukides, Grigorios; Gkoulalas-Divanis, Aris; Malin, Bradley

    2010-01-01

    Genome-wide association studies (GWAS) facilitate the discovery of genotype–phenotype relations from population-based sequence databases, which is an integral facet of personalized medicine. The increasing adoption of electronic medical records allows large amounts of patients’ standardized clinical features to be combined with the genomic sequences of these patients and shared to support validation of GWAS findings and to enable novel discoveries. However, disseminating these data “as is” may lead to patient reidentification when genomic sequences are linked to resources that contain the corresponding patients’ identity information based on standardized clinical features. This work proposes an approach that provably prevents this type of data linkage and furnishes a result that helps support GWAS. Our approach automatically extracts potentially linkable clinical features and modifies them in a way that they can no longer be used to link a genomic sequence to a small number of patients, while preserving the associations between genomic sequences and specific sets of clinical features corresponding to GWAS-related diseases. Extensive experiments with real patient data derived from the Vanderbilt's University Medical Center verify that our approach generates data that eliminate the threat of individual reidentification, while supporting GWAS validation and clinical case analysis tasks. PMID:20385806

  11. Regulatory sequence analysis tools.

    PubMed

    van Helden, Jacques

    2003-07-01

    The web resource Regulatory Sequence Analysis Tools (RSAT) (http://rsat.ulb.ac.be/rsat) offers a collection of software tools dedicated to the prediction of regulatory sites in non-coding DNA sequences. These tools include sequence retrieval, pattern discovery, pattern matching, genome-scale pattern matching, feature-map drawing, random sequence generation and other utilities. Alternative formats are supported for the representation of regulatory motifs (strings or position-specific scoring matrices) and several algorithms are proposed for pattern discovery. RSAT currently holds >100 fully sequenced genomes and these data are regularly updated from GenBank.

  12. Sequences Associated with Centromere Competency in the Human Genome

    PubMed Central

    Hayden, Karen E.; Strome, Erin D.; Merrett, Stephanie L.; Lee, Hye-Ran; Rudd, M. Katharine

    2013-01-01

    Centromeres, the sites of spindle attachment during mitosis and meiosis, are located in specific positions in the human genome, normally coincident with diverse subsets of alpha satellite DNA. While there is strong evidence supporting the association of some subfamilies of alpha satellite with centromere function, the basis for establishing whether a given alpha satellite sequence is or is not designated a functional centromere is unknown, and attempts to understand the role of particular sequence features in establishing centromere identity have been limited by the near identity and repetitive nature of satellite sequences. Utilizing a broadly applicable experimental approach to test sequence competency for centromere specification, we have carried out a genomic and epigenetic functional analysis of endogenous human centromere sequences available in the current human genome assembly. The data support a model in which functionally competent sequences confer an opportunity for centromere specification, integrating genomic and epigenetic signals and promoting the concept of context-dependent centromere inheritance. PMID:23230266

  13. Representation of DNA sequences in genetic codon context with applications in exon and intron prediction.

    PubMed

    Yin, Changchuan

    2015-04-01

    To apply digital signal processing (DSP) methods to analyze DNA sequences, the sequences first must be specially mapped into numerical sequences. Thus, effective numerical mappings of DNA sequences play key roles in the effectiveness of DSP-based methods such as exon prediction. Despite numerous mappings of symbolic DNA sequences to numerical series, the existing mapping methods do not include the genetic coding features of DNA sequences. We present a novel numerical representation of DNA sequences using genetic codon context (GCC) in which the numerical values are optimized by simulation annealing to maximize the 3-periodicity signal to noise ratio (SNR). The optimized GCC representation is then applied in exon and intron prediction by Short-Time Fourier Transform (STFT) approach. The results show the GCC method enhances the SNR values of exon sequences and thus increases the accuracy of predicting protein coding regions in genomes compared with the commonly used 4D binary representation. In addition, this study offers a novel way to reveal specific features of DNA sequences by optimizing numerical mappings of symbolic DNA sequences.

  14. MuSE: accounting for tumor heterogeneity using a sample-specific error model improves sensitivity and specificity in mutation calling from sequencing data.

    PubMed

    Fan, Yu; Xi, Liu; Hughes, Daniel S T; Zhang, Jianjun; Zhang, Jianhua; Futreal, P Andrew; Wheeler, David A; Wang, Wenyi

    2016-08-24

    Subclonal mutations reveal important features of the genetic architecture of tumors. However, accurate detection of mutations in genetically heterogeneous tumor cell populations using next-generation sequencing remains challenging. We develop MuSE ( http://bioinformatics.mdanderson.org/main/MuSE ), Mutation calling using a Markov Substitution model for Evolution, a novel approach for modeling the evolution of the allelic composition of the tumor and normal tissue at each reference base. MuSE adopts a sample-specific error model that reflects the underlying tumor heterogeneity to greatly improve the overall accuracy. We demonstrate the accuracy of MuSE in calling subclonal mutations in the context of large-scale tumor sequencing projects using whole exome and whole genome sequencing.

  15. Sequence Bundles: a novel method for visualising, discovering and exploring sequence motifs

    PubMed Central

    2014-01-01

    Background We introduce Sequence Bundles--a novel data visualisation method for representing multiple sequence alignments (MSAs). We identify and address key limitations of the existing bioinformatics data visualisation methods (i.e. the Sequence Logo) by enabling Sequence Bundles to give salient visual expression to sequence motifs and other data features, which would otherwise remain hidden. Methods For the development of Sequence Bundles we employed research-led information design methodologies. Sequences are encoded as uninterrupted, semi-opaque lines plotted on a 2-dimensional reconfigurable grid. Each line represents a single sequence. The thickness and opacity of the stack at each residue in each position indicates the level of conservation and the lines' curved paths expose patterns in correlation and functionality. Several MSAs can be visualised in a composite image. The Sequence Bundles method is designed to favour a tangible, continuous and intuitive display of information. Results We have developed a software demonstration application for generating a Sequence Bundles visualisation of MSAs provided for the BioVis 2013 redesign contest. A subsequent exploration of the visualised line patterns allowed for the discovery of a number of interesting features in the dataset. Reported features include the extreme conservation of sequences displaying a specific residue and bifurcations of the consensus sequence. Conclusions Sequence Bundles is a novel method for visualisation of MSAs and the discovery of sequence motifs. It can aid in generating new insight and hypothesis making. Sequence Bundles is well disposed for future implementation as an interactive visual analytics software, which can complement existing visualisation tools. PMID:25237395

  16. Templated sequence insertion polymorphisms in the human genome

    NASA Astrophysics Data System (ADS)

    Onozawa, Masahiro; Aplan, Peter

    2016-11-01

    Templated Sequence Insertion Polymorphism (TSIP) is a recently described form of polymorphism recognized in the human genome, in which a sequence that is templated from a distant genomic region is inserted into the genome, seemingly at random. TSIPs can be grouped into two classes based on nucleotide sequence features at the insertion junctions; Class 1 TSIPs show features of insertions that are mediated via the LINE-1 ORF2 protein, including 1) target-site duplication (TSD), 2) polyadenylation 10-30 nucleotides downstream of a “cryptic” polyadenylation signal, and 3) preference for insertion at a 5’-TTTT/A-3’ sequence. In contrast, class 2 TSIPs show features consistent with repair of a DNA double-strand break via insertion of a DNA “patch” that is derived from a distant genomic region. Survey of a large number of normal human volunteers demonstrates that most individuals have 25-30 TSIPs, and that these TSIPs track with specific geographic regions. Similar to other forms of human polymorphism, we suspect that these TSIPs may be important for the generation of human diversity and genetic diseases.

  17. Near-Term Fetuses Process Temporal Features of Speech

    ERIC Educational Resources Information Center

    Granier-Deferre, Carolyn; Ribeiro, Aurelie; Jacquet, Anne-Yvonne; Bassereau, Sophie

    2011-01-01

    The perception of speech and music requires processing of variations in spectra and amplitude over different time intervals. Near-term fetuses can discriminate acoustic features, such as frequencies and spectra, but whether they can process complex auditory streams, such as speech sequences and more specifically their temporal variations, fast or…

  18. Distribution and Features of the Six Classes of Peroxiredoxins

    PubMed Central

    Poole, Leslie B.; Nelson, Kimberly J.

    2016-01-01

    Peroxiredoxins are cysteine-dependent peroxide reductases that group into 6 different, structurally discernable classes. In 2011, our research team reported the application of a bioinformatic approach called active site profiling to extract active site-proximal sequence segments from the 29 distinct, structurally-characterized peroxiredoxins available at the time. These extracted sequences were then used to create unique profiles for the six groups which were subsequently used to search GenBank(nr), allowing identification of ∼3500 peroxiredoxin sequences and their respective subgroups. Summarized in this minireview are the features and phylogenetic distributions of each of these peroxiredoxin subgroups; an example is also provided illustrating the use of the web accessible, searchable database known as PREX to identify subfamily-specific peroxiredoxin sequences for the organism Vitis vinifera (grape). PMID:26810075

  19. Influenza virus sequence feature variant type analysis: evidence of a role for NS1 in influenza virus host range restriction.

    PubMed

    Noronha, Jyothi M; Liu, Mengya; Squires, R Burke; Pickett, Brett E; Hale, Benjamin G; Air, Gillian M; Galloway, Summer E; Takimoto, Toru; Schmolke, Mirco; Hunt, Victoria; Klem, Edward; García-Sastre, Adolfo; McGee, Monnie; Scheuermann, Richard H

    2012-05-01

    Genetic drift of influenza virus genomic sequences occurs through the combined effects of sequence alterations introduced by a low-fidelity polymerase and the varying selective pressures experienced as the virus migrates through different host environments. While traditional phylogenetic analysis is useful in tracking the evolutionary heritage of these viruses, the specific genetic determinants that dictate important phenotypic characteristics are often difficult to discern within the complex genetic background arising through evolution. Here we describe a novel influenza virus sequence feature variant type (Flu-SFVT) approach, made available through the public Influenza Research Database resource (www.fludb.org), in which variant types (VTs) identified in defined influenza virus protein sequence features (SFs) are used for genotype-phenotype association studies. Since SFs have been defined for all influenza virus proteins based on known structural, functional, and immune epitope recognition properties, the Flu-SFVT approach allows the rapid identification of the molecular genetic determinants of important influenza virus characteristics and their connection to underlying biological functions. We demonstrate the use of the SFVT approach to obtain statistical evidence for effects of NS1 protein sequence variations in dictating influenza virus host range restriction.

  20. Protecting genomic sequence anonymity with generalization lattices.

    PubMed

    Malin, B A

    2005-01-01

    Current genomic privacy technologies assume the identity of genomic sequence data is protected if personal information, such as demographics, are obscured, removed, or encrypted. While demographic features can directly compromise an individual's identity, recent research demonstrates such protections are insufficient because sequence data itself is susceptible to re-identification. To counteract this problem, we introduce an algorithm for anonymizing a collection of person-specific DNA sequences. The technique is termed DNA lattice anonymization (DNALA), and is based upon the formal privacy protection schema of k -anonymity. Under this model, it is impossible to observe or learn features that distinguish one genetic sequence from k-1 other entries in a collection. To maximize information retained in protected sequences, we incorporate a concept generalization lattice to learn the distance between two residues in a single nucleotide region. The lattice provides the most similar generalized concept for two residues (e.g. adenine and guanine are both purines). The method is tested and evaluated with several publicly available human population datasets ranging in size from 30 to 400 sequences. Our findings imply the anonymization schema is feasible for the protection of sequences privacy. The DNALA method is the first computational disclosure control technique for general DNA sequences. Given the computational nature of the method, guarantees of anonymity can be formally proven. There is room for improvement and validation, though this research provides the groundwork from which future researchers can construct genomics anonymization schemas tailored to specific datasharing scenarios.

  1. Algorithm, applications and evaluation for protein comparison by Ramanujan Fourier transform.

    PubMed

    Zhao, Jian; Wang, Jiasong; Hua, Wei; Ouyang, Pingkai

    2015-12-01

    The amino acid sequence of a protein determines its chemical properties, chain conformation and biological functions. Protein sequence comparison is of great importance to identify similarities of protein structures and infer their functions. Many properties of a protein correspond to the low-frequency signals within the sequence. Low frequency modes in protein sequences are linked to the secondary structures, membrane protein types, and sub-cellular localizations of the proteins. In this paper, we present Ramanujan Fourier transform (RFT) with a fast algorithm to analyze the low-frequency signals of protein sequences. The RFT method is applied to similarity analysis of protein sequences with the Resonant Recognition Model (RRM). The results show that the proposed fast RFT method on protein comparison is more efficient than commonly used discrete Fourier transform (DFT). RFT can detect common frequencies as significant feature for specific protein families, and the RFT spectrum heat-map of protein sequences demonstrates the information conservation in the sequence comparison. The proposed method offers a new tool for pattern recognition, feature extraction and structural analysis on protein sequences. Copyright © 2015 Elsevier Ltd. All rights reserved.

  2. Sequence-based genotyping clarifies conflicting historical morphometric and biological data for 5 Eimeria species infecting turkeys.

    PubMed

    El-Sherry, S; Ogedengbe, M E; Hafeez, M A; Sayf-Al-Din, M; Gad, N; Barta, J R

    2015-02-01

    Unlike with Eimeria species infecting chickens, specific identification and nomenclature of Eimeria species infecting turkeys is complicated, and in the absence of molecular data, imprecise. In an attempt to reconcile contradictory data reported on oocyst morphometrics and biological descriptions of various Eimeria species infecting turkey, we established single oocyst derived lines of 5 important Eimeria species infecting turkeys, Eimeria meleagrimitis (USMN08-01 strain), Eimeria adenoeides (Guelph strain), Eimeria gallopavonis (Weybridge strain), Eimeria meleagridis (USAR97-01 strain), and Eimeria dispersa (Briston strain). Short portions (514 bp) of mitochondrial cytochrome c oxidase subunit I gene (mt COI) from each were amplified and sequenced. Comparison of these sequences showed sufficient species-specific sequence variation to recommend these short mt COI sequences as species-specific markers. Uniformity of oocyst features (dimensions and oocyst structure) of each pure line was observed. Additional morphological features of the oocysts of these species are described as useful for the microscopic differentiation of these Eimeria species. Combined molecular and morphometric data on these single species lines compared with the original species descriptions and more recent data have helped to clarify some confusing, and sometimes conflicting, features associated with these Eimeria spp. For example, these new data suggest that the KCH and KR strains of E. adenoeides reported previously represent 2 distinct species, E. adenoeides and E. meleagridis, respectively. Likewise, analysis of the Weybridge strain of E. adenoeides, which has long been used as a reference strain in various studies conducted on the pathogenicity of E. adenoeides, indicates that this coccidium is actually a strain of E. gallopavonis. We highly recommend mt COI sequence-based genotyping be incorporated into all studies using Eimeria spp. of turkeys to confirm species identifications and so that any resulting data can be associated correctly with a single named Eimeria species. © 2015 Poultry Science Association Inc.

  3. TargetM6A: Identifying N6-Methyladenosine Sites From RNA Sequences via Position-Specific Nucleotide Propensities and a Support Vector Machine.

    PubMed

    Li, Guang-Qing; Liu, Zi; Shen, Hong-Bin; Yu, Dong-Jun

    2016-10-01

    As one of the most ubiquitous post-transcriptional modifications of RNA, N 6 -methyladenosine ( [Formula: see text]) plays an essential role in many vital biological processes. The identification of [Formula: see text] sites in RNAs is significantly important for both basic biomedical research and practical drug development. In this study, we designed a computational-based method, called TargetM6A, to rapidly and accurately target [Formula: see text] sites solely from the primary RNA sequences. Two new features, i.e., position-specific nucleotide/dinucleotide propensities (PSNP/PSDP), are introduced and combined with the traditional nucleotide composition (NC) feature to formulate RNA sequences. The extracted features are further optimized to obtain a much more compact and discriminative feature subset by applying an incremental feature selection (IFS) procedure. Based on the optimized feature subset, we trained TargetM6A on the training dataset with a support vector machine (SVM) as the prediction engine. We compared the proposed TargetM6A method with existing methods for predicting [Formula: see text] sites by performing stringent jackknife tests and independent validation tests on benchmark datasets. The experimental results show that the proposed TargetM6A method outperformed the existing methods for predicting [Formula: see text] sites and remarkably improved the prediction performances, with MCC = 0.526 and AUC = 0.818. We also provided a user-friendly web server for TargetM6A, which is publicly accessible for academic use at http://csbio.njust.edu.cn/bioinf/TargetM6A.

  4. Picture Detection in Rapid Serial Visual Presentation: Features or Identity?

    ERIC Educational Resources Information Center

    Potter, Mary C.; Wyble, Brad; Pandav, Rijuta; Olejarczyk, Jennifer

    2010-01-01

    A pictured object can be readily detected in a rapid serial visual presentation sequence when the target is specified by a superordinate category name such as "animal" or "vehicle". Are category features the initial basis for detection, with identification of the specific object occurring in a second stage (Evans &…

  5. Identification of S-glutathionylation sites in species-specific proteins by incorporating five sequence-derived features into the general pseudo-amino acid composition.

    PubMed

    Zhao, Xiaowei; Ning, Qiao; Ai, Meiyue; Chai, Haiting; Yang, Guifu

    2016-06-07

    As a selective and reversible protein post-translational modification, S-glutathionylation generates mixed disulfides between glutathione (GSH) and cysteine residues, and plays an important role in regulating protein activity, stability, and redox regulation. To fully understand S-glutathionylation mechanisms, identification of substrates and specific S-Glutathionylated sites is crucial. Experimental identification of S-glutathionylated sites is labor-intensive and time consuming, so establishing an effective computational method is much desirable due to their convenient and fast speed. Therefore, in this study, a new bioinformatics tool named SSGlu (Species-Specific identification of Protein S-glutathionylation Sites) was developed to identify species-specific protein S-glutathionylated sites, utilizing support vector machines that combine multiple sequence-derived features with a two-step feature selection. By 5-fold cross validation, the performance of SSGlu was measured with an AUC of 0.8105 and 0.8041 for Homo sapiens and Mus musculus, respectively. Additionally, SSGlu was compared with the existing methods, and the higher MCC and AUC of SSGlu demonstrated that SSGlu was very promising to predict S-glutathionylated sites. Furthermore, a site-specific analysis showed that S-glutathionylation intimately correlated with the features derived from its surrounding sites. The conclusions derived from this study might help to understand more of the S-glutathionylation mechanism and guide the related experimental validation. For public access, SSGlu is freely accessible at http://59.73.198.144:8080/SSGlu/. Copyright © 2016 Elsevier Ltd. All rights reserved.

  6. Proteins without unique 3D structures: biotechnological applications of intrinsically unstable/disordered proteins.

    PubMed

    Uversky, Vladimir N

    2015-03-01

    Intrinsically disordered proteins (IDPs) and intrinsically disordered protein regions (IDPRs) are functional proteins or regions that do not have unique 3D structures under functional conditions. Therefore, from the viewpoint of their lack of stable 3D structure, IDPs/IDPRs are inherently unstable. As much as structure and function of normal ordered globular proteins are determined by their amino acid sequences, the lack of unique 3D structure in IDPs/IDPRs and their disorder-based functionality are also encoded in the amino acid sequences. Because of their specific sequence features and distinctive conformational behavior, these intrinsically unstable proteins or regions have several applications in biotechnology. This review introduces some of the most characteristic features of IDPs/IDPRs (such as peculiarities of amino acid sequences of these proteins and regions, their major structural features, and peculiar responses to changes in their environment) and describes how these features can be used in the biotechnology, for example for the proteome-wide analysis of the abundance of extended IDPs, for recombinant protein isolation and purification, as polypeptide nanoparticles for drug delivery, as solubilization tools, and as thermally sensitive carriers of active peptides and proteins. Copyright © 2014 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  7. Sequence-structural features and evolutionary relationships of family GH57 α-amylases and their putative α-amylase-like homologues.

    PubMed

    Janeček, Stefan; Blesák, Karol

    2011-08-01

    The glycoside hydrolase family 57 (GH57) contains α-amylase and a few other amylolytic specificities. It counts ~400 members from Archaea (1/4) and Bacteria (3/4), mostly of extremophilic prokaryotes. Only 17 GH57 enzymes have been biochemically characterized. The main goal of the present bioinformatics study was to analyze sequences having the clear GH57 α-amylase features. Of the 107 GH57 sequences, 59 were evaluated as α-amylases (containing both GH57 catalytic residues), whereas 48 were assigned as GH57 α-amylase-like proteins (having a substitution in one or both catalytic residues). Forty-eight of 59 α-amylases were from Archaea, but 42 of 48 α-amylase-like proteins were of bacterial origin. The catalytic residues were substituted in most cases in Bacteroides and Prevotella by serine (instead of catalytic nucleophile glutamate) and glutamate (instead of proton donor aspartate). The GH57 α-amylase specificity has thus been evolved and kept enzymatically active mainly in Archaea.

  8. Identification of upstream and intragenic regulatory elements that confer cell-type-restricted and differentiation-specific expression on the muscle creatine kinase gene

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sternberg, E.A.; Spizz, G.; Perry, W.M.

    1988-07-01

    Terminal differentiation of skeletal myobalsts is accompanied by induction of a series of tissue-specific gene products, which includes the muscle isoenzymte of creatine kinase (MCK). To begin to define the sequences and signals involved in MCK regulation in developing muscle cells, the mouse MCK gene has been isolated. Sequence analysis of 4,147 bases of DNA surrounding the transcription initiation site revealed several interesting structural features, some of which are common to other muscle-specific genes and to cellular and viral enhancers.

  9. Assessing the Pathogenicity of Insertion and Deletion Variants with the Variant Effect Scoring Tool (VEST-Indel).

    PubMed

    Douville, Christopher; Masica, David L; Stenson, Peter D; Cooper, David N; Gygax, Derek M; Kim, Rick; Ryan, Michael; Karchin, Rachel

    2016-01-01

    Insertion/deletion variants (indels) alter protein sequence and length, yet are highly prevalent in healthy populations, presenting a challenge to bioinformatics classifiers. Commonly used features--DNA and protein sequence conservation, indel length, and occurrence in repeat regions--are useful for inference of protein damage. However, these features can cause false positives when predicting the impact of indels on disease. Existing methods for indel classification suffer from low specificities, severely limiting clinical utility. Here, we further develop our variant effect scoring tool (VEST) to include the classification of in-frame and frameshift indels (VEST-indel) as pathogenic or benign. We apply 24 features, including a new "PubMed" feature, to estimate a gene's importance in human disease. When compared with four existing indel classifiers, our method achieves a drastically reduced false-positive rate, improving specificity by as much as 90%. This approach of estimating gene importance might be generally applicable to missense and other bioinformatics pathogenicity predictors, which often fail to achieve high specificity. Finally, we tested all possible meta-predictors that can be obtained from combining the four different indel classifiers using Boolean conjunctions and disjunctions, and derived a meta-predictor with improved performance over any individual method. © 2015 The Authors. **Human Mutation published by Wiley Periodicals, Inc.

  10. A Comprehensive Approach to Sequence-oriented IsomiR annotation (CASMIR): demonstration with IsomiR profiling in colorectal neoplasia.

    PubMed

    Wu, Chung Wah; Evans, Jared M; Huang, Shengbing; Mahoney, Douglas W; Dukek, Brian A; Taylor, William R; Yab, Tracy C; Smyrk, Thomas C; Jen, Jin; Kisiel, John B; Ahlquist, David A

    2018-05-25

    MicroRNA (miRNA) profiling is an important step in studying biological associations and identifying marker candidates. miRNA exists in isoforms, called isomiRs, which may exhibit distinct properties. With conventional profiling methods, limitations in assay and analysis platforms may compromise isomiR interrogation. We introduce a comprehensive approach to sequence-oriented isomiR annotation (CASMIR) to allow unbiased identification of global isomiRs from small RNA sequencing data. In this approach, small RNA reads are maintained as independent sequences instead of being summarized under miRNA names. IsomiR features are identified through step-wise local alignment against canonical forms and precursor sequences. Through customizing the reference database, CASMIR is applicable to isomiR annotation across species. To demonstrate its application, we investigated isomiR profiles in normal and neoplastic human colorectal epithelia. We also ran miRDeep2, a popular miRNA analysis algorithm to validate isomiRs annotated by CASMIR. With CASMIR, specific and biologically relevant isomiR patterns could be identified. We note that specific isomiRs are often more abundant than their canonical forms. We identify isomiRs that are commonly up-regulated in both colorectal cancer and advanced adenoma, and illustrate advantages in targeting isomiRs as potential biomarkers over canonical forms. Studying miRNAs at the isomiR level could reveal new insight into miRNA biology and inform assay design for specific isomiRs. CASMIR facilitates comprehensive annotation of isomiR features in small RNA sequencing data for isomiR profiling and differential expression analysis.

  11. Isolation and characterization of target sequences of the chicken CdxA homeobox gene.

    PubMed Central

    Margalit, Y; Yarus, S; Shapira, E; Gruenbaum, Y; Fainsod, A

    1993-01-01

    The DNA binding specificity of the chicken homeodomain protein CDXA was studied. Using a CDXA-glutathione-S-transferase fusion protein, DNA fragments containing the binding site for this protein were isolated. The sources of DNA were oligonucleotides with random sequence and chicken genomic DNA. The DNA fragments isolated were sequenced and tested in DNA binding assays. Sequencing revealed that most DNA fragments are AT rich which is a common feature of homeodomain binding sites. By electrophoretic mobility shift assays it was shown that the different target sequences isolated bind to the CDXA protein with different affinities. The specific sequences bound by the CDXA protein in the genomic fragments isolated, were determined by DNase I footprinting. From the footprinted sequences, the CDXA consensus binding site was determined. The CDXA protein binds the consensus sequence A, A/T, T, A/T, A, T, A/G. The CAUDAL binding site in the ftz promoter is also included in this consensus sequence. When tested, some of the genomic target sequences were capable of enhancing the transcriptional activity of reporter plasmids when introduced into CDXA expressing cells. This study determined the DNA sequence specificity of the CDXA protein and it also shows that this protein can further activate transcription in cells in culture. Images PMID:7909943

  12. Influence of DNA sequence on the structure of minicircles under torsional stress

    PubMed Central

    Wang, Qian; Irobalieva, Rossitza N.; Chiu, Wah; Schmid, Michael F.; Fogg, Jonathan M.; Zechiedrich, Lynn

    2017-01-01

    Abstract The sequence dependence of the conformational distribution of DNA under various levels of torsional stress is an important unsolved problem. Combining theory and coarse-grained simulations shows that the DNA sequence and a structural correlation due to topology constraints of a circle are the main factors that dictate the 3D structure of a 336 bp DNA minicircle under torsional stress. We found that DNA minicircle topoisomers can have multiple bend locations under high torsional stress and that the positions of these sharp bends are determined by the sequence, and by a positive mechanical correlation along the sequence. We showed that simulations and theory are able to provide sequence-specific information about individual DNA minicircles observed by cryo-electron tomography (cryo-ET). We provided a sequence-specific cryo-ET tomogram fitting of DNA minicircles, registering the sequence within the geometric features. Our results indicate that the conformational distribution of minicircles under torsional stress can be designed, which has important implications for using minicircle DNA for gene therapy. PMID:28609782

  13. Nucleic acid sequence detection using multiplexed oligonucleotide PCR

    DOEpatents

    Nolan, John P [Santa Fe, NM; White, P Scott [Los Alamos, NM

    2006-12-26

    Methods for rapidly detecting single or multiple sequence alleles in a sample nucleic acid are described. Provided are all of the oligonucleotide pairs capable of annealing specifically to a target allele and discriminating among possible sequences thereof, and ligating to each other to form an oligonucleotide complex when a particular sequence feature is present (or, alternatively, absent) in the sample nucleic acid. The design of each oligonucleotide pair permits the subsequent high-level PCR amplification of a specific amplicon when the oligonucleotide complex is formed, but not when the oligonucleotide complex is not formed. The presence or absence of the specific amplicon is used to detect the allele. Detection of the specific amplicon may be achieved using a variety of methods well known in the art, including without limitation, oligonucleotide capture onto DNA chips or microarrays, oligonucleotide capture onto beads or microspheres, electrophoresis, and mass spectrometry. Various labels and address-capture tags may be employed in the amplicon detection step of multiplexed assays, as further described herein.

  14. Sequence2Vec: a novel embedding approach for modeling transcription factor binding affinity landscape.

    PubMed

    Dai, Hanjun; Umarov, Ramzan; Kuwahara, Hiroyuki; Li, Yu; Song, Le; Gao, Xin

    2017-11-15

    An accurate characterization of transcription factor (TF)-DNA affinity landscape is crucial to a quantitative understanding of the molecular mechanisms underpinning endogenous gene regulation. While recent advances in biotechnology have brought the opportunity for building binding affinity prediction methods, the accurate characterization of TF-DNA binding affinity landscape still remains a challenging problem. Here we propose a novel sequence embedding approach for modeling the transcription factor binding affinity landscape. Our method represents DNA binding sequences as a hidden Markov model which captures both position specific information and long-range dependency in the sequence. A cornerstone of our method is a novel message passing-like embedding algorithm, called Sequence2Vec, which maps these hidden Markov models into a common nonlinear feature space and uses these embedded features to build a predictive model. Our method is a novel combination of the strength of probabilistic graphical models, feature space embedding and deep learning. We conducted comprehensive experiments on over 90 large-scale TF-DNA datasets which were measured by different high-throughput experimental technologies. Sequence2Vec outperforms alternative machine learning methods as well as the state-of-the-art binding affinity prediction methods. Our program is freely available at https://github.com/ramzan1990/sequence2vec. xin.gao@kaust.edu.sa or lsong@cc.gatech.edu. Supplementary data are available at Bioinformatics online. © The Author(s) 2017. Published by Oxford University Press.

  15. Object Tracking Using Adaptive Covariance Descriptor and Clustering-Based Model Updating for Visual Surveillance

    PubMed Central

    Qin, Lei; Snoussi, Hichem; Abdallah, Fahed

    2014-01-01

    We propose a novel approach for tracking an arbitrary object in video sequences for visual surveillance. The first contribution of this work is an automatic feature extraction method that is able to extract compact discriminative features from a feature pool before computing the region covariance descriptor. As the feature extraction method is adaptive to a specific object of interest, we refer to the region covariance descriptor computed using the extracted features as the adaptive covariance descriptor. The second contribution is to propose a weakly supervised method for updating the object appearance model during tracking. The method performs a mean-shift clustering procedure among the tracking result samples accumulated during a period of time and selects a group of reliable samples for updating the object appearance model. As such, the object appearance model is kept up-to-date and is prevented from contamination even in case of tracking mistakes. We conducted comparing experiments on real-world video sequences, which confirmed the effectiveness of the proposed approaches. The tracking system that integrates the adaptive covariance descriptor and the clustering-based model updating method accomplished stable object tracking on challenging video sequences. PMID:24865883

  16. The complete genome sequencing of Prevotella intermedia strain OMA14 and a subsequent fine-scale, intra-species genomic comparison reveal an unusual amplification of conjugative and mobile transposons and identify a novel Prevotella-lineage-specific repeat

    PubMed Central

    Naito, Mariko; Ogura, Yoshitoshi; Itoh, Takehiko; Shoji, Mikio; Okamoto, Masaaki; Hayashi, Tetsuya; Nakayama, Koji

    2016-01-01

    Prevotella intermedia is a pathogenic bacterium involved in periodontal diseases. Here, we present the complete genome sequence of a clinical strain, OMA14, of this bacterium along with the results of comparative genome analysis with strain 17 of the same species whose genome has also been sequenced, but not fully analysed yet. The genomes of both strains consist of two circular chromosomes: the larger chromosomes are similar in size and exhibit a high overall linearity of gene organizations, whereas the smaller chromosomes show a significant size variation and have undergone remarkable genome rearrangements. Unique features of the Pre. intermedia genomes are the presence of a remarkable number of essential genes on the second chromosomes and the abundance of conjugative and mobilizable transposons (CTns and MTns). The CTns/MTns are particularly abundant in the second chromosomes, involved in its extensive genome rearrangement, and have introduced a number of strain-specific genes into each strain. We also found a novel 188-bp repeat sequence that has been highly amplified in Pre. intermedia and are specifically distributed among the Pre. intermedia-related species. These findings expand our understanding of the genetic features of Pre. intermedia and the roles of CTns and MTns in the evolution of bacteria. PMID:26645327

  17. Uncultivated Microbial Eukaryotic Diversity: A Method to Link ssu rRNA Gene Sequences with Morphology

    PubMed Central

    Hirst, Marissa B.; Kita, Kelley N.; Dawson, Scott C.

    2011-01-01

    Protists have traditionally been identified by cultivation and classified taxonomically based on their cellular morphologies and behavior. In the past decade, however, many novel protist taxa have been identified using cultivation independent ssu rRNA sequence surveys. New rRNA “phylotypes” from uncultivated eukaryotes have no connection to the wealth of prior morphological descriptions of protists. To link phylogenetically informative sequences with taxonomically informative morphological descriptions, we demonstrate several methods for combining whole cell rRNA-targeted fluorescent in situ hybridization (FISH) with cytoskeletal or organellar immunostaining. Either eukaryote or ciliate-specific ssu rRNA probes were combined with an anti-α-tubulin antibody or phalloidin, a common actin stain, to define cytoskeletal features of uncultivated protists in several environmental samples. The eukaryote ssu rRNA probe was also combined with Mitotracker® or a hydrogenosomal-specific anti-Hsp70 antibody to localize mitochondria and hydrogenosomes, respectively, in uncultivated protists from different environments. Using rRNA probes in combination with immunostaining, we linked ssu rRNA phylotypes with microtubule structure to describe flagellate and ciliate morphology in three diverse environments, and linked Naegleria spp. to their amoeboid morphology using actin staining in hay infusion samples. We also linked uncultivated ciliates to morphologically similar Colpoda-like ciliates using tubulin immunostaining with a ciliate-specific rRNA probe. Combining rRNA-targeted FISH with cytoskeletal immunostaining or stains targeting specific organelles provides a fast, efficient, high throughput method for linking genetic sequences with morphological features in uncultivated protists. When linked to phylotype, morphological descriptions of protists can both complement and vet the increasing number of sequences from uncultivated protists, including those of novel lineages, identified in diverse environments. PMID:22174774

  18. tRNomics: analysis of tRNA genes from 50 genomes of Eukarya, Archaea, and Bacteria reveals anticodon-sparing strategies and domain-specific features.

    PubMed Central

    Marck, Christian; Grosjean, Henri

    2002-01-01

    From 50 genomes of the three domains of life (7 eukarya, 13 archaea, and 30 bacteria), we extracted, analyzed, and compared over 4,000 sequences corresponding to cytoplasmic, nonorganellar tRNAs. For each genome, the complete set of tRNAs required to read the 61 sense codons was identified, which permitted revelation of three major anticodon-sparing strategies. Other features and sequence peculiarities analyzed are the following: (1) fit to the standard cloverleaf structure, (2) characteristic consensus sequences for elongator and initiator tDNAs, (3) frequencies of bases at each sequence position, (4) type and frequencies of conserved 2D and 3D base pairs, (5) anticodon/tDNA usages and anticodon-sparing strategies, (6) identification of the tRNA-Ile with anticodon CAU reading AUA, (7) size of variable arm, (8) occurrence and location of introns, (9) occurrence of 3'-CCA and 5'-extra G encoded at the tDNA level, and (10) distribution of the tRNA genes in genomes and their mode of transcription. Among all tRNA isoacceptors, we found that initiator tDNA-iMet is the most conserved across the three domains, yet domain-specific signatures exist. Also, according to which tRNA feature is considered (5'-extra G encoded in tDNAs-His, AUA codon read by tRNA-Ile with anticodon CAU, presence of intron, absence of "two-out-of-three" reading mode and short V-arm in tDNA-Tyr) Archaea sequester either with Bacteria or Eukarya. No common features between Eukarya and Bacteria not shared with Archaea could be unveiled. Thus, from the tRNomic point of view, Archaea appears as an "intermediate domain" between Eukarya and Bacteria. PMID:12403461

  19. Dynamic visual attention: motion direction versus motion magnitude

    NASA Astrophysics Data System (ADS)

    Bur, A.; Wurtz, P.; Müri, R. M.; Hügli, H.

    2008-02-01

    Defined as an attentive process in the context of visual sequences, dynamic visual attention refers to the selection of the most informative parts of video sequence. This paper investigates the contribution of motion in dynamic visual attention, and specifically compares computer models designed with the motion component expressed either as the speed magnitude or as the speed vector. Several computer models, including static features (color, intensity and orientation) and motion features (magnitude and vector) are considered. Qualitative and quantitative evaluations are performed by comparing the computer model output with human saliency maps obtained experimentally from eye movement recordings. The model suitability is evaluated in various situations (synthetic and real sequences, acquired with fixed and moving camera perspective), showing advantages and inconveniences of each method as well as preferred domain of application.

  20. Prediction of phenotypes of missense mutations in human proteins from biological assemblies.

    PubMed

    Wei, Qiong; Xu, Qifang; Dunbrack, Roland L

    2013-02-01

    Single nucleotide polymorphisms (SNPs) are the most frequent variation in the human genome. Nonsynonymous SNPs that lead to missense mutations can be neutral or deleterious, and several computational methods have been presented that predict the phenotype of human missense mutations. These methods use sequence-based and structure-based features in various combinations, relying on different statistical distributions of these features for deleterious and neutral mutations. One structure-based feature that has not been studied significantly is the accessible surface area within biologically relevant oligomeric assemblies. These assemblies are different from the crystallographic asymmetric unit for more than half of X-ray crystal structures. We find that mutations in the core of proteins or in the interfaces in biological assemblies are significantly more likely to be disease-associated than those on the surface of the biological assemblies. For structures with more than one protein in the biological assembly (whether the same sequence or different), we find the accessible surface area from biological assemblies provides a statistically significant improvement in prediction over the accessible surface area of monomers from protein crystal structures (P = 6e-5). When adding this information to sequence-based features such as the difference between wildtype and mutant position-specific profile scores, the improvement from biological assemblies is statistically significant but much smaller (P = 0.018). Combining this information with sequence-based features in a support vector machine leads to 82% accuracy on a balanced dataset of 50% disease-associated mutations from SwissVar and 50% neutral mutations from human/primate sequence differences in orthologous proteins. Copyright © 2012 Wiley Periodicals, Inc.

  1. Statistical properties of DNA sequences

    NASA Technical Reports Server (NTRS)

    Peng, C. K.; Buldyrev, S. V.; Goldberger, A. L.; Havlin, S.; Mantegna, R. N.; Simons, M.; Stanley, H. E.

    1995-01-01

    We review evidence supporting the idea that the DNA sequence in genes containing non-coding regions is correlated, and that the correlation is remarkably long range--indeed, nucleotides thousands of base pairs distant are correlated. We do not find such a long-range correlation in the coding regions of the gene. We resolve the problem of the "non-stationarity" feature of the sequence of base pairs by applying a new algorithm called detrended fluctuation analysis (DFA). We address the claim of Voss that there is no difference in the statistical properties of coding and non-coding regions of DNA by systematically applying the DFA algorithm, as well as standard FFT analysis, to every DNA sequence (33301 coding and 29453 non-coding) in the entire GenBank database. Finally, we describe briefly some recent work showing that the non-coding sequences have certain statistical features in common with natural and artificial languages. Specifically, we adapt to DNA the Zipf approach to analyzing linguistic texts. These statistical properties of non-coding sequences support the possibility that non-coding regions of DNA may carry biological information.

  2. Chemical property based sequence characterization of PpcA and its homolog proteins PpcB-E: A mathematical approach

    PubMed Central

    Pal Choudhury, Pabitra

    2017-01-01

    Periplasmic c7 type cytochrome A (PpcA) protein is determined in Geobacter sulfurreducens along with its other four homologs (PpcB-E). From the crystal structure viewpoint the observation emerges that PpcA protein can bind with Deoxycholate (DXCA), while its other homologs do not. But it is yet to be established with certainty the reason behind this from primary protein sequence information. This study is primarily based on primary protein sequence analysis through the chemical basis of embedded amino acids. Firstly, we look for the chemical group specific score of amino acids. Along with this, we have developed a new methodology for the phylogenetic analysis based on chemical group dissimilarities of amino acids. This new methodology is applied to the cytochrome c7 family members and pinpoint how a particular sequence is differing with others. Secondly, we build a graph theoretic model on using amino acid sequences which is also applied to the cytochrome c7 family members and some unique characteristics and their domains are highlighted. Thirdly, we search for unique patterns as subsequences which are common among the group or specific individual member. In all the cases, we are able to show some distinct features of PpcA that emerges PpcA as an outstanding protein compared to its other homologs, resulting towards its binding with deoxycholate. Similarly, some notable features for the structurally dissimilar protein PpcD compared to the other homologs are also brought out. Further, the five members of cytochrome family being homolog proteins, they must have some common significant features which are also enumerated in this study. PMID:28362850

  3. Vander Lugt correlation of DNA sequence data

    NASA Astrophysics Data System (ADS)

    Christens-Barry, William A.; Hawk, James F.; Martin, James C.

    1990-12-01

    DNA, the molecule containing the genetic code of an organism, is a linear chain of subunits. It is the sequence of subunits, of which there are four kinds, that constitutes the unique blueprint of an individual. This sequence is the focus of a large number of analyses performed by an army of geneticists, biologists, and computer scientists. Most of these analyses entail searches for specific subsequences within the larger set of sequence data. Thus, most analyses are essentially pattern recognition or correlation tasks. Yet, there are special features to such analysis that influence the strategy and methods of an optical pattern recognition approach. While the serial processing employed in digital electronic computers remains the main engine of sequence analyses, there is no fundamental reason that more efficient parallel methods cannot be used. We describe an approach using optical pattern recognition (OPR) techniques based on matched spatial filtering. This allows parallel comparison of large blocks of sequence data. In this study we have simulated a Vander Lugt1 architecture implementing our approach. Searches for specific target sequence strings within a block of DNA sequence from the Co/El plasmid2 are performed.

  4. Mining SNPs from EST sequences using filters and ensemble classifiers.

    PubMed

    Wang, J; Zou, Q; Guo, M Z

    2010-05-04

    Abundant single nucleotide polymorphisms (SNPs) provide the most complete information for genome-wide association studies. However, due to the bottleneck of manual discovery of putative SNPs and the inaccessibility of the original sequencing reads, it is essential to develop a more efficient and accurate computational method for automated SNP detection. We propose a novel computational method to rapidly find true SNPs in public-available EST (expressed sequence tag) databases; this method is implemented as SNPDigger. EST sequences are clustered and aligned. SNP candidates are then obtained according to a measure of redundant frequency. Several new informative biological features, such as the structural neighbor profiles and the physical position of the SNP, were extracted from EST sequences, and the effectiveness of these features was demonstrated. An ensemble classifier, which employs a carefully selected feature set, was included for the imbalanced training data. The sensitivity and specificity of our method both exceeded 80% for human genetic data in the cross validation. Our method enables detection of SNPs from the user's own EST dataset and can be used on species for which there is no genome data. Our tests showed that this method can effectively guide SNP discovery in ESTs and will be useful to avoid and save the cost of biological analyses.

  5. DNABP: Identification of DNA-Binding Proteins Based on Feature Selection Using a Random Forest and Predicting Binding Residues.

    PubMed

    Ma, Xin; Guo, Jing; Sun, Xiao

    2016-01-01

    DNA-binding proteins are fundamentally important in cellular processes. Several computational-based methods have been developed to improve the prediction of DNA-binding proteins in previous years. However, insufficient work has been done on the prediction of DNA-binding proteins from protein sequence information. In this paper, a novel predictor, DNABP (DNA-binding proteins), was designed to predict DNA-binding proteins using the random forest (RF) classifier with a hybrid feature. The hybrid feature contains two types of novel sequence features, which reflect information about the conservation of physicochemical properties of the amino acids, and the binding propensity of DNA-binding residues and non-binding propensities of non-binding residues. The comparisons with each feature demonstrated that these two novel features contributed most to the improvement in predictive ability. Furthermore, to improve the prediction performance of the DNABP model, feature selection using the minimum redundancy maximum relevance (mRMR) method combined with incremental feature selection (IFS) was carried out during the model construction. The results showed that the DNABP model could achieve 86.90% accuracy, 83.76% sensitivity, 90.03% specificity and a Matthews correlation coefficient of 0.727. High prediction accuracy and performance comparisons with previous research suggested that DNABP could be a useful approach to identify DNA-binding proteins from sequence information. The DNABP web server system is freely available at http://www.cbi.seu.edu.cn/DNABP/.

  6. Prediction of plant pre-microRNAs and their microRNAs in genome-scale sequences using structure-sequence features and support vector machine.

    PubMed

    Meng, Jun; Liu, Dong; Sun, Chao; Luan, Yushi

    2014-12-30

    MicroRNAs (miRNAs) are a family of non-coding RNAs approximately 21 nucleotides in length that play pivotal roles at the post-transcriptional level in animals, plants and viruses. These molecules silence their target genes by degrading transcription or suppressing translation. Studies have shown that miRNAs are involved in biological responses to a variety of biotic and abiotic stresses. Identification of these molecules and their targets can aid the understanding of regulatory processes. Recently, prediction methods based on machine learning have been widely used for miRNA prediction. However, most of these methods were designed for mammalian miRNA prediction, and few are available for predicting miRNAs in the pre-miRNAs of specific plant species. Although the complete Solanum lycopersicum genome has been published, only 77 Solanum lycopersicum miRNAs have been identified, far less than the estimated number. Therefore, it is essential to develop a prediction method based on machine learning to identify new plant miRNAs. A novel classification model based on a support vector machine (SVM) was trained to identify real and pseudo plant pre-miRNAs together with their miRNAs. An initial set of 152 novel features related to sequential structures was used to train the model. By applying feature selection, we obtained the best subset of 47 features for use with the Back Support Vector Machine-Recursive Feature Elimination (B-SVM-RFE) method for the classification of plant pre-miRNAs. Using this method, 63 features were obtained for plant miRNA classification. We then developed an integrated classification model, miPlantPreMat, which comprises MiPlantPre and MiPlantMat, to identify plant pre-miRNAs and their miRNAs. This model achieved approximately 90% accuracy using plant datasets from nine plant species, including Arabidopsis thaliana, Glycine max, Oryza sativa, Physcomitrella patens, Medicago truncatula, Sorghum bicolor, Arabidopsis lyrata, Zea mays and Solanum lycopersicum. Using miPlantPreMat, 522 Solanum lycopersicum miRNAs were identified in the Solanum lycopersicum genome sequence. We developed an integrated classification model, miPlantPreMat, based on structure-sequence features and SVM. MiPlantPreMat was used to identify both plant pre-miRNAs and the corresponding mature miRNAs. An improved feature selection method was proposed, resulting in high classification accuracy, sensitivity and specificity.

  7. STAT1:DNA sequence-dependent binding modulation by phosphorylation, protein:protein interactions and small-molecule inhibition

    PubMed Central

    Bonham, Andrew J.; Wenta, Nikola; Osslund, Leah M.; Prussin, Aaron J.; Vinkemeier, Uwe; Reich, Norbert O.

    2013-01-01

    The DNA-binding specificity and affinity of the dimeric human transcription factor (TF) STAT1, were assessed by total internal reflectance fluorescence protein-binding microarrays (TIRF-PBM) to evaluate the effects of protein phosphorylation, higher-order polymerization and small-molecule inhibition. Active, phosphorylated STAT1 showed binding preferences consistent with prior characterization, whereas unphosphorylated STAT1 showed a weak-binding preference for one-half of the GAS consensus site, consistent with recent models of STAT1 structure and function in response to phosphorylation. This altered-binding preference was further tested by use of the inhibitor LLL3, which we show to disrupt STAT1 binding in a sequence-dependent fashion. To determine if this sequence-dependence is specific to STAT1 and not a general feature of human TF biology, the TF Myc/Max was analysed and tested with the inhibitor Mycro3. Myc/Max inhibition by Mycro3 is sequence independent, suggesting that the sequence-dependent inhibition of STAT1 may be specific to this system and a useful target for future inhibitor design. PMID:23180800

  8. Functional Testing of SLC26A4 Variants—Clinical and Molecular Analysis of a Cohort with Enlarged Vestibular Aqueduct from Austria

    PubMed Central

    Bernardinelli, Emanuele; Nofziger, Charity; Patsch, Wolfgang; Rasp, Gerd; Paulmichl, Markus; Dossena, Silvia

    2018-01-01

    The prevalence and spectrum of sequence alterations in the SLC26A4 gene, which codes for the anion exchanger pendrin, are population-specific and account for at least 50% of cases of non-syndromic hearing loss associated with an enlarged vestibular aqueduct. A cohort of nineteen patients from Austria with hearing loss and a radiological alteration of the vestibular aqueduct underwent Sanger sequencing of SLC26A4 and GJB2, coding for connexin 26. The pathogenicity of sequence alterations detected was assessed by determining ion transport and molecular features of the corresponding SLC26A4 protein variants. In this group, four uncharacterized sequence alterations within the SLC26A4 coding region were found. Three of these lead to protein variants with abnormal functional and molecular features, while one should be considered with no pathogenic potential. Pathogenic SLC26A4 sequence alterations were only found in 12% of patients. SLC26A4 sequence alterations commonly found in other Caucasian populations were not detected. This survey represents the first study on the prevalence and spectrum of SLC26A4 sequence alterations in an Austrian cohort and further suggests that genetic testing should always be integrated with functional characterization and determination of the molecular features of protein variants in order to unequivocally identify or exclude a causal link between genotype and phenotype. PMID:29320412

  9. HMPAS: Human Membrane Protein Analysis System

    PubMed Central

    2013-01-01

    Background Membrane proteins perform essential roles in diverse cellular functions and are regarded as major pharmaceutical targets. The significance of membrane proteins has led to the developing dozens of resources related with membrane proteins. However, most of these resources are built for specific well-known membrane protein groups, making it difficult to find common and specific features of various membrane protein groups. Methods We collected human membrane proteins from the dispersed resources and predicted novel membrane protein candidates by using ortholog information and our membrane protein classifiers. The membrane proteins were classified according to the type of interaction with the membrane, subcellular localization, and molecular function. We also made new feature dataset to characterize the membrane proteins in various aspects including membrane protein topology, domain, biological process, disease, and drug. Moreover, protein structure and ICD-10-CM based integrated disease and drug information was newly included. To analyze the comprehensive information of membrane proteins, we implemented analysis tools to identify novel sequence and functional features of the classified membrane protein groups and to extract features from protein sequences. Results We constructed HMPAS with 28,509 collected known membrane proteins and 8,076 newly predicted candidates. This system provides integrated information of human membrane proteins individually and in groups organized by 45 subcellular locations and 1,401 molecular functions. As a case study, we identified associations between the membrane proteins and diseases and present that membrane proteins are promising targets for diseases related with nervous system and circulatory system. A web-based interface of this system was constructed to facilitate researchers not only to retrieve organized information of individual proteins but also to use the tools to analyze the membrane proteins. Conclusions HMPAS provides comprehensive information about human membrane proteins including specific features of certain membrane protein groups. In this system, user can acquire the information of individual proteins and specified groups focused on their conserved sequence features, involved cellular processes, and diseases. HMPAS may contribute as a valuable resource for the inference of novel cellular mechanisms and pharmaceutical targets associated with the human membrane proteins. HMPAS is freely available at http://fcode.kaist.ac.kr/hmpas. PMID:24564858

  10. The complete genome sequencing of Prevotella intermedia strain OMA14 and a subsequent fine-scale, intra-species genomic comparison reveal an unusual amplification of conjugative and mobile transposons and identify a novel Prevotella-lineage-specific repeat.

    PubMed

    Naito, Mariko; Ogura, Yoshitoshi; Itoh, Takehiko; Shoji, Mikio; Okamoto, Masaaki; Hayashi, Tetsuya; Nakayama, Koji

    2016-02-01

    Prevotella intermedia is a pathogenic bacterium involved in periodontal diseases. Here, we present the complete genome sequence of a clinical strain, OMA14, of this bacterium along with the results of comparative genome analysis with strain 17 of the same species whose genome has also been sequenced, but not fully analysed yet. The genomes of both strains consist of two circular chromosomes: the larger chromosomes are similar in size and exhibit a high overall linearity of gene organizations, whereas the smaller chromosomes show a significant size variation and have undergone remarkable genome rearrangements. Unique features of the Pre. intermedia genomes are the presence of a remarkable number of essential genes on the second chromosomes and the abundance of conjugative and mobilizable transposons (CTns and MTns). The CTns/MTns are particularly abundant in the second chromosomes, involved in its extensive genome rearrangement, and have introduced a number of strain-specific genes into each strain. We also found a novel 188-bp repeat sequence that has been highly amplified in Pre. intermedia and are specifically distributed among the Pre. intermedia-related species. These findings expand our understanding of the genetic features of Pre. intermedia and the roles of CTns and MTns in the evolution of bacteria. © The Author 2015. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.

  11. RISSC: a novel database for ribosomal 16S–23S RNA genes spacer regions

    PubMed Central

    García-Martínez, Jesús; Bescós, Ignacio; Rodríguez-Sala, Jesús Javier; Rodríguez-Valera, Francisco

    2001-01-01

    A novel database, under the acronym RISSC (Ribosomal Intergenic Spacer Sequence Collection), has been created. It compiles more than 1600 entries of edited DNA sequence data from the 16S–23S ribosomal spacers present in most prokaryotes and organelles (e.g. mitochondria and chloroplasts) and is accessible through the Internet (http://ulises.umh.es/RISSC), where systematic searches for specific words can be conducted, as well as BLAST-type sequence searches. Additionally, a characteristic feature of this region, the presence/absence and nature of tRNA genes within the spacer, is included in all the entries, even when not previously indicated in the original database. All these combined features could provide a useful documen­tation tool for studies on evolution, identification, typing and strain characterization, among others. PMID:11125084

  12. Centromere Binding and Evolution of Chromosomal Partition Systems in the Burkholderiales

    PubMed Central

    Passot, Fanny M.; Calderon, Virginie; Fichant, Gwennaele; Lane, David

    2012-01-01

    How split genomes arise and evolve in bacteria is poorly understood. Since each replicon of such genomes encodes a specific partition (Par) system, the evolution of Par systems could shed light on their evolution. The cystic fibrosis pathogen Burkholderia cenocepacia has three chromosomes (c1, c2, and c3) and one plasmid (pBC), whose compatibility depends on strictly specific interactions of the centromere sequences (parS) with their cognate binding proteins (ParB). However, the Par systems of B. cenocepacia c2, c3, and pBC share many features, suggesting that they arose within an extended family. Database searching revealed seven subfamilies of Par systems like those of B. cenocepacia. All are from plasmids and secondary chromosomes of the Burkholderiales, which reinforces the proposal of an extended family. The subfamily of the Par system of B. cenocepacia c3 includes plasmid variants with parS sequences divergent from that of c3. Using electrophoretic mobility shift assay (EMSA), we found that ParB-c3 binds specifically to centromeres of these variants, despite high DNA sequence divergence. We suggest that the Par system of B. cenocepacia c3 has preserved the features of an ancestral system. In contrast, these features have diverged variably in the plasmid descendants. One such descendant is found both in Ralstonia pickettii 12D, on a free plasmid, and in Ralstonia pickettii 12J, on a plasmid integrated into the main chromosome. These observations suggest that we are witnessing a plasmid-chromosome interaction from which a third chromosome will emerge in a two-chromosome species. PMID:22522899

  13. Centromere binding and evolution of chromosomal partition systems in the Burkholderiales.

    PubMed

    Passot, Fanny M; Calderon, Virginie; Fichant, Gwennaele; Lane, David; Pasta, Franck

    2012-07-01

    How split genomes arise and evolve in bacteria is poorly understood. Since each replicon of such genomes encodes a specific partition (Par) system, the evolution of Par systems could shed light on their evolution. The cystic fibrosis pathogen Burkholderia cenocepacia has three chromosomes (c1, c2, and c3) and one plasmid (pBC), whose compatibility depends on strictly specific interactions of the centromere sequences (parS) with their cognate binding proteins (ParB). However, the Par systems of B. cenocepacia c2, c3, and pBC share many features, suggesting that they arose within an extended family. Database searching revealed seven subfamilies of Par systems like those of B. cenocepacia. All are from plasmids and secondary chromosomes of the Burkholderiales, which reinforces the proposal of an extended family. The subfamily of the Par system of B. cenocepacia c3 includes plasmid variants with parS sequences divergent from that of c3. Using electrophoretic mobility shift assay (EMSA), we found that ParB-c3 binds specifically to centromeres of these variants, despite high DNA sequence divergence. We suggest that the Par system of B. cenocepacia c3 has preserved the features of an ancestral system. In contrast, these features have diverged variably in the plasmid descendants. One such descendant is found both in Ralstonia pickettii 12D, on a free plasmid, and in Ralstonia pickettii 12J, on a plasmid integrated into the main chromosome. These observations suggest that we are witnessing a plasmid-chromosome interaction from which a third chromosome will emerge in a two-chromosome species.

  14. iNuc-PhysChem: A Sequence-Based Predictor for Identifying Nucleosomes via Physicochemical Properties

    PubMed Central

    Feng, Peng-Mian; Ding, Chen; Zuo, Yong-Chun; Chou, Kuo-Chen

    2012-01-01

    Nucleosome positioning has important roles in key cellular processes. Although intensive efforts have been made in this area, the rules defining nucleosome positioning is still elusive and debated. In this study, we carried out a systematic comparison among the profiles of twelve DNA physicochemical features between the nucleosomal and linker sequences in the Saccharomyces cerevisiae genome. We found that nucleosomal sequences have some position-specific physicochemical features, which can be used for in-depth studying nucleosomes. Meanwhile, a new predictor, called iNuc-PhysChem, was developed for identification of nucleosomal sequences by incorporating these physicochemical properties into a 1788-D (dimensional) feature vector, which was further reduced to a 884-D vector via the IFS (incremental feature selection) procedure to optimize the feature set. It was observed by a cross-validation test on a benchmark dataset that the overall success rate achieved by iNuc-PhysChem was over 96% in identifying nucleosomal or linker sequences. As a web-server, iNuc-PhysChem is freely accessible to the public at http://lin.uestc.edu.cn/server/iNuc-PhysChem. For the convenience of the vast majority of experimental scientists, a step-by-step guide is provided on how to use the web-server to get the desired results without the need to follow the complicated mathematics that were presented just for the integrity in developing the predictor. Meanwhile, for those who prefer to run predictions in their own computers, the predictor's code can be easily downloaded from the web-server. It is anticipated that iNuc-PhysChem may become a useful high throughput tool for both basic research and drug design. PMID:23144709

  15. Predicting membrane protein types by fusing composite protein sequence features into pseudo amino acid composition.

    PubMed

    Hayat, Maqsood; Khan, Asifullah

    2011-02-21

    Membrane proteins are vital type of proteins that serve as channels, receptors, and energy transducers in a cell. Prediction of membrane protein types is an important research area in bioinformatics. Knowledge of membrane protein types provides some valuable information for predicting novel example of the membrane protein types. However, classification of membrane protein types can be both time consuming and susceptible to errors due to the inherent similarity of membrane protein types. In this paper, neural networks based membrane protein type prediction system is proposed. Composite protein sequence representation (CPSR) is used to extract the features of a protein sequence, which includes seven feature sets; amino acid composition, sequence length, 2 gram exchange group frequency, hydrophobic group, electronic group, sum of hydrophobicity, and R-group. Principal component analysis is then employed to reduce the dimensionality of the feature vector. The probabilistic neural network (PNN), generalized regression neural network, and support vector machine (SVM) are used as classifiers. A high success rate of 86.01% is obtained using SVM for the jackknife test. In case of independent dataset test, PNN yields the highest accuracy of 95.73%. These classifiers exhibit improved performance using other performance measures such as sensitivity, specificity, Mathew's correlation coefficient, and F-measure. The experimental results show that the prediction performance of the proposed scheme for classifying membrane protein types is the best reported, so far. This performance improvement may largely be credited to the learning capabilities of neural networks and the composite feature extraction strategy, which exploits seven different properties of protein sequences. The proposed Mem-Predictor can be accessed at http://111.68.99.218/Mem-Predictor. Copyright © 2010 Elsevier Ltd. All rights reserved.

  16. Assessing the Pathogenicity of Insertion and Deletion Variants with the Variant Effect Scoring Tool (VEST‐Indel)

    PubMed Central

    Douville, Christopher; Masica, David L.; Stenson, Peter D.; Cooper, David N.; Gygax, Derek M.; Kim, Rick; Ryan, Michael

    2015-01-01

    ABSTRACT Insertion/deletion variants (indels) alter protein sequence and length, yet are highly prevalent in healthy populations, presenting a challenge to bioinformatics classifiers. Commonly used features—DNA and protein sequence conservation, indel length, and occurrence in repeat regions—are useful for inference of protein damage. However, these features can cause false positives when predicting the impact of indels on disease. Existing methods for indel classification suffer from low specificities, severely limiting clinical utility. Here, we further develop our variant effect scoring tool (VEST) to include the classification of in‐frame and frameshift indels (VEST‐indel) as pathogenic or benign. We apply 24 features, including a new “PubMed” feature, to estimate a gene's importance in human disease. When compared with four existing indel classifiers, our method achieves a drastically reduced false‐positive rate, improving specificity by as much as 90%. This approach of estimating gene importance might be generally applicable to missense and other bioinformatics pathogenicity predictors, which often fail to achieve high specificity. Finally, we tested all possible meta‐predictors that can be obtained from combining the four different indel classifiers using Boolean conjunctions and disjunctions, and derived a meta‐predictor with improved performance over any individual method. PMID:26442818

  17. Expression of Immunoglobulin Receptors with Distinctive Features Indicating Antigen Selection by Marginal Zone B Cells from Human Spleen

    PubMed Central

    Colombo, Monica; Cutrona, Giovanna; Reverberi, Daniele; Bruno, Silvia; Ghiotto, Fabio; Tenca, Claudya; Stamatopoulos, Kostas; Hadzidimitriou, Anastasia; Ceccarelli, Jenny; Salvi, Sandra; Boccardo, Simona; Calevo, Maria Grazia; De Santanna, Amleto; Truini, Mauro; Fais, Franco; Ferrarini, Manlio

    2013-01-01

    Marginal zone (MZ) B cells, identified as surface (s)IgMhighsIgDlowCD23low/−CD21+CD38− B cells, were purified from human spleens, and the features of their V(D)J gene rearrangements were investigated and compared with those of germinal center (GC), follicular mantle (FM) and switched memory (SM) B cells. Most MZ B cells were CD27+ and exhibited somatic hypermutations (SHM), although to a lower extent than SM B cells. Moreover, among MZ B-cell rearrangements, recurrent sequences were observed, some of which displayed intraclonal diversification. The same diversifying sequences were detected in very low numbers in GC and FM B cells and only when a highly sensitive, gene-specific polymerase chain reaction was used. This result indicates that MZ B cells could expand and diversify in situ and also suggested the presence of a number of activation-induced cytidine deaminase (AID)-expressing B cells in the MZ. The notion of antigen-driven expansion/selection in situ is further supported by the VH CDR3 features of MZ B cells with highly conserved amino acids at specific positions and by the finding of shared (“stereotyped”) sequences in two different spleens. Collectively, the data are consistent with the notion that MZ B cells are a special subset selected by in situ antigenic stimuli. PMID:23877718

  18. Identification and correction of systematic error in high-throughput sequence data

    PubMed Central

    2011-01-01

    Background A feature common to all DNA sequencing technologies is the presence of base-call errors in the sequenced reads. The implications of such errors are application specific, ranging from minor informatics nuisances to major problems affecting biological inferences. Recently developed "next-gen" sequencing technologies have greatly reduced the cost of sequencing, but have been shown to be more error prone than previous technologies. Both position specific (depending on the location in the read) and sequence specific (depending on the sequence in the read) errors have been identified in Illumina and Life Technology sequencing platforms. We describe a new type of systematic error that manifests as statistically unlikely accumulations of errors at specific genome (or transcriptome) locations. Results We characterize and describe systematic errors using overlapping paired reads from high-coverage data. We show that such errors occur in approximately 1 in 1000 base pairs, and that they are highly replicable across experiments. We identify motifs that are frequent at systematic error sites, and describe a classifier that distinguishes heterozygous sites from systematic error. Our classifier is designed to accommodate data from experiments in which the allele frequencies at heterozygous sites are not necessarily 0.5 (such as in the case of RNA-Seq), and can be used with single-end datasets. Conclusions Systematic errors can easily be mistaken for heterozygous sites in individuals, or for SNPs in population analyses. Systematic errors are particularly problematic in low coverage experiments, or in estimates of allele-specific expression from RNA-Seq data. Our characterization of systematic error has allowed us to develop a program, called SysCall, for identifying and correcting such errors. We conclude that correction of systematic errors is important to consider in the design and interpretation of high-throughput sequencing experiments. PMID:22099972

  19. Tardigrade workbench: comparing stress-related proteins, sequence-similar and functional protein clusters as well as RNA elements in tardigrades

    PubMed Central

    2009-01-01

    Background Tardigrades represent an animal phylum with extraordinary resistance to environmental stress. Results To gain insights into their stress-specific adaptation potential, major clusters of related and similar proteins are identified, as well as specific functional clusters delineated comparing all tardigrades and individual species (Milnesium tardigradum, Hypsibius dujardini, Echiniscus testudo, Tulinus stephaniae, Richtersius coronifer) and functional elements in tardigrade mRNAs are analysed. We find that 39.3% of the total sequences clustered in 58 clusters of more than 20 proteins. Among these are ten tardigrade specific as well as a number of stress-specific protein clusters. Tardigrade-specific functional adaptations include strong protein, DNA- and redox protection, maintenance and protein recycling. Specific regulatory elements regulate tardigrade mRNA stability such as lox P DICE elements whereas 14 other RNA elements of higher eukaryotes are not found. Further features of tardigrade specific adaption are rapidly identified by sequence and/or pattern search on the web-tool tardigrade analyzer http://waterbear.bioapps.biozentrum.uni-wuerzburg.de. The work-bench offers nucleotide pattern analysis for promotor and regulatory element detection (tardigrade specific; nrdb) as well as rapid COG search for function assignments including species-specific repositories of all analysed data. Conclusion Different protein clusters and regulatory elements implicated in tardigrade stress adaptations are analysed including unpublished tardigrade sequences. PMID:19821996

  20. Tardigrade workbench: comparing stress-related proteins, sequence-similar and functional protein clusters as well as RNA elements in tardigrades.

    PubMed

    Förster, Frank; Liang, Chunguang; Shkumatov, Alexander; Beisser, Daniela; Engelmann, Julia C; Schnölzer, Martina; Frohme, Marcus; Müller, Tobias; Schill, Ralph O; Dandekar, Thomas

    2009-10-12

    Tardigrades represent an animal phylum with extraordinary resistance to environmental stress. To gain insights into their stress-specific adaptation potential, major clusters of related and similar proteins are identified, as well as specific functional clusters delineated comparing all tardigrades and individual species (Milnesium tardigradum, Hypsibius dujardini, Echiniscus testudo, Tulinus stephaniae, Richtersius coronifer) and functional elements in tardigrade mRNAs are analysed. We find that 39.3% of the total sequences clustered in 58 clusters of more than 20 proteins. Among these are ten tardigrade specific as well as a number of stress-specific protein clusters. Tardigrade-specific functional adaptations include strong protein, DNA- and redox protection, maintenance and protein recycling. Specific regulatory elements regulate tardigrade mRNA stability such as lox P DICE elements whereas 14 other RNA elements of higher eukaryotes are not found. Further features of tardigrade specific adaption are rapidly identified by sequence and/or pattern search on the web-tool tardigrade analyzer http://waterbear.bioapps.biozentrum.uni-wuerzburg.de. The work-bench offers nucleotide pattern analysis for promotor and regulatory element detection (tardigrade specific; nrdb) as well as rapid COG search for function assignments including species-specific repositories of all analysed data. Different protein clusters and regulatory elements implicated in tardigrade stress adaptations are analysed including unpublished tardigrade sequences.

  1. Histoimmunogenetics Markup Language 1.0: Reporting next generation sequencing-based HLA and KIR genotyping.

    PubMed

    Milius, Robert P; Heuer, Michael; Valiga, Daniel; Doroschak, Kathryn J; Kennedy, Caleb J; Bolon, Yung-Tsi; Schneider, Joel; Pollack, Jane; Kim, Hwa Ran; Cereb, Nezih; Hollenbach, Jill A; Mack, Steven J; Maiers, Martin

    2015-12-01

    We present an electronic format for exchanging data for HLA and KIR genotyping with extensions for next-generation sequencing (NGS). This format addresses NGS data exchange by refining the Histoimmunogenetics Markup Language (HML) to conform to the proposed Minimum Information for Reporting Immunogenomic NGS Genotyping (MIRING) reporting guidelines (miring.immunogenomics.org). Our refinements of HML include two major additions. First, NGS is supported by new XML structures to capture additional NGS data and metadata required to produce a genotyping result, including analysis-dependent (dynamic) and method-dependent (static) components. A full genotype, consensus sequence, and the surrounding metadata are included directly, while the raw sequence reads and platform documentation are externally referenced. Second, genotype ambiguity is fully represented by integrating Genotype List Strings, which use a hierarchical set of delimiters to represent allele and genotype ambiguity in a complete and accurate fashion. HML also continues to enable the transmission of legacy methods (e.g. site-specific oligonucleotide, sequence-specific priming, and Sequence Based Typing (SBT)), adding features such as allowing multiple group-specific sequencing primers, and fully leveraging techniques that combine multiple methods to obtain a single result, such as SBT integrated with NGS. Copyright © 2015 The Authors. Published by Elsevier Inc. All rights reserved.

  2. Equilibrious Strand Exchange Promoted by DNA Conformational Switching

    NASA Astrophysics Data System (ADS)

    Wu, Zhiguo; Xie, Xiao; Li, Puzhen; Zhao, Jiayi; Huang, Lili; Zhou, Xiang

    2013-01-01

    Most of DNA strand exchange reactions in vitro are based on toehold strategy which is generally nonequilibrium, and intracellular strand exchange mediated by proteins shows little sequence specificity. Herein, a new strand exchange promoted by equilibrious DNA conformational switching is verified. Duplexes containing c-myc sequence which is potentially converted into G-quadruplex are designed in this strategy. The dynamic equilibrium between duplex and G4-DNA is response to the specific exchange of homologous single-stranded DNA (ssDNA). The SER is enzyme free and sequence specific. No ATP is needed and the displaced ssDNAs are identical to the homologous ssDNAs. The SER products and exchange kenetics are analyzed by PAGE and the RecA mediated SER is performed as the contrast. This SER is a new feature of G4-DNAs and a novel strategy to utilize the dynamic equilibrium of DNA conformations.

  3. Distinct retroelement classes define evolutionary breakpoints demarcating sites of evolutionary novelty

    PubMed Central

    Longo, Mark S; Carone, Dawn M; Green, Eric D; O'Neill, Michael J; O'Neill, Rachel J

    2009-01-01

    Background Large-scale genome rearrangements brought about by chromosome breaks underlie numerous inherited diseases, initiate or promote many cancers and are also associated with karyotype diversification during species evolution. Recent research has shown that these breakpoints are nonrandomly distributed throughout the mammalian genome and many, termed "evolutionary breakpoints" (EB), are specific genomic locations that are "reused" during karyotypic evolution. When the phylogenetic trajectory of orthologous chromosome segments is considered, many of these EB are coincident with ancient centromere activity as well as new centromere formation. While EB have been characterized as repeat-rich regions, it has not been determined whether specific sequences have been retained during evolution that would indicate previous centromere activity or a propensity for new centromere formation. Likewise, the conservation of specific sequence motifs or classes at EBs among divergent mammalian taxa has not been determined. Results To define conserved sequence features of EBs associated with centromere evolution, we performed comparative sequence analysis of more than 4.8 Mb within the tammar wallaby, Macropus eugenii, derived from centromeric regions (CEN), euchromatic regions (EU), and an evolutionary breakpoint (EB) that has undergone convergent breakpoint reuse and past centromere activity in marsupials. We found a dramatic enrichment for long interspersed nucleotide elements (LINE1s) and endogenous retroviruses (ERVs) and a depletion of short interspersed nucleotide elements (SINEs) shared between CEN and EBs. We analyzed the orthologous human EB (14q32.33), known to be associated with translocations in many cancers including multiple myelomas and plasma cell leukemias, and found a conserved distribution of similar repetitive elements. Conclusion Our data indicate that EBs tracked within the class Mammalia harbor sequence features retained since the divergence of marsupials and eutherians that may have predisposed these genomic regions to large-scale chromosomal instability. PMID:19630942

  4. Structural features of the rice chromosome 4 centromere.

    PubMed

    Zhang, Yu; Huang, Yuchen; Zhang, Lei; Li, Ying; Lu, Tingting; Lu, Yiqi; Feng, Qi; Zhao, Qiang; Cheng, Zhukuan; Xue, Yongbiao; Wing, Rod A; Han, Bin

    2004-01-01

    A complete sequence of a chromosome centromere is necessary for fully understanding centromere function. We reported the sequence structures of the first complete rice chromosome centromere through sequencing a large insert bacterial artificial chromosome clone-based contig, which covered the rice chromosome 4 centromere. Complete sequencing of the 124-kb rice chromosome 4 centromere revealed that it consisted of 18 tracts of 379 tandemly arrayed repeats known as CentO and a total of 19 centromeric retroelements (CRs) but no unique sequences were detected. Four tracts, composed of 65 CentO repeats, were located in the opposite orientation, and 18 CentO tracts were flanked by 19 retroelements. The CRs were classified into four types, and the type I retroelements appeared to be more specific to rice centromeres. The preferential insert of the CRs among CentO repeats indicated that the centromere-specific retroelements may contribute to centromere expansion during evolution. The presence of three intact retrotransposons in the centromere suggests that they may be responsible for functional centromere initiation through a transcription-mediated mechanism.

  5. Statistical and linguistic features of DNA sequences

    NASA Technical Reports Server (NTRS)

    Havlin, S.; Buldyrev, S. V.; Goldberger, A. L.; Mantegna, R. N.; Peng, C. K.; Simons, M.; Stanley, H. E.

    1995-01-01

    We present evidence supporting the idea that the DNA sequence in genes containing noncoding regions is correlated, and that the correlation is remarkably long range--indeed, base pairs thousands of base pairs distant are correlated. We do not find such a long-range correlation in the coding regions of the gene. We resolve the problem of the "non-stationary" feature of the sequence of base pairs by applying a new algorithm called Detrended Fluctuation Analysis (DFA). We address the claim of Voss that there is no difference in the statistical properties of coding and noncoding regions of DNA by systematically applying the DFA algorithm, as well as standard FFT analysis, to all eukaryotic DNA sequences (33 301 coding and 29 453 noncoding) in the entire GenBank database. We describe a simple model to account for the presence of long-range power-law correlations which is based upon a generalization of the classic Levy walk. Finally, we describe briefly some recent work showing that the noncoding sequences have certain statistical features in common with natural languages. Specifically, we adapt to DNA the Zipf approach to analyzing linguistic texts, and the Shannon approach to quantifying the "redundancy" of a linguistic text in terms of a measurable entropy function. We suggest that noncoding regions in plants and invertebrates may display a smaller entropy and larger redundancy than coding regions, further supporting the possibility that noncoding regions of DNA may carry biological information.

  6. Prediction of Protein Structural Classes for Low-Similarity Sequences Based on Consensus Sequence and Segmented PSSM.

    PubMed

    Liang, Yunyun; Liu, Sanyang; Zhang, Shengli

    2015-01-01

    Prediction of protein structural classes for low-similarity sequences is useful for understanding fold patterns, regulation, functions, and interactions of proteins. It is well known that feature extraction is significant to prediction of protein structural class and it mainly uses protein primary sequence, predicted secondary structure sequence, and position-specific scoring matrix (PSSM). Currently, prediction solely based on the PSSM has played a key role in improving the prediction accuracy. In this paper, we propose a novel method called CSP-SegPseP-SegACP by fusing consensus sequence (CS), segmented PsePSSM, and segmented autocovariance transformation (ACT) based on PSSM. Three widely used low-similarity datasets (1189, 25PDB, and 640) are adopted in this paper. Then a 700-dimensional (700D) feature vector is constructed and the dimension is decreased to 224D by using principal component analysis (PCA). To verify the performance of our method, rigorous jackknife cross-validation tests are performed on 1189, 25PDB, and 640 datasets. Comparison of our results with the existing PSSM-based methods demonstrates that our method achieves the favorable and competitive performance. This will offer an important complementary to other PSSM-based methods for prediction of protein structural classes for low-similarity sequences.

  7. Cloning of a newly identified heart-specific troponin I isoform, which lacks the troponin T binding portion, using the yeast hybrid system.

    PubMed

    Suzuki, Hideaki; Arakawa, Yasuhiro; Ito, Masaki; Yamada, Hisashi; Horiguchi-Yamada, Junko

    2006-01-01

    To elucidate the molecular pathogenesis behind increased levels of laminin in cardiac muscle cells in cardiomyopathy by using a yeast hybrid screen. The present study reports the cloning of a newly identified heart-specific troponin I isoform, which is putatively linked to laminin. Future studies will explore the functional significance of this connection. Yeast two-hybrid screen analysis was performed using MLF1-interacting protein (amino acids 1 to 318) as bait. The human heart complementary DNA library was screened by using the yeast-mating method for overnight culture. Two final positive clones from the heart library were isolated. These two clones encoded the same protein, a short isoform of human cardiac troponin I (TnI) that lacked TnI exons 5 and 6. The TnI isoform has a heart-specific expression pattern and it shares several sequence features with human cardiac TnI; however, it lacks the troponin T binding portion. The heart-specific segment of the human cardiac TnI isoform shares several sequence features with human cardiac TnI, but it lacks the troponin T binding portion. These results suggest that the heart-specific TnI isoform may be involved in cardiac development and disease.

  8. Cloning of a newly identified heart-specific troponin I isoform, which lacks the troponin T binding portion, using the yeast hybrid system

    PubMed Central

    Suzuki, Hideaki; Arakawa, Yasuhiro; Ito, Masaki; Yamada, Hisashi; Horiguchi-Yamada, Junko

    2006-01-01

    OBJECTIVE To elucidate the molecular pathogenesis behind increased levels of laminin in cardiac muscle cells in cardiomyopathy by using a yeast hybrid screen. The present study reports the cloning of a newly identified heart-specific troponin I isoform, which is putatively linked to laminin. Future studies will explore the functional significance of this connection. METHODS Yeast two-hybrid screen analysis was performed using MLF1-interacting protein (amino acids 1 to 318) as bait. The human heart complementary DNA library was screened by using the yeast-mating method for overnight culture. RESULTS Two final positive clones from the heart library were isolated. These two clones encoded the same protein, a short isoform of human cardiac troponin I (TnI) that lacked TnI exons 5 and 6. The TnI isoform has a heart-specific expression pattern and it shares several sequence features with human cardiac TnI; however, it lacks the troponin T binding portion. CONCLUSION The heart-specific segment of the human cardiac TnI isoform shares several sequence features with human cardiac TnI, but it lacks the troponin T binding portion. These results suggest that the heart-specific TnI isoform may be involved in cardiac development and disease. PMID:18651010

  9. Systemic Lupus Erythematosus: Molecular Mimicry between Anti-dsDNA CDR3 Idiotype, Microbial and Self Peptides-As Antigens for Th Cells.

    PubMed

    Aas-Hanssen, Kristin; Thompson, Keith M; Bogen, Bjarne; Munthe, Ludvig A

    2015-01-01

    Systemic lupus erythematosus (SLE) is marked by a T helper (Th) cell-dependent B cell hyperresponsiveness, with frequent germinal center reactions, and gammaglobulinemia. A feature of SLE is the finding of IgG autoantibodies specific for dsDNA. The specificity of the Th cells that drive the expansion of anti-dsDNA B cells is unresolved. However, anti-microbial, anti-histone, and anti-idiotype Th cell responses have been hypothesized to play a role. It has been entirely unclear if these seemingly disparate Th cell responses and hypotheses could be related or unified. Here, we describe that H chain CDR3 idiotypes from IgG(+) B cells of lupus mice have sequence similarities with both microbial and self peptides. Matched sequences were more frequent within the mutated CDR3 repertoire and when sequences were derived from lupus mice with expanded anti-dsDNA B cells. Analyses of histone sequences showed that particular histone peptides were similar to VDJ junctions. Moreover, lupus mice had Th cell responses toward histone peptides similar to anti-dsDNA CDR3 sequences. The results suggest that Th cells in lupus may have multiple cross-reactive specificities linked to the IgVH CDR3 Id-peptide sequences as well as similar DNA-associated protein motifs.

  10. Multiple nucleotide preferences determine cleavage-site recognition by the HIV-1 and M-MuLV RNases H.

    PubMed

    Schultz, Sharon J; Zhang, Miaohua; Champoux, James J

    2010-03-19

    The RNase H activity of reverse transcriptase is required during retroviral replication and represents a potential target in antiviral drug therapies. Sequence features flanking a cleavage site influence the three types of retroviral RNase H activity: internal, DNA 3'-end-directed, and RNA 5'-end-directed. Using the reverse transcriptases of HIV-1 (human immunodeficiency virus type 1) and Moloney murine leukemia virus (M-MuLV), we evaluated how individual base preferences at a cleavage site direct retroviral RNase H specificity. Strong test cleavage sites (designated as between nucleotide positions -1 and +1) for the HIV-1 and M-MuLV enzymes were introduced into model hybrid substrates designed to assay internal or DNA 3'-end-directed cleavage, and base substitutions were tested at specific nucleotide positions. For internal cleavage, positions +1, -2, -4, -5, -10, and -14 for HIV-1 and positions +1, -2, -6, and -7 for M-MuLV significantly affected RNase H cleavage efficiency, while positions -7 and -12 for HIV-1 and positions -4, -9, and -11 for M-MuLV had more modest effects. DNA 3'-end-directed cleavage was influenced substantially by positions +1, -2, -4, and -5 for HIV-1 and positions +1, -2, -6, and -7 for M-MuLV. Cleavage-site distance from the recessed end did not affect sequence preferences for M-MuLV reverse transcriptase. Based on the identified sequence preferences, a cleavage site recognized by both HIV-1 and M-MuLV enzymes was introduced into a sequence that was otherwise resistant to RNase H. The isolated RNase H domain of M-MuLV reverse transcriptase retained sequence preferences at positions +1 and -2 despite prolific cleavage in the absence of the polymerase domain. The sequence preferences of retroviral RNase H likely reflect structural features in the substrate that favor cleavage and represent a novel specificity determinant to consider in drug design. Copyright (c) 2010 Elsevier Ltd. All rights reserved.

  11. Molecular characterization and intermolecular interaction of coat protein of Prunus necrotic ringspot virus: implications for virus assembly.

    PubMed

    Kulshrestha, Saurabh; Hallan, Vipin; Sharma, Anshul; Seth, Chandrika Attri; Chauhan, Anjali; Zaidi, Aijaz Asghar

    2013-09-01

    Coat protein (CP) and RNA3 from Prunus necrotic ringspot virus (PNRSV-rose), the most prevalent virus infecting rose in India, were characterized and regions in the coat protein important for self-interaction, during dimer formation were identified. The sequence analysis of CP and partial RNA 3 revealed that the rose isolate of PNRSV in India belongs to PV-32 group of PNRSV isolates. Apart from the already established group specific features of PV-32 group member's additional group-specific and host specific features were also identified. Presence of methionine at position 90 in the amino acid sequence alignment of PNRSV CP gene (belonging to PV-32 group) was identified as the specific conserved feature for the rose isolates of PNRSV. As protein-protein interaction plays a vital role in the infection process, an attempt was made to identify the portions of PNRSV CP responsible for self-interaction using yeast two-hybrid system. It was found (after analysis of the deletion clones) that the C-terminal region of PNRSV CP (amino acids 153-226) plays a vital role in this interaction during dimer formation. N-terminal of PNRSV CP is previously known to be involved in CP-RNA interactions, but our results also suggested that N-terminal of PNRSV CP represented by amino acids 1-77 also interacts with C-terminal (amino acids 153-226) in yeast two-hybrid system, suggesting its probable involvement in the CP-CP interaction.

  12. Principles for Predicting RNA Secondary Structure Design Difficulty.

    PubMed

    Anderson-Lee, Jeff; Fisker, Eli; Kosaraju, Vineet; Wu, Michelle; Kong, Justin; Lee, Jeehyung; Lee, Minjae; Zada, Mathew; Treuille, Adrien; Das, Rhiju

    2016-02-27

    Designing RNAs that form specific secondary structures is enabling better understanding and control of living systems through RNA-guided silencing, genome editing and protein organization. Little is known, however, about which RNA secondary structures might be tractable for downstream sequence design, increasing the time and expense of design efforts due to inefficient secondary structure choices. Here, we present insights into specific structural features that increase the difficulty of finding sequences that fold into a target RNA secondary structure, summarizing the design efforts of tens of thousands of human participants and three automated algorithms (RNAInverse, INFO-RNA and RNA-SSD) in the Eterna massive open laboratory. Subsequent tests through three independent RNA design algorithms (NUPACK, DSS-Opt and MODENA) confirmed the hypothesized importance of several features in determining design difficulty, including sequence length, mean stem length, symmetry and specific difficult-to-design motifs such as zigzags. Based on these results, we have compiled an Eterna100 benchmark of 100 secondary structure design challenges that span a large range in design difficulty to help test future efforts. Our in silico results suggest new routes for improving computational RNA design methods and for extending these insights to assess "designability" of single RNA structures, as well as of switches for in vitro and in vivo applications. Copyright © 2015 The Authors. Published by Elsevier Ltd.. All rights reserved.

  13. Predicting the binding preference of transcription factors to individual DNA k-mers.

    PubMed

    Alleyne, Trevis M; Peña-Castillo, Lourdes; Badis, Gwenael; Talukder, Shaheynoor; Berger, Michael F; Gehrke, Andrew R; Philippakis, Anthony A; Bulyk, Martha L; Morris, Quaid D; Hughes, Timothy R

    2009-04-15

    Recognition of specific DNA sequences is a central mechanism by which transcription factors (TFs) control gene expression. Many TF-binding preferences, however, are unknown or poorly characterized, in part due to the difficulty associated with determining their specificity experimentally, and an incomplete understanding of the mechanisms governing sequence specificity. New techniques that estimate the affinity of TFs to all possible k-mers provide a new opportunity to study DNA-protein interaction mechanisms, and may facilitate inference of binding preferences for members of a given TF family when such information is available for other family members. We employed a new dataset consisting of the relative preferences of mouse homeodomains for all eight-base DNA sequences in order to ask how well we can predict the binding profiles of homeodomains when only their protein sequences are given. We evaluated a panel of standard statistical inference techniques, as well as variations of the protein features considered. Nearest neighbour among functionally important residues emerged among the most effective methods. Our results underscore the complexity of TF-DNA recognition, and suggest a rational approach for future analyses of TF families.

  14. Integrated circuit layer image segmentation

    NASA Astrophysics Data System (ADS)

    Masalskis, Giedrius; Petrauskas, Romas

    2010-09-01

    In this paper we present IC layer image segmentation techniques which are specifically created for precise metal layer feature extraction. During our research we used many samples of real-life de-processed IC metal layer images which were obtained using optical light microscope. We have created sequence of various image processing filters which provides segmentation results of good enough precision for our application. Filter sequences were fine tuned to provide best possible results depending on properties of IC manufacturing process and imaging technology. Proposed IC image segmentation filter sequences were experimentally tested and compared with conventional direct segmentation algorithms.

  15. MSLICE Sequencing

    NASA Technical Reports Server (NTRS)

    Crockett, Thomas M.; Joswig, Joseph C.; Shams, Khawaja S.; Norris, Jeffrey S.; Morris, John R.

    2011-01-01

    MSLICE Sequencing is a graphical tool for writing sequences and integrating them into RML files, as well as for producing SCMF files for uplink. When operated in a testbed environment, it also supports uplinking these SCMF files to the testbed via Chill. This software features a free-form textural sequence editor featuring syntax coloring, automatic content assistance (including command and argument completion proposals), complete with types, value ranges, unites, and descriptions from the command dictionary that appear as they are typed. The sequence editor also has a "field mode" that allows tabbing between arguments and displays type/range/units/description for each argument as it is edited. Color-coded error and warning annotations on problematic tokens are included, as well as indications of problems that are not visible in the current scroll range. "Quick Fix" suggestions are made for resolving problems, and all the features afforded by modern source editors are also included such as copy/cut/paste, undo/redo, and a sophisticated find-and-replace system optionally using regular expressions. The software offers a full XML editor for RML files, which features syntax coloring, content assistance and problem annotations as above. There is a form-based, "detail view" that allows structured editing of command arguments and sequence parameters when preferred. The "project view" shows the user s "workspace" as a tree of "resources" (projects, folders, and files) that can subsequently be opened in editors by double-clicking. Files can be added, deleted, dragged-dropped/copied-pasted between folders or projects, and these operations are undoable and redoable. A "problems view" contains a tabular list of all problems in the current workspace. Double-clicking on any row in the table opens an editor for the appropriate sequence, scrolling to the specific line with the problem, and highlighting the problematic characters. From there, one can invoke "quick fix" as described above to resolve the issue. Once resolved, saving the file causes the problem to be removed from the problem view.

  16. GBshape: a genome browser database for DNA shape annotations

    PubMed Central

    Chiu, Tsu-Pei; Yang, Lin; Zhou, Tianyin; Main, Bradley J.; Parker, Stephen C.J.; Nuzhdin, Sergey V.; Tullius, Thomas D.; Rohs, Remo

    2015-01-01

    Many regulatory mechanisms require a high degree of specificity in protein-DNA binding. Nucleotide sequence does not provide an answer to the question of why a protein binds only to a small subset of the many putative binding sites in the genome that share the same core motif. Whereas higher-order effects, such as chromatin accessibility, cooperativity and cofactors, have been described, DNA shape recently gained attention as another feature that fine-tunes the DNA binding specificities of some transcription factor families. Our Genome Browser for DNA shape annotations (GBshape; freely available at http://rohslab.cmb.usc.edu/GBshape/) provides minor groove width, propeller twist, roll, helix twist and hydroxyl radical cleavage predictions for the entire genomes of 94 organisms. Additional genomes can easily be added using the GBshape framework. GBshape can be used to visualize DNA shape annotations qualitatively in a genome browser track format, and to download quantitative values of DNA shape features as a function of genomic position at nucleotide resolution. As biological applications, we illustrate the periodicity of DNA shape features that are present in nucleosome-occupied sequences from human, fly and worm, and we demonstrate structural similarities between transcription start sites in the genomes of four Drosophila species. PMID:25326329

  17. IDEPI: rapid prediction of HIV-1 antibody epitopes and other phenotypic features from sequence data using a flexible machine learning platform.

    PubMed

    Hepler, N Lance; Scheffler, Konrad; Weaver, Steven; Murrell, Ben; Richman, Douglas D; Burton, Dennis R; Poignard, Pascal; Smith, Davey M; Kosakovsky Pond, Sergei L

    2014-09-01

    Since its identification in 1983, HIV-1 has been the focus of a research effort unprecedented in scope and difficulty, whose ultimate goals--a cure and a vaccine--remain elusive. One of the fundamental challenges in accomplishing these goals is the tremendous genetic variability of the virus, with some genes differing at as many as 40% of nucleotide positions among circulating strains. Because of this, the genetic bases of many viral phenotypes, most notably the susceptibility to neutralization by a particular antibody, are difficult to identify computationally. Drawing upon open-source general-purpose machine learning algorithms and libraries, we have developed a software package IDEPI (IDentify EPItopes) for learning genotype-to-phenotype predictive models from sequences with known phenotypes. IDEPI can apply learned models to classify sequences of unknown phenotypes, and also identify specific sequence features which contribute to a particular phenotype. We demonstrate that IDEPI achieves performance similar to or better than that of previously published approaches on four well-studied problems: finding the epitopes of broadly neutralizing antibodies (bNab), determining coreceptor tropism of the virus, identifying compartment-specific genetic signatures of the virus, and deducing drug-resistance associated mutations. The cross-platform Python source code (released under the GPL 3.0 license), documentation, issue tracking, and a pre-configured virtual machine for IDEPI can be found at https://github.com/veg/idepi.

  18. Computational Prediction of Protein Epsilon Lysine Acetylation Sites Based on a Feature Selection Method.

    PubMed

    Gao, JianZhao; Tao, Xue-Wen; Zhao, Jia; Feng, Yuan-Ming; Cai, Yu-Dong; Zhang, Ning

    2017-01-01

    Lysine acetylation, as one type of post-translational modifications (PTM), plays key roles in cellular regulations and can be involved in a variety of human diseases. However, it is often high-cost and time-consuming to use traditional experimental approaches to identify the lysine acetylation sites. Therefore, effective computational methods should be developed to predict the acetylation sites. In this study, we developed a position-specific method for epsilon lysine acetylation site prediction. Sequences of acetylated proteins were retrieved from the UniProt database. Various kinds of features such as position specific scoring matrix (PSSM), amino acid factors (AAF), and disorders were incorporated. A feature selection method based on mRMR (Maximum Relevance Minimum Redundancy) and IFS (Incremental Feature Selection) was employed. Finally, 319 optimal features were selected from total 541 features. Using the 319 optimal features to encode peptides, a predictor was constructed based on dagging. As a result, an accuracy of 69.56% with MCC of 0.2792 was achieved. We analyzed the optimal features, which suggested some important factors determining the lysine acetylation sites. We developed a position-specific method for epsilon lysine acetylation site prediction. A set of optimal features was selected. Analysis of the optimal features provided insights into the mechanism of lysine acetylation sites, providing guidance of experimental validation. Copyright© Bentham Science Publishers; For any queries, please email at epub@benthamscience.org.

  19. Multiplex analysis of DNA

    DOEpatents

    Church, George M.; Kieffer-Higgins, Stephen

    1992-01-01

    This invention features vectors and a method for sequencing DNA. The method includes the steps of: a) ligating the DNA into a vector comprising a tag sequence, the tag sequence includes at least 15 bases, wherein the tag sequence will not hybridize to the DNA under stringent hybridization conditions and is unique in the vector, to form a hybrid vector, b) treating the hybrid vector in a plurality of vessels to produce fragments comprising the tag sequence, wherein the fragments differ in length and terminate at a fixed known base or bases, wherein the fixed known base or bases differs in each vessel, c) separating the fragments from each vessel according to their size, d) hybridizing the fragments with an oligonucleotide able to hybridize specifically with the tag sequence, and e) detecting the pattern of hybridization of the tag sequence, wherein the pattern reflects the nucleotide sequence of the DNA.

  20. Molecular evolution of miraculin-like proteins in soybean Kunitz super-family.

    PubMed

    Selvakumar, Purushotham; Gahloth, Deepankar; Tomar, Prabhat Pratap Singh; Sharma, Nidhi; Sharma, Ashwani Kumar

    2011-12-01

    Miraculin-like proteins (MLPs) belong to soybean Kunitz super-family and have been characterized from many plant families like Rutaceae, Solanaceae, Rubiaceae, etc. Many of them possess trypsin inhibitory activity and are involved in plant defense. MLPs exhibit significant sequence identity (~30-95%) to native miraculin protein, also belonging to Kunitz super-family compared with a typical Kunitz family member (~30%). The sequence and structure-function comparison of MLPs with that of a classical Kunitz inhibitor have demonstrated that MLPs have evolved to form a distinct group within Kunitz super-family. Sequence analysis of new genes along with available MLP sequences in the literature revealed three major groups for these proteins. A significant feature of Rutaceae MLP type 2 sequences is the presence of phosphorylation motif. Subtle changes are seen in putative reactive loop residues among different MLPs suggesting altered specificities to specific proteases. In phylogenetic analysis, Rutaceae MLP type 1 and type 2 proteins clustered together on separate branches, whereas native miraculin along with other MLPs formed distinct clusters. Site-specific positive Darwinian selection was observed at many sites in both the groups of Rutaceae MLP sequences with most of the residues undergoing positive selection located in loop regions. The results demonstrate the sequence and thereby the structure-function divergence of MLPs as a distinct group within soybean Kunitz super-family due to biotic and abiotic stresses of local environment.

  1. Groupwise registration of cardiac perfusion MRI sequences using normalized mutual information in high dimension

    NASA Astrophysics Data System (ADS)

    Hamrouni, Sameh; Rougon, Nicolas; Pr"teux, Françoise

    2011-03-01

    In perfusion MRI (p-MRI) exams, short-axis (SA) image sequences are captured at multiple slice levels along the long-axis of the heart during the transit of a vascular contrast agent (Gd-DTPA) through the cardiac chambers and muscle. Compensating cardio-thoracic motions is a requirement for enabling computer-aided quantitative assessment of myocardial ischaemia from contrast-enhanced p-MRI sequences. The classical paradigm consists of registering each sequence frame on a reference image using some intensity-based matching criterion. In this paper, we introduce a novel unsupervised method for the spatio-temporal groupwise registration of cardiac p-MRI exams based on normalized mutual information (NMI) between high-dimensional feature distributions. Here, local contrast enhancement curves are used as a dense set of spatio-temporal features, and statistically matched through variational optimization to a target feature distribution derived from a registered reference template. The hard issue of probability density estimation in high-dimensional state spaces is bypassed by using consistent geometric entropy estimators, allowing NMI to be computed directly from feature samples. Specifically, a computationally efficient kth-nearest neighbor (kNN) estimation framework is retained, leading to closed-form expressions for the gradient flow of NMI over finite- and infinite-dimensional motion spaces. This approach is applied to the groupwise alignment of cardiac p-MRI exams using a free-form Deformation (FFD) model for cardio-thoracic motions. Experiments on simulated and natural datasets suggest its accuracy and robustness for registering p-MRI exams comprising more than 30 frames.

  2. SPAR: small RNA-seq portal for analysis of sequencing experiments.

    PubMed

    Kuksa, Pavel P; Amlie-Wolf, Alexandre; Katanic, Živadin; Valladares, Otto; Wang, Li-San; Leung, Yuk Yee

    2018-05-04

    The introduction of new high-throughput small RNA sequencing protocols that generate large-scale genomics datasets along with increasing evidence of the significant regulatory roles of small non-coding RNAs (sncRNAs) have highlighted the urgent need for tools to analyze and interpret large amounts of small RNA sequencing data. However, it remains challenging to systematically and comprehensively discover and characterize sncRNA genes and specifically-processed sncRNA products from these datasets. To fill this gap, we present Small RNA-seq Portal for Analysis of sequencing expeRiments (SPAR), a user-friendly web server for interactive processing, analysis, annotation and visualization of small RNA sequencing data. SPAR supports sequencing data generated from various experimental protocols, including smRNA-seq, short total RNA sequencing, microRNA-seq, and single-cell small RNA-seq. Additionally, SPAR includes publicly available reference sncRNA datasets from our DASHR database and from ENCODE across 185 human tissues and cell types to produce highly informative small RNA annotations across all major small RNA types and other features such as co-localization with various genomic features, precursor transcript cleavage patterns, and conservation. SPAR allows the user to compare the input experiment against reference ENCODE/DASHR datasets. SPAR currently supports analyses of human (hg19, hg38) and mouse (mm10) sequencing data. SPAR is freely available at https://www.lisanwanglab.org/SPAR.

  3. Plant MicroRNA Prediction by Supervised Machine Learning Using C5.0 Decision Trees.

    PubMed

    Williams, Philip H; Eyles, Rod; Weiller, Georg

    2012-01-01

    MicroRNAs (miRNAs) are nonprotein coding RNAs between 20 and 22 nucleotides long that attenuate protein production. Different types of sequence data are being investigated for novel miRNAs, including genomic and transcriptomic sequences. A variety of machine learning methods have successfully predicted miRNA precursors, mature miRNAs, and other nonprotein coding sequences. MirTools, mirDeep2, and miRanalyzer require "read count" to be included with the input sequences, which restricts their use to deep-sequencing data. Our aim was to train a predictor using a cross-section of different species to accurately predict miRNAs outside the training set. We wanted a system that did not require read-count for prediction and could therefore be applied to short sequences extracted from genomic, EST, or RNA-seq sources. A miRNA-predictive decision-tree model has been developed by supervised machine learning. It only requires that the corresponding genome or transcriptome is available within a sequence window that includes the precursor candidate so that the required sequence features can be collected. Some of the most critical features for training the predictor are the miRNA:miRNA(∗) duplex energy and the number of mismatches in the duplex. We present a cross-species plant miRNA predictor with 84.08% sensitivity and 98.53% specificity based on rigorous testing by leave-one-out validation.

  4. Co-Conserved MAPK Features Couple D-Domain Docking Groove to Distal Allosteric Sites via the C-Terminal Flanking Tail

    PubMed Central

    Nguyen, Tuan; Ruan, Zheng; Oruganty, Krishnadev; Kannan, Natarajan

    2015-01-01

    Mitogen activated protein kinases (MAPKs) form a closely related family of kinases that control critical pathways associated with cell growth and survival. Although MAPKs have been extensively characterized at the biochemical, cellular, and structural level, an integrated evolutionary understanding of how MAPKs differ from other closely related protein kinases is currently lacking. Here, we perform statistical sequence comparisons of MAPKs and related protein kinases to identify sequence and structural features associated with MAPK functional divergence. We show, for the first time, that virtually all MAPK-distinguishing sequence features, including an unappreciated short insert segment in the β4-β5 loop, physically couple distal functional sites in the kinase domain to the D-domain peptide docking groove via the C-terminal flanking tail (C-tail). The coupling mediated by MAPK-specific residues confers an allosteric regulatory mechanism unique to MAPKs. In particular, the regulatory αC-helix conformation is controlled by a MAPK-conserved salt bridge interaction between an arginine in the αC-helix and an acidic residue in the C-tail. The salt-bridge interaction is modulated in unique ways in individual sub-families to achieve regulatory specificity. Our study is consistent with a model in which the C-tail co-evolved with the D-domain docking site to allosterically control MAPK activity. Our study provides testable mechanistic hypotheses for biochemical characterization of MAPK-conserved residues and new avenues for the design of allosteric MAPK inhibitors. PMID:25799139

  5. Denoising DNA deep sequencing data—high-throughput sequencing errors and their correction

    PubMed Central

    Laehnemann, David; Borkhardt, Arndt

    2016-01-01

    Characterizing the errors generated by common high-throughput sequencing platforms and telling true genetic variation from technical artefacts are two interdependent steps, essential to many analyses such as single nucleotide variant calling, haplotype inference, sequence assembly and evolutionary studies. Both random and systematic errors can show a specific occurrence profile for each of the six prominent sequencing platforms surveyed here: 454 pyrosequencing, Complete Genomics DNA nanoball sequencing, Illumina sequencing by synthesis, Ion Torrent semiconductor sequencing, Pacific Biosciences single-molecule real-time sequencing and Oxford Nanopore sequencing. There is a large variety of programs available for error removal in sequencing read data, which differ in the error models and statistical techniques they use, the features of the data they analyse, the parameters they determine from them and the data structures and algorithms they use. We highlight the assumptions they make and for which data types these hold, providing guidance which tools to consider for benchmarking with regard to the data properties. While no benchmarking results are included here, such specific benchmarks would greatly inform tool choices and future software development. The development of stand-alone error correctors, as well as single nucleotide variant and haplotype callers, could also benefit from using more of the knowledge about error profiles and from (re)combining ideas from the existing approaches presented here. PMID:26026159

  6. The role of DNA repair in herpesvirus pathogenesis.

    PubMed

    Brown, Jay C

    2014-10-01

    In cells latently infected with a herpesvirus, the viral DNA is present in the cell nucleus, but it is not extensively replicated or transcribed. In this suppressed state the virus DNA is vulnerable to mutagenic events that affect the host cell and have the potential to destroy the virus' genetic integrity. Despite the potential for genetic damage, however, herpesvirus sequences are well conserved after reactivation from latency. To account for this apparent paradox, I have tested the idea that host cell-encoded mechanisms of DNA repair are able to control genetic damage to latent herpesviruses. Studies were focused on homologous recombination-dependent DNA repair (HR). Methods of DNA sequence analysis were employed to scan herpesvirus genomes for DNA features able to activate HR. Analyses were carried out with a total of 39 herpesvirus DNA sequences, a group that included viruses from the alpha-, beta- and gamma-subfamilies. The results showed that all 39 genome sequences were enriched in two or more of the eight recombination-initiating features examined. The results were interpreted to indicate that HR can stabilize latent herpesvirus genomes. The results also showed, unexpectedly, that repair-initiating DNA features differed in alpha- compared to gamma-herpesviruses. Whereas inverted and tandem repeats predominated in alpha-herpesviruses, gamma-herpesviruses were enriched in short, GC-rich initiation sequences such as CCCAG and depleted in repeats. In alpha-herpesviruses, repair-initiating repeat sequences were found to be concentrated in a specific region (the S segment) of the genome while repair-initiating short sequences were distributed more uniformly in gamma-herpesviruses. The results suggest that repair pathways are activated differently in alpha- compared to gamma-herpesviruses. Copyright © 2014. Published by Elsevier Inc.

  7. Comprehensive definition of genome features in Spirodela polyrhiza by high-depth physical mapping and short-read DNA sequencing strategies.

    PubMed

    Michael, Todd P; Bryant, Douglas; Gutierrez, Ryan; Borisjuk, Nikolai; Chu, Philomena; Zhang, Hanzhong; Xia, Jing; Zhou, Junfei; Peng, Hai; El Baidouri, Moaine; Ten Hallers, Boudewijn; Hastie, Alex R; Liang, Tiffany; Acosta, Kenneth; Gilbert, Sarah; McEntee, Connor; Jackson, Scott A; Mockler, Todd C; Zhang, Weixiong; Lam, Eric

    2017-02-01

    Spirodela polyrhiza is a fast-growing aquatic monocot with highly reduced morphology, genome size and number of protein-coding genes. Considering these biological features of Spirodela and its basal position in the monocot lineage, understanding its genome architecture could shed light on plant adaptation and genome evolution. Like many draft genomes, however, the 158-Mb Spirodela genome sequence has not been resolved to chromosomes, and important genome characteristics have not been defined. Here we deployed rapid genome-wide physical maps combined with high-coverage short-read sequencing to resolve the 20 chromosomes of Spirodela and to empirically delineate its genome features. Our data revealed a dramatic reduction in the number of the rDNA repeat units in Spirodela to fewer than 100, which is even fewer than that reported for yeast. Consistent with its unique phylogenetic position, small RNA sequencing revealed 29 Spirodela-specific microRNA, with only two being shared with Elaeis guineensis (oil palm) and Musa balbisiana (banana). Combining DNA methylation data and small RNA sequencing enabled the accurate prediction of 20.5% long terminal repeats (LTRs) that doubled the previous estimate, and revealed a high Solo:Intact LTR ratio of 8.2. Interestingly, we found that Spirodela has the lowest global DNA methylation levels (9%) of any plant species tested. Taken together our results reveal a genome that has undergone reduction, likely through eliminating non-essential protein coding genes, rDNA and LTRs. In addition to delineating the genome features of this unique plant, the methodologies described and large-scale genome resources from this work will enable future evolutionary and functional studies of this basal monocot family. © 2016 The Authors The Plant Journal © 2016 John Wiley & Sons Ltd.

  8. The siRNA Non-seed Region and Its Target Sequences Are Auxiliary Determinants of Off-Target Effects.

    PubMed

    Kamola, Piotr J; Nakano, Yuko; Takahashi, Tomoko; Wilson, Paul A; Ui-Tei, Kumiko

    2015-12-01

    RNA interference (RNAi) is a powerful tool for post-transcriptional gene silencing. However, the siRNA guide strand may bind unintended off-target transcripts via partial sequence complementarity by a mechanism closely mirroring micro RNA (miRNA) silencing. To better understand these off-target effects, we investigated the correlation between sequence features within various subsections of siRNA guide strands, and its corresponding target sequences, with off-target activities. Our results confirm previous reports that strength of base-pairing in the siRNA seed region is the primary factor determining the efficiency of off-target silencing. However, the degree of downregulation of off-target transcripts with shared seed sequence is not necessarily similar, suggesting that there are additional auxiliary factors that influence the silencing potential. Here, we demonstrate that both the melting temperature (Tm) in a subsection of siRNA non-seed region, and the GC contents of its corresponding target sequences, are negatively correlated with the efficiency of off-target effect. Analysis of experimentally validated miRNA targets demonstrated a similar trend, indicating a putative conserved mechanistic feature of seed region-dependent targeting mechanism. These observations may prove useful as parameters for off-target prediction algorithms and improve siRNA 'specificity' design rules.

  9. GuiTope: an application for mapping random-sequence peptides to protein sequences.

    PubMed

    Halperin, Rebecca F; Stafford, Phillip; Emery, Jack S; Navalkar, Krupa Arun; Johnston, Stephen Albert

    2012-01-03

    Random-sequence peptide libraries are a commonly used tool to identify novel ligands for binding antibodies, other proteins, and small molecules. It is often of interest to compare the selected peptide sequences to the natural protein binding partners to infer the exact binding site or the importance of particular residues. The ability to search a set of sequences for similarity to a set of peptides may sometimes enable the prediction of an antibody epitope or a novel binding partner. We have developed a software application designed specifically for this task. GuiTope provides a graphical user interface for aligning peptide sequences to protein sequences. All alignment parameters are accessible to the user including the ability to specify the amino acid frequency in the peptide library; these frequencies often differ significantly from those assumed by popular alignment programs. It also includes a novel feature to align di-peptide inversions, which we have found improves the accuracy of antibody epitope prediction from peptide microarray data and shows utility in analyzing phage display datasets. Finally, GuiTope can randomly select peptides from a given library to estimate a null distribution of scores and calculate statistical significance. GuiTope provides a convenient method for comparing selected peptide sequences to protein sequences, including flexible alignment parameters, novel alignment features, ability to search a database, and statistical significance of results. The software is available as an executable (for PC) at http://www.immunosignature.com/software and ongoing updates and source code will be available at sourceforge.net.

  10. PAM multiplicity marks genomic target sites as inhibitory to CRISPR-Cas9 editing.

    PubMed

    Malina, Abba; Cameron, Christopher J F; Robert, Francis; Blanchette, Mathieu; Dostie, Josée; Pelletier, Jerry

    2015-12-08

    In CRISPR-Cas9 genome editing, the underlying principles for selecting guide RNA (gRNA) sequences that would ensure for efficient target site modification remain poorly understood. Here we show that target sites harbouring multiple protospacer adjacent motifs (PAMs) are refractory to Cas9-mediated repair in situ. Thus we refine which substrates should be avoided in gRNA design, implicating PAM density as a novel sequence-specific feature that inhibits in vivo Cas9-driven DNA modification.

  11. PAM multiplicity marks genomic target sites as inhibitory to CRISPR-Cas9 editing

    PubMed Central

    Malina, Abba; Cameron, Christopher J. F.; Robert, Francis; Blanchette, Mathieu; Dostie, Josée; Pelletier, Jerry

    2015-01-01

    In CRISPR-Cas9 genome editing, the underlying principles for selecting guide RNA (gRNA) sequences that would ensure for efficient target site modification remain poorly understood. Here we show that target sites harbouring multiple protospacer adjacent motifs (PAMs) are refractory to Cas9-mediated repair in situ. Thus we refine which substrates should be avoided in gRNA design, implicating PAM density as a novel sequence-specific feature that inhibits in vivo Cas9-driven DNA modification. PMID:26644285

  12. Systematic evaluation of the impact of ChIP-seq read designs on genome coverage, peak identification, and allele-specific binding detection.

    PubMed

    Zhang, Qi; Zeng, Xin; Younkin, Sam; Kawli, Trupti; Snyder, Michael P; Keleş, Sündüz

    2016-02-24

    Chromatin immunoprecipitation followed by sequencing (ChIP-seq) experiments revolutionized genome-wide profiling of transcription factors and histone modifications. Although maturing sequencing technologies allow these experiments to be carried out with short (36-50 bps), long (75-100 bps), single-end, or paired-end reads, the impact of these read parameters on the downstream data analysis are not well understood. In this paper, we evaluate the effects of different read parameters on genome sequence alignment, coverage of different classes of genomic features, peak identification, and allele-specific binding detection. We generated 101 bps paired-end ChIP-seq data for many transcription factors from human GM12878 and MCF7 cell lines. Systematic evaluations using in silico variations of these data as well as fully simulated data, revealed complex interplay between the sequencing parameters and analysis tools, and indicated clear advantages of paired-end designs in several aspects such as alignment accuracy, peak resolution, and most notably, allele-specific binding detection. Our work elucidates the effect of design on the downstream analysis and provides insights to investigators in deciding sequencing parameters in ChIP-seq experiments. We present the first systematic evaluation of the impact of ChIP-seq designs on allele-specific binding detection and highlights the power of pair-end designs in such studies.

  13. Specific binding of a HeLa cell nuclear protein to RNA sequences in the human immunodeficiency virus transactivating region.

    PubMed Central

    Gaynor, R; Soultanakis, E; Kuwabara, M; Garcia, J; Sigman, D S

    1989-01-01

    The transactivator protein, tat, encoded by the human immunodeficiency virus is a key regulator of viral transcription. Activation by the tat protein requires sequences downstream of the transcription initiation site called the transactivating region (TAR). RNA derived from the TAR is capable of forming a stable stem-loop structure and the maintenance of both the stem structure and the loop sequences located between +19 and +44 is required for complete in vivo activation by tat. Gel retardation assays with RNA from both wild-type and mutant TAR constructs generated in vitro with SP6 polymerase indicated specific binding of HeLa nuclear proteins to the TAR. To characterize this RNA-protein interaction, a method of chemical "imprinting" has been developed using photoactivated uranyl acetate as the nucleolytic agent. This reagent nicks RNA under physiological conditions at all four nucleotides in a reaction that is independent of sequence and secondary structure. Specific interaction of cellular proteins with TAR RNA could be detected by enhanced cleavages or imprints surrounding the loop region. Mutations that either disrupted stem base-pairing or extensively changed the primary sequence resulted in alterations in the cleavage pattern of the TAR RNA. Structural features of the TAR RNA stem-loop essential for tat activation are also required for specific binding of the HeLa cell nuclear protein. Images PMID:2544877

  14. Protein structure based prediction of catalytic residues.

    PubMed

    Fajardo, J Eduardo; Fiser, Andras

    2013-02-22

    Worldwide structural genomics projects continue to release new protein structures at an unprecedented pace, so far nearly 6000, but only about 60% of these proteins have any sort of functional annotation. We explored a range of features that can be used for the prediction of functional residues given a known three-dimensional structure. These features include various centrality measures of nodes in graphs of interacting residues: closeness, betweenness and page-rank centrality. We also analyzed the distance of functional amino acids to the general center of mass (GCM) of the structure, relative solvent accessibility (RSA), and the use of relative entropy as a measure of sequence conservation. From the selected features, neural networks were trained to identify catalytic residues. We found that using distance to the GCM together with amino acid type provide a good discriminant function, when combined independently with sequence conservation. Using an independent test set of 29 annotated protein structures, the method returned 411 of the initial 9262 residues as the most likely to be involved in function. The output 411 residues contain 70 of the annotated 111 catalytic residues. This represents an approximately 14-fold enrichment of catalytic residues on the entire input set (corresponding to a sensitivity of 63% and a precision of 17%), a performance competitive with that of other state-of-the-art methods. We found that several of the graph based measures utilize the same underlying feature of protein structures, which can be simply and more effectively captured with the distance to GCM definition. This also has the added the advantage of simplicity and easy implementation. Meanwhile sequence conservation remains by far the most influential feature in identifying functional residues. We also found that due the rapid changes in size and composition of sequence databases, conservation calculations must be recalibrated for specific reference databases.

  15. Structural features based genome-wide characterization and prediction of nucleosome organization

    PubMed Central

    2012-01-01

    Background Nucleosome distribution along chromatin dictates genomic DNA accessibility and thus profoundly influences gene expression. However, the underlying mechanism of nucleosome formation remains elusive. Here, taking a structural perspective, we systematically explored nucleosome formation potential of genomic sequences and the effect on chromatin organization and gene expression in S. cerevisiae. Results We analyzed twelve structural features related to flexibility, curvature and energy of DNA sequences. The results showed that some structural features such as DNA denaturation, DNA-bending stiffness, Stacking energy, Z-DNA, Propeller twist and free energy, were highly correlated with in vitro and in vivo nucleosome occupancy. Specifically, they can be classified into two classes, one positively and the other negatively correlated with nucleosome occupancy. These two kinds of structural features facilitated nucleosome binding in centromere regions and repressed nucleosome formation in the promoter regions of protein-coding genes to mediate transcriptional regulation. Based on these analyses, we integrated all twelve structural features in a model to predict more accurately nucleosome occupancy in vivo than the existing methods that mainly depend on sequence compositional features. Furthermore, we developed a novel approach, named DLaNe, that located nucleosomes by detecting peaks of structural profiles, and built a meta predictor to integrate information from different structural features. As a comparison, we also constructed a hidden Markov model (HMM) to locate nucleosomes based on the profiles of these structural features. The result showed that the meta DLaNe and HMM-based method performed better than the existing methods, demonstrating the power of these structural features in predicting nucleosome positions. Conclusions Our analysis revealed that DNA structures significantly contribute to nucleosome organization and influence chromatin structure and gene expression regulation. The results indicated that our proposed methods are effective in predicting nucleosome occupancy and positions and that these structural features are highly predictive of nucleosome organization. The implementation of our DLaNe method based on structural features is available online. PMID:22449207

  16. 3D brain tumor segmentation in multimodal MR images based on learning population- and patient-specific feature sets.

    PubMed

    Jiang, Jun; Wu, Yao; Huang, Meiyan; Yang, Wei; Chen, Wufan; Feng, Qianjin

    2013-01-01

    Brain tumor segmentation is a clinical requirement for brain tumor diagnosis and radiotherapy planning. Automating this process is a challenging task due to the high diversity in appearance of tumor tissue among different patients and the ambiguous boundaries of lesions. In this paper, we propose a method to construct a graph by learning the population- and patient-specific feature sets of multimodal magnetic resonance (MR) images and by utilizing the graph-cut to achieve a final segmentation. The probabilities of each pixel that belongs to the foreground (tumor) and the background are estimated by global and custom classifiers that are trained through learning population- and patient-specific feature sets, respectively. The proposed method is evaluated using 23 glioma image sequences, and the segmentation results are compared with other approaches. The encouraging evaluation results obtained, i.e., DSC (84.5%), Jaccard (74.1%), sensitivity (87.2%), and specificity (83.1%), show that the proposed method can effectively make use of both population- and patient-specific information. Crown Copyright © 2013. Published by Elsevier Ltd. All rights reserved.

  17. An ensemble approach for large-scale identification of protein-protein interactions using the alignments of multiple sequences

    PubMed Central

    Wang, Lei; You, Zhu-Hong; Chen, Xing; Li, Jian-Qiang; Yan, Xin; Zhang, Wei; Huang, Yu-An

    2017-01-01

    Protein–Protein Interactions (PPI) is not only the critical component of various biological processes in cells, but also the key to understand the mechanisms leading to healthy and diseased states in organisms. However, it is time-consuming and cost-intensive to identify the interactions among proteins using biological experiments. Hence, how to develop a more efficient computational method rapidly became an attractive topic in the post-genomic era. In this paper, we propose a novel method for inference of protein-protein interactions from protein amino acids sequences only. Specifically, protein amino acids sequence is firstly transformed into Position-Specific Scoring Matrix (PSSM) generated by multiple sequences alignments; then the Pseudo PSSM is used to extract feature descriptors. Finally, ensemble Rotation Forest (RF) learning system is trained to predict and recognize PPIs based solely on protein sequence feature. When performed the proposed method on the three benchmark data sets (Yeast, H. pylori, and independent dataset) for predicting PPIs, our method can achieve good average accuracies of 98.38%, 89.75%, and 96.25%, respectively. In order to further evaluate the prediction performance, we also compare the proposed method with other methods using same benchmark data sets. The experiment results demonstrate that the proposed method consistently outperforms other state-of-the-art method. Therefore, our method is effective and robust and can be taken as a useful tool in exploring and discovering new relationships between proteins. A web server is made publicly available at the URL http://202.119.201.126:8888/PsePSSM/ for academic use. PMID:28029645

  18. Plastome Sequence Determination and Comparative Analysis for Members of the Lolium-Festuca Grass Species Complex

    PubMed Central

    Hand, Melanie L.; Spangenberg, German C.; Forster, John W.; Cogan, Noel O. I.

    2013-01-01

    Chloroplast genome sequences are of broad significance in plant biology, due to frequent use in molecular phylogenetics, comparative genomics, population genetics, and genetic modification studies. The present study used a second-generation sequencing approach to determine and assemble the plastid genomes (plastomes) of four representatives from the agriculturally important Lolium-Festuca species complex of pasture grasses (Lolium multiflorum, Festuca pratensis, Festuca altissima, and Festuca ovina). Total cellular DNA was extracted from either roots or leaves, was sequenced, and the output was filtered for plastome-related reads. A comparison between sources revealed fewer plastome-related reads from root-derived template but an increase in incidental bacterium-derived sequences. Plastome assembly and annotation indicated high levels of sequence identity and a conserved organization and gene content between species. However, frequent deletions within the F. ovina plastome appeared to contribute to a smaller plastid genome size. Comparative analysis with complete plastome sequences from other members of the Poaceae confirmed conservation of most grass-specific features. Detailed analysis of the rbcL–psaI intergenic region, however, revealed a “hot-spot” of variation characterized by independent deletion events. The evolutionary implications of this observation are discussed. The complete plastome sequences are anticipated to provide the basis for potential organelle-specific genetic modification of pasture grasses. PMID:23550121

  19. Poly(A) code analyses reveal key determinants for tissue-specific mRNA alternative polyadenylation

    PubMed Central

    Weng, Lingjie; Li, Yi; Xie, Xiaohui; Shi, Yongsheng

    2016-01-01

    mRNA alternative polyadenylation (APA) is a critical mechanism for post-transcriptional gene regulation and is often regulated in a tissue- and/or developmental stage-specific manner. An ultimate goal for the APA field has been to be able to computationally predict APA profiles under different physiological or pathological conditions. As a first step toward this goal, we have assembled a poly(A) code for predicting tissue-specific poly(A) sites (PASs). Based on a compendium of over 600 features that have known or potential roles in PAS selection, we have generated and refined a machine-learning algorithm using multiple high-throughput sequencing-based data sets of tissue-specific and constitutive PASs. This code can predict tissue-specific PASs with >85% accuracy. Importantly, by analyzing the prediction performance based on different RNA features, we found that PAS context, including the distance between alternative PASs and the relative position of a PAS within the gene, is a key feature for determining the susceptibility of a PAS to tissue-specific regulation. Our poly(A) code provides a useful tool for not only predicting tissue-specific APA regulation, but also for studying its underlying molecular mechanisms. PMID:27095026

  20. RNA Editing in Plant Mitochondria

    NASA Astrophysics Data System (ADS)

    Hiesel, Rudolf; Wissinger, Bernd; Schuster, Wolfgang; Brennicke, Axel

    1989-12-01

    Comparative sequence analysis of genomic and complementary DNA clones from several mitochondrial genes in the higher plant Oenothera revealed nucleotide sequence divergences between the genomic and the messenger RNA-derived sequences. These sequence alterations could be most easily explained by specific post-transcriptional nucleotide modifications. Most of the nucleotide exchanges in coding regions lead to altered codons in the mRNA that specify amino acids better conserved in evolution than those encoded by the genomic DNA. Several instances show that the genomic arginine codon CGG is edited in the mRNA to the tryptophan codon TGG in amino acid positions that are highly conserved as tryptophan in the homologous proteins of other species. This editing suggests that the standard genetic code is used in plant mitochondria and resolves the frequent coincidence of CGG codons and tryptophan in different plant species. The apparently frequent and non-species-specific equivalency of CGG and TGG codons in particular suggests that RNA editing is a common feature of all higher plant mitochondria.

  1. Electrophoretic mobility shift scanning using an automated infrared DNA sequencer.

    PubMed

    Sano, M; Ohyama, A; Takase, K; Yamamoto, M; Machida, M

    2001-11-01

    Electrophoretic mobility shift assay (EMSA) is widely used in the study of sequence-specific DNA-binding proteins, including transcription factors and mismatch binding proteins. We have established a non-radioisotope-based protocol for EMSA that features an automated DNA sequencer with an infrared fluorescent dye (IRDye) detection unit. Our modification of the elec- trophoresis unit, which includes cooling the gel plates with a reduced well-to-read length, has made it possible to detect shifted bands within 1 h. Further, we have developed a rapid ligation-based method for generating IRDye-labeled probes with an approximately 60% cost reduction. This method has the advantages of real-time scanning, stability of labeled probes, and better safety associated with nonradioactive methods of detection. Analysis of a promoter from an industrially important filamentous fungus, Aspergillus oryzae, in a prototype experiment revealed that the method we describe has potential for use in systematic scanning and identification of the functionally important elements to which cellular factors bind in a sequence-specific manner.

  2. Transfer of genetic therapy across human populations: molecular targets for increasing patient coverage in repeat expansion diseases

    PubMed Central

    Varela, Miguel A; Curtis, Helen J; Douglas, Andrew GL; Hammond, Suzan M; O'Loughlin, Aisling J; Sobrido, Maria J; Scholefield, Janine; Wood, Matthew JA

    2016-01-01

    Allele-specific gene therapy aims to silence expression of mutant alleles through targeting of disease-linked single-nucleotide polymorphisms (SNPs). However, SNP linkage to disease varies between populations, making such molecular therapies applicable only to a subset of patients. Moreover, not all SNPs have the molecular features necessary for potent gene silencing. Here we provide knowledge to allow the maximisation of patient coverage by building a comprehensive understanding of SNPs ranked according to their predicted suitability toward allele-specific silencing in 14 repeat expansion diseases: amyotrophic lateral sclerosis and frontotemporal dementia, dentatorubral-pallidoluysian atrophy, myotonic dystrophy 1, myotonic dystrophy 2, Huntington's disease and several spinocerebellar ataxias. Our systematic analysis of DNA sequence variation shows that most annotated SNPs are not suitable for potent allele-specific silencing across populations because of suboptimal sequence features and low variability (>97% in HD). We suggest maximising patient coverage by selecting SNPs with high heterozygosity across populations, and preferentially targeting SNPs that lead to purine:purine mismatches in wild-type alleles to obtain potent allele-specific silencing. We therefore provide fundamental knowledge on strategies for optimising patient coverage of therapeutics for microsatellite expansion disorders by linking analysis of population genetic variation to the selection of molecular targets. PMID:25990798

  3. Transfer of genetic therapy across human populations: molecular targets for increasing patient coverage in repeat expansion diseases.

    PubMed

    Varela, Miguel A; Curtis, Helen J; Douglas, Andrew G L; Hammond, Suzan M; O'Loughlin, Aisling J; Sobrido, Maria J; Scholefield, Janine; Wood, Matthew J A

    2016-02-01

    Allele-specific gene therapy aims to silence expression of mutant alleles through targeting of disease-linked single-nucleotide polymorphisms (SNPs). However, SNP linkage to disease varies between populations, making such molecular therapies applicable only to a subset of patients. Moreover, not all SNPs have the molecular features necessary for potent gene silencing. Here we provide knowledge to allow the maximisation of patient coverage by building a comprehensive understanding of SNPs ranked according to their predicted suitability toward allele-specific silencing in 14 repeat expansion diseases: amyotrophic lateral sclerosis and frontotemporal dementia, dentatorubral-pallidoluysian atrophy, myotonic dystrophy 1, myotonic dystrophy 2, Huntington's disease and several spinocerebellar ataxias. Our systematic analysis of DNA sequence variation shows that most annotated SNPs are not suitable for potent allele-specific silencing across populations because of suboptimal sequence features and low variability (>97% in HD). We suggest maximising patient coverage by selecting SNPs with high heterozygosity across populations, and preferentially targeting SNPs that lead to purine:purine mismatches in wild-type alleles to obtain potent allele-specific silencing. We therefore provide fundamental knowledge on strategies for optimising patient coverage of therapeutics for microsatellite expansion disorders by linking analysis of population genetic variation to the selection of molecular targets.

  4. Structural and sequence features of two residue turns in beta-hairpins.

    PubMed

    Madan, Bharat; Seo, Sung Yong; Lee, Sun-Gu

    2014-09-01

    Beta-turns in beta-hairpins have been implicated as important sites in protein folding. In particular, two residue β-turns, the most abundant connecting elements in beta-hairpins, have been a major target for engineering protein stability and folding. In this study, we attempted to investigate and update the structural and sequence properties of two residue turns in beta-hairpins with a large data set. For this, 3977 beta-turns were extracted from 2394 nonhomologous protein chains and analyzed. First, the distribution, dihedral angles and twists of two residue turn types were determined, and compared with previous data. The trend of turn type occurrence and most structural features of the turn types were similar to previous results, but for the first time Type II turns in beta-hairpins were identified. Second, sequence motifs for the turn types were devised based on amino acid positional potentials of two-residue turns, and their distributions were examined. From this study, we could identify code-like sequence motifs for the two residue beta-turn types. Finally, structural and sequence properties of beta-strands in the beta-hairpins were analyzed, which revealed that the beta-strands showed no specific sequence and structural patterns for turn types. The analytical results in this study are expected to be a reference in the engineering or design of beta-hairpin turn structures and sequences. © 2014 Wiley Periodicals, Inc.

  5. Within-genome evolution of REPINs: a new family of miniature mobile DNA in bacteria.

    PubMed

    Bertels, Frederic; Rainey, Paul B

    2011-06-01

    Repetitive sequences are a conserved feature of many bacterial genomes. While first reported almost thirty years ago, and frequently exploited for genotyping purposes, little is known about their origin, maintenance, or processes affecting the dynamics of within-genome evolution. Here, beginning with analysis of the diversity and abundance of short oligonucleotide sequences in the genome of Pseudomonas fluorescens SBW25, we show that over-represented short sequences define three distinct groups (GI, GII, and GIII) of repetitive extragenic palindromic (REP) sequences. Patterns of REP distribution suggest that closely linked REP sequences form a functional replicative unit: REP doublets are over-represented, randomly distributed in extragenic space, and more highly conserved than singlets. In addition, doublets are organized as inverted repeats, which together with intervening spacer sequences are predicted to form hairpin structures in ssDNA or mRNA. We refer to these newly defined entities as REPINs (REP doublets forming hairpins) and identify short reads from population sequencing that reveal putative transposition intermediates. The proximal relationship between GI, GII, and GIII REPINs and specific REP-associated tyrosine transposases (RAYTs), combined with features of the putative transposition intermediate, suggests a mechanism for within-genome dissemination. Analysis of the distribution of REPs in a range of RAYT-containing bacterial genomes, including Escherichia coli K-12 and Nostoc punctiforme, show that REPINs are a widely distributed, but hitherto unrecognized, family of miniature non-autonomous mobile DNA.

  6. Discovery radiomics via evolutionary deep radiomic sequencer discovery for pathologically proven lung cancer detection.

    PubMed

    Shafiee, Mohammad Javad; Chung, Audrey G; Khalvati, Farzad; Haider, Masoom A; Wong, Alexander

    2017-10-01

    While lung cancer is the second most diagnosed form of cancer in men and women, a sufficiently early diagnosis can be pivotal in patient survival rates. Imaging-based, or radiomics-driven, detection methods have been developed to aid diagnosticians, but largely rely on hand-crafted features that may not fully encapsulate the differences between cancerous and healthy tissue. Recently, the concept of discovery radiomics was introduced, where custom abstract features are discovered from readily available imaging data. We propose an evolutionary deep radiomic sequencer discovery approach based on evolutionary deep intelligence. Motivated by patient privacy concerns and the idea of operational artificial intelligence, the evolutionary deep radiomic sequencer discovery approach organically evolves increasingly more efficient deep radiomic sequencers that produce significantly more compact yet similarly descriptive radiomic sequences over multiple generations. As a result, this framework improves operational efficiency and enables diagnosis to be run locally at the radiologist's computer while maintaining detection accuracy. We evaluated the evolved deep radiomic sequencer (EDRS) discovered via the proposed evolutionary deep radiomic sequencer discovery framework against state-of-the-art radiomics-driven and discovery radiomics methods using clinical lung CT data with pathologically proven diagnostic data from the LIDC-IDRI dataset. The EDRS shows improved sensitivity (93.42%), specificity (82.39%), and diagnostic accuracy (88.78%) relative to previous radiomics approaches.

  7. MusiteDeep: a deep-learning framework for general and kinase-specific phosphorylation site prediction.

    PubMed

    Wang, Duolin; Zeng, Shuai; Xu, Chunhui; Qiu, Wangren; Liang, Yanchun; Joshi, Trupti; Xu, Dong

    2017-12-15

    Computational methods for phosphorylation site prediction play important roles in protein function studies and experimental design. Most existing methods are based on feature extraction, which may result in incomplete or biased features. Deep learning as the cutting-edge machine learning method has the ability to automatically discover complex representations of phosphorylation patterns from the raw sequences, and hence it provides a powerful tool for improvement of phosphorylation site prediction. We present MusiteDeep, the first deep-learning framework for predicting general and kinase-specific phosphorylation sites. MusiteDeep takes raw sequence data as input and uses convolutional neural networks with a novel two-dimensional attention mechanism. It achieves over a 50% relative improvement in the area under the precision-recall curve in general phosphorylation site prediction and obtains competitive results in kinase-specific prediction compared to other well-known tools on the benchmark data. MusiteDeep is provided as an open-source tool available at https://github.com/duolinwang/MusiteDeep. xudong@missouri.edu. Supplementary data are available at Bioinformatics online. © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com

  8. Formation of a functional maize centromere after loss of centromeric sequences and gain of ectopic sequences.

    PubMed

    Zhang, Bing; Lv, Zhenling; Pang, Junling; Liu, Yalin; Guo, Xiang; Fu, Shulan; Li, Jun; Dong, Qianhua; Wu, Hua-Jun; Gao, Zhi; Wang, Xiu-Jie; Han, Fangpu

    2013-06-01

    The maize (Zea mays) B centromere is composed of B centromere-specific repeats (ZmBs), centromere-specific satellite repeats (CentC), and centromeric retrotransposons of maize (CRM). Here we describe a newly formed B centromere in maize, which has lost CentC sequences and has dramatically reduced CRM and ZmBs sequences, but still retains the molecular features of functional centromeres, such as CENH3, H2A phosphorylation at Thr-133, H3 phosphorylation at Ser-10, and Thr-3 immunostaining signals. This new centromere is stable and can be transmitted to offspring through meiosis. Anti-CENH3 chromatin immunoprecipitation sequencing revealed that a 723-kb region from the short arm of chromosome 9 (9S) was involved in the formation of the new centromere. The 723-kb region, which is gene poor and enriched for transposons, contains two abundant DNA motifs. Genes in the new centromere region are still transcribed. The original 723-kb region showed a higher DNA methylation level compared with native centromeres but was not significantly changed when it was involved in new centromere formation. Our results indicate that functional centromeres may be formed without the known centromere-specific sequences, yet the maintenance of a high DNA methylation level seems to be crucial for the proper function of a new centromere.

  9. Formation of a Functional Maize Centromere after Loss of Centromeric Sequences and Gain of Ectopic Sequences[C][W

    PubMed Central

    Zhang, Bing; Lv, Zhenling; Pang, Junling; Liu, Yalin; Guo, Xiang; Fu, Shulan; Li, Jun; Dong, Qianhua; Wu, Hua-Jun; Gao, Zhi; Wang, Xiu-Jie; Han, Fangpu

    2013-01-01

    The maize (Zea mays) B centromere is composed of B centromere–specific repeats (ZmBs), centromere-specific satellite repeats (CentC), and centromeric retrotransposons of maize (CRM). Here we describe a newly formed B centromere in maize, which has lost CentC sequences and has dramatically reduced CRM and ZmBs sequences, but still retains the molecular features of functional centromeres, such as CENH3, H2A phosphorylation at Thr-133, H3 phosphorylation at Ser-10, and Thr-3 immunostaining signals. This new centromere is stable and can be transmitted to offspring through meiosis. Anti-CENH3 chromatin immunoprecipitation sequencing revealed that a 723-kb region from the short arm of chromosome 9 (9S) was involved in the formation of the new centromere. The 723-kb region, which is gene poor and enriched for transposons, contains two abundant DNA motifs. Genes in the new centromere region are still transcribed. The original 723-kb region showed a higher DNA methylation level compared with native centromeres but was not significantly changed when it was involved in new centromere formation. Our results indicate that functional centromeres may be formed without the known centromere-specific sequences, yet the maintenance of a high DNA methylation level seems to be crucial for the proper function of a new centromere. PMID:23771890

  10. A survey of tools for variant analysis of next-generation genome sequencing data

    PubMed Central

    Pabinger, Stephan; Dander, Andreas; Fischer, Maria; Snajder, Rene; Sperk, Michael; Efremova, Mirjana; Krabichler, Birgit; Speicher, Michael R.; Zschocke, Johannes

    2014-01-01

    Recent advances in genome sequencing technologies provide unprecedented opportunities to characterize individual genomic landscapes and identify mutations relevant for diagnosis and therapy. Specifically, whole-exome sequencing using next-generation sequencing (NGS) technologies is gaining popularity in the human genetics community due to the moderate costs, manageable data amounts and straightforward interpretation of analysis results. While whole-exome and, in the near future, whole-genome sequencing are becoming commodities, data analysis still poses significant challenges and led to the development of a plethora of tools supporting specific parts of the analysis workflow or providing a complete solution. Here, we surveyed 205 tools for whole-genome/whole-exome sequencing data analysis supporting five distinct analytical steps: quality assessment, alignment, variant identification, variant annotation and visualization. We report an overview of the functionality, features and specific requirements of the individual tools. We then selected 32 programs for variant identification, variant annotation and visualization, which were subjected to hands-on evaluation using four data sets: one set of exome data from two patients with a rare disease for testing identification of germline mutations, two cancer data sets for testing variant callers for somatic mutations, copy number variations and structural variations, and one semi-synthetic data set for testing identification of copy number variations. Our comprehensive survey and evaluation of NGS tools provides a valuable guideline for human geneticists working on Mendelian disorders, complex diseases and cancers. PMID:23341494

  11. COME: a robust coding potential calculation tool for lncRNA identification and characterization based on multiple features.

    PubMed

    Hu, Long; Xu, Zhiyu; Hu, Boqin; Lu, Zhi John

    2017-01-09

    Recent genomic studies suggest that novel long non-coding RNAs (lncRNAs) are specifically expressed and far outnumber annotated lncRNA sequences. To identify and characterize novel lncRNAs in RNA sequencing data from new samples, we have developed COME, a coding potential calculation tool based on multiple features. It integrates multiple sequence-derived and experiment-based features using a decompose-compose method, which makes it more accurate and robust than other well-known tools. We also showed that COME was able to substantially improve the consistency of predication results from other coding potential calculators. Moreover, COME annotates and characterizes each predicted lncRNA transcript with multiple lines of supporting evidence, which are not provided by other tools. Remarkably, we found that one subgroup of lncRNAs classified by such supporting features (i.e. conserved local RNA secondary structure) was highly enriched in a well-validated database (lncRNAdb). We further found that the conserved structural domains on lncRNAs had better chance than other RNA regions to interact with RNA binding proteins, based on the recent eCLIP-seq data in human, indicating their potential regulatory roles. Overall, we present COME as an accurate, robust and multiple-feature supported method for the identification and characterization of novel lncRNAs. The software implementation is available at https://github.com/lulab/COME. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.

  12. Cellulose synthase 'class specific regions' are intrinsically disordered and functionally undifferentiated.

    PubMed

    Scavuzzo-Duggan, Tess R; Chaves, Arielle M; Singh, Abhishek; Sethaphong, Latsavongsakda; Slabaugh, Erin; Yingling, Yaroslava G; Haigler, Candace H; Roberts, Alison W

    2018-06-01

    Cellulose synthases (CESAs) are glycosyltransferases that catalyze formation of cellulose microfibrils in plant cell walls. Seed plant CESA isoforms cluster in six phylogenetic clades, whose non-interchangeable members play distinct roles within cellulose synthesis complexes (CSCs). A 'class specific region' (CSR), with higher sequence similarity within versus between functional CESA classes, has been suggested to contribute to specific activities or interactions of different isoforms. We investigated CESA isoform specificity in the moss, Physcomitrella patens (Hedw.) B. S. G. to gain evolutionary insights into CESA structure/function relationships. Like seed plants, P. patens has oligomeric rosette-type CSCs, but the PpCESAs diverged independently and form a separate CESA clade. We showed that P. patens has two functionally distinct CESAs classes, based on the ability to complement the gametophore-negative phenotype of a ppcesa5 knockout line. Thus, non-interchangeable CESA classes evolved separately in mosses and seed plants. However, testing of chimeric moss CESA genes for complementation demonstrated that functional class-specificity is not determined by the CSR. Sequence analysis and computational modeling showed that the CSR is intrinsically disordered and contains predicted molecular recognition features, consistent with a possible role in CESA oligomerization and explaining the evolution of class-specific sequences without selection for class-specific function. © 2018 Institute of Botany, Chinese Academy of Sciences.

  13. Low X/Y divergence in four pairs of papaya sex-linked genes.

    PubMed

    Yu, Qingyi; Hou, Shaobin; Feltus, F Alex; Jones, Meghan R; Murray, Jan E; Veatch, Olivia; Lemke, Cornelia; Saw, Jimmy H; Moore, Richard C; Thimmapuram, Jyothi; Liu, Lei; Moore, Paul H; Alam, Maqsudul; Jiang, Jiming; Paterson, Andrew H; Ming, Ray

    2008-01-01

    Sex chromosomes in flowering plants, in contrast to those in animals, evolved relatively recently and only a few are heteromorphic. The homomorphic sex chromosomes of papaya show features of incipient sex chromosome evolution. We investigated the features of paired X- and Y-specific bacterial artificial chromosomes (BACs), and estimated the time of divergence in four pairs of sex-linked genes. We report the results of a comparative analysis of long contiguous genomic DNA sequences between the X and hermaphrodite Y (Y(h)) chromosomes. Numerous chromosomal rearrangements were detected in the male-specific region of the Y chromosome (MSY), including inversions, deletions, insertions, duplications and translocations, showing the dynamic evolutionary process on the MSY after recombination ceased. DNA sequence expansion was documented in the two regions of the MSY, demonstrating that the cytologically homomorphic sex chromosomes are heteromorphic at the molecular level. Analysis of sequence divergence between four X and Y(h) gene pairs resulted in a estimated age of divergence of between 0.5 and 2.2 million years, supporting a recent origin of the papaya sex chromosomes. Our findings indicate that sex chromosomes did not evolve at the family level in Caricaceae, and reinforce the theory that sex chromosomes evolve at the species level in some lineages.

  14. Sequential strand displacement beacon for detection of DNA coverage on functionalized gold nanoparticles.

    PubMed

    Paliwoda, Rebecca E; Li, Feng; Reid, Michael S; Lin, Yanwen; Le, X Chris

    2014-06-17

    Functionalizing nanomaterials for diverse analytical, biomedical, and therapeutic applications requires determination of surface coverage (or density) of DNA on nanomaterials. We describe a sequential strand displacement beacon assay that is able to quantify specific DNA sequences conjugated or coconjugated onto gold nanoparticles (AuNPs). Unlike the conventional fluorescence assay that requires the target DNA to be fluorescently labeled, the sequential strand displacement beacon method is able to quantify multiple unlabeled DNA oligonucleotides using a single (universal) strand displacement beacon. This unique feature is achieved by introducing two short unlabeled DNA probes for each specific DNA sequence and by performing sequential DNA strand displacement reactions. Varying the relative amounts of the specific DNA sequences and spacing DNA sequences during their coconjugation onto AuNPs results in different densities of the specific DNA on AuNP, ranging from 90 to 230 DNA molecules per AuNP. Results obtained from our sequential strand displacement beacon assay are consistent with those obtained from the conventional fluorescence assays. However, labeling of DNA with some fluorescent dyes, e.g., tetramethylrhodamine, alters DNA density on AuNP. The strand displacement strategy overcomes this problem by obviating direct labeling of the target DNA. This method has broad potential to facilitate more efficient design and characterization of novel multifunctional materials for diverse applications.

  15. SD-MSAEs: Promoter recognition in human genome based on deep feature extraction.

    PubMed

    Xu, Wenxuan; Zhang, Li; Lu, Yaping

    2016-06-01

    The prediction and recognition of promoter in human genome play an important role in DNA sequence analysis. Entropy, in Shannon sense, of information theory is a multiple utility in bioinformatic details analysis. The relative entropy estimator methods based on statistical divergence (SD) are used to extract meaningful features to distinguish different regions of DNA sequences. In this paper, we choose context feature and use a set of methods of SD to select the most effective n-mers distinguishing promoter regions from other DNA regions in human genome. Extracted from the total possible combinations of n-mers, we can get four sparse distributions based on promoter and non-promoters training samples. The informative n-mers are selected by optimizing the differentiating extents of these distributions. Specially, we combine the advantage of statistical divergence and multiple sparse auto-encoders (MSAEs) in deep learning to extract deep feature for promoter recognition. And then we apply multiple SVMs and a decision model to construct a human promoter recognition method called SD-MSAEs. Framework is flexible that it can integrate new feature extraction or new classification models freely. Experimental results show that our method has high sensitivity and specificity. Copyright © 2016 Elsevier Inc. All rights reserved.

  16. Influence of time and length size feature selections for human activity sequences recognition.

    PubMed

    Fang, Hongqing; Chen, Long; Srinivasan, Raghavendiran

    2014-01-01

    In this paper, Viterbi algorithm based on a hidden Markov model is applied to recognize activity sequences from observed sensors events. Alternative features selections of time feature values of sensors events and activity length size feature values are tested, respectively, and then the results of activity sequences recognition performances of Viterbi algorithm are evaluated. The results show that the selection of larger time feature values of sensor events and/or smaller activity length size feature values will generate relatively better results on the activity sequences recognition performances. © 2013 ISA Published by ISA All rights reserved.

  17. Sequence, Structure, and Context Preferences of Human RNA Binding Proteins.

    PubMed

    Dominguez, Daniel; Freese, Peter; Alexis, Maria S; Su, Amanda; Hochman, Myles; Palden, Tsultrim; Bazile, Cassandra; Lambert, Nicole J; Van Nostrand, Eric L; Pratt, Gabriel A; Yeo, Gene W; Graveley, Brenton R; Burge, Christopher B

    2018-06-07

    RNA binding proteins (RBPs) orchestrate the production, processing, and function of mRNAs. Here, we present the affinity landscapes of 78 human RBPs using an unbiased assay that determines the sequence, structure, and context preferences of these proteins in vitro by deep sequencing of bound RNAs. These data enable construction of "RNA maps" of RBP activity without requiring crosslinking-based assays. We found an unexpectedly low diversity of RNA motifs, implying frequent convergence of binding specificity toward a relatively small set of RNA motifs, many with low compositional complexity. Offsetting this trend, however, we observed extensive preferences for contextual features distinct from short linear RNA motifs, including spaced "bipartite" motifs, biased flanking nucleotide composition, and bias away from or toward RNA structure. Our results emphasize the importance of contextual features in RNA recognition, which likely enable targeting of distinct subsets of transcripts by different RBPs that recognize the same linear motif. Copyright © 2018 The Authors. Published by Elsevier Inc. All rights reserved.

  18. Mining for class-specific motifs in protein sequence classification

    PubMed Central

    2013-01-01

    Background In protein sequence classification, identification of the sequence motifs or n-grams that can precisely discriminate between classes is a more interesting scientific question than the classification itself. A number of classification methods aim at accurate classification but fail to explain which sequence features indeed contribute to the accuracy. We hypothesize that sequences in lower denominations (n-grams) can be used to explore the sequence landscape and to identify class-specific motifs that discriminate between classes during classification. Discriminative n-grams are short peptide sequences that are highly frequent in one class but are either minimally present or absent in other classes. In this study, we present a new substitution-based scoring function for identifying discriminative n-grams that are highly specific to a class. Results We present a scoring function based on discriminative n-grams that can effectively discriminate between classes. The scoring function, initially, harvests the entire set of 4- to 8-grams from the protein sequences of different classes in the dataset. Similar n-grams of the same size are combined to form new n-grams, where the similarity is defined by positive amino acid substitution scores in the BLOSUM62 matrix. Substitution has resulted in a large increase in the number of discriminatory n-grams harvested. Due to the unbalanced nature of the dataset, the frequencies of the n-grams are normalized using a dampening factor, which gives more weightage to the n-grams that appear in fewer classes and vice-versa. After the n-grams are normalized, the scoring function identifies discriminative 4- to 8-grams for each class that are frequent enough to be above a selection threshold. By mapping these discriminative n-grams back to the protein sequences, we obtained contiguous n-grams that represent short class-specific motifs in protein sequences. Our method fared well compared to an existing motif finding method known as Wordspy. We have validated our enriched set of class-specific motifs against the functionally important motifs obtained from the NLSdb, Prosite and ELM databases. We demonstrate that this method is very generic; thus can be widely applied to detect class-specific motifs in many protein sequence classification tasks. Conclusion The proposed scoring function and methodology is able to identify class-specific motifs using discriminative n-grams derived from the protein sequences. The implementation of amino acid substitution scores for similarity detection, and the dampening factor to normalize the unbalanced datasets have significant effect on the performance of the scoring function. Our multipronged validation tests demonstrate that this method can detect class-specific motifs from a wide variety of protein sequence classes with a potential application to detecting proteome-specific motifs of different organisms. PMID:23496846

  19. Genome sequence of Methanobacterium congolense strain Buetzberg, a hydrogenotrophic, methanogenic archaeon, isolated from a mesophilic industrial-scale biogas plant utilizing bio-waste.

    PubMed

    Tejerizo, Gonzalo Torres; Kim, Yong Sung; Maus, Irena; Wibberg, Daniel; Winkler, Anika; Off, Sandra; Pühler, Alfred; Scherer, Paul; Schlüter, Andreas

    2017-04-10

    Methanogenic Archaea are of importance at the end of the anaerobic digestion (AD) chain for biomass conversion. They finally produce methane, the end-product of AD. Among this group of microorganisms, members of the genus Methanobacterium are ubiquitously present in anaerobic habitats, such as bioreactors. The genome of a novel methanogenic archaeon, namely Methanobacterium congolense Buetzberg, originally isolated from a mesophilic biogas plant, was completely sequenced to analyze putative adaptive genome features conferring competitiveness of this isolate within the biogas reactor environment. Sequencing and assembly of the M. congolense Buetzberg genome yielded a chromosome with a size of 2,451,457bp and a mean GC-content of 38.51%. Additionally, a plasmid with a size of 18,118bp, featuring a GC content of 36.05% was identified. The M. congolense Buetzberg plasmid showed no sequence similarities with the plasmids described previously suggesting that it represents a new plasmid type. Analysis of the M. congolense Buetzberg chromosome architecture revealed a high collinearity with the Methanobacterium paludis chromosome. Furthermore, annotation of the genome and functional predictions disclosed several genes involved in cell wall and membrane biogenesis. Compilation of specific genes among Methanobacterium strains originating from AD environments revealed 474 genetic determinants that could be crucial for adaptation of these strains to specific conditions prevailing in AD habitats. Copyright © 2017 Elsevier B.V. All rights reserved.

  20. 2000 Year-old ancient equids: an ancient-DNA lesson from pompeii remains.

    PubMed

    Di Bernardo, Giovanni; Del Gaudio, Stefania; Galderisi, Umberto; Cipollaro, Marilena

    2004-11-15

    Ancient DNA extracted from 2000 year-old equine bones was examined in order to amplify mitochondrial and nuclear DNA fragments. A specific equine satellite-type sequence representing 3.7%-11% of the entire equine genome, proved to be a suitable target to address the question of the presence of aDNA in ancient bones. The PCR strategy designed to investigate this specific target also allowed us to calculate the molecular weight of amplifiable DNA fragments. Sequencing of a 370 bp DNA fragment of mitochondrial control region allowed the comparison of ancient DNA sequences with those of modern horses to assess their genetic relationship. The 16S rRNA mitochondrial gene was also examined to unravel the post-mortem base modification feature and to test the status of Pompeian equids taxon on the basis of a Mae III restriction site polymorphism. Copyright 2004 Wiley-Liss, Inc.

  1. Simultaneous activation of parallel sensory pathways promotes a grooming sequence in Drosophila

    PubMed Central

    Hampel, Stefanie; McKellar, Claire E

    2017-01-01

    A central model that describes how behavioral sequences are produced features a neural architecture that readies different movements simultaneously, and a mechanism where prioritized suppression between the movements determines their sequential performance. We previously described a model whereby suppression drives a Drosophila grooming sequence that is induced by simultaneous activation of different sensory pathways that each elicit a distinct movement (Seeds et al., 2014). Here, we confirm this model using transgenic expression to identify and optogenetically activate sensory neurons that elicit specific grooming movements. Simultaneous activation of different sensory pathways elicits a grooming sequence that resembles the naturally induced sequence. Moreover, the sequence proceeds after the sensory excitation is terminated, indicating that a persistent trace of this excitation induces the next grooming movement once the previous one is performed. This reveals a mechanism whereby parallel sensory inputs can be integrated and stored to elicit a delayed and sequential grooming response. PMID:28887878

  2. Identifying 5-methylcytosine sites in RNA sequence using composite encoding feature into Chou's PseKNC.

    PubMed

    Sabooh, M Fazli; Iqbal, Nadeem; Khan, Mukhtaj; Khan, Muslim; Maqbool, H F

    2018-05-01

    This study examines accurate and efficient computational method for identification of 5-methylcytosine sites in RNA modification. The occurrence of 5-methylcytosine (m 5 C) plays a vital role in a number of biological processes. For better comprehension of the biological functions and mechanism it is necessary to recognize m 5 C sites in RNA precisely. The laboratory techniques and procedures are available to identify m 5 C sites in RNA, but these procedures require a lot of time and resources. This study develops a new computational method for extracting the features of RNA sequence. In this method, first the RNA sequence is encoded via composite feature vector, then, for the selection of discriminate features, the minimum-redundancy-maximum-relevance algorithm was used. Secondly, the classification method used has been based on a support vector machine by using jackknife cross validation test. The suggested method efficiently identifies m 5 C sites from non- m 5 C sites and the outcome of the suggested algorithm is 93.33% with sensitivity of 90.0 and specificity of 96.66 on bench mark datasets. The result exhibits that proposed algorithm shown significant identification performance compared to the existing computational techniques. This study extends the knowledge about the occurrence sites of RNA modification which paves the way for better comprehension of the biological uses and mechanism. Copyright © 2018 Elsevier Ltd. All rights reserved.

  3. GBshape: a genome browser database for DNA shape annotations.

    PubMed

    Chiu, Tsu-Pei; Yang, Lin; Zhou, Tianyin; Main, Bradley J; Parker, Stephen C J; Nuzhdin, Sergey V; Tullius, Thomas D; Rohs, Remo

    2015-01-01

    Many regulatory mechanisms require a high degree of specificity in protein-DNA binding. Nucleotide sequence does not provide an answer to the question of why a protein binds only to a small subset of the many putative binding sites in the genome that share the same core motif. Whereas higher-order effects, such as chromatin accessibility, cooperativity and cofactors, have been described, DNA shape recently gained attention as another feature that fine-tunes the DNA binding specificities of some transcription factor families. Our Genome Browser for DNA shape annotations (GBshape; freely available at http://rohslab.cmb.usc.edu/GBshape/) provides minor groove width, propeller twist, roll, helix twist and hydroxyl radical cleavage predictions for the entire genomes of 94 organisms. Additional genomes can easily be added using the GBshape framework. GBshape can be used to visualize DNA shape annotations qualitatively in a genome browser track format, and to download quantitative values of DNA shape features as a function of genomic position at nucleotide resolution. As biological applications, we illustrate the periodicity of DNA shape features that are present in nucleosome-occupied sequences from human, fly and worm, and we demonstrate structural similarities between transcription start sites in the genomes of four Drosophila species. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.

  4. Sequence-dependent DNA deformability studied using molecular dynamics simulations.

    PubMed

    Fujii, Satoshi; Kono, Hidetoshi; Takenaka, Shigeori; Go, Nobuhiro; Sarai, Akinori

    2007-01-01

    Proteins recognize specific DNA sequences not only through direct contact between amino acids and bases, but also indirectly based on the sequence-dependent conformation and deformability of the DNA (indirect readout). We used molecular dynamics simulations to analyze the sequence-dependent DNA conformations of all 136 possible tetrameric sequences sandwiched between CGCG sequences. The deformability of dimeric steps obtained by the simulations is consistent with that by the crystal structures. The simulation results further showed that the conformation and deformability of the tetramers can highly depend on the flanking base pairs. The conformations of xATx tetramers show the most rigidity and are not affected by the flanking base pairs and the xYRx show by contrast the greatest flexibility and change their conformations depending on the base pairs at both ends, suggesting tetramers with the same central dimer can show different deformabilities. These results suggest that analysis of dimeric steps alone may overlook some conformational features of DNA and provide insight into the mechanism of indirect readout during protein-DNA recognition. Moreover, the sequence dependence of DNA conformation and deformability may be used to estimate the contribution of indirect readout to the specificity of protein-DNA recognition as well as nucleosome positioning and large-scale behavior of nucleic acids.

  5. Processing Dynamic Image Sequences from a Moving Sensor.

    DTIC Science & Technology

    1984-02-01

    65 Roadsign Image Sequence ..... ................ ... 70 Roadsign Sequence with Redundant Features .. ........ . 79 Roadsign Subimage...Selected Feature Error Values .. ........ 66 2c. Industrial Image Selected Feature Local Search Values. .. .... 67 3ab. Roadsign Image Error Values...72 3c. Roadsign Image Local Search Values ............. 73 4ab. Roadsign Redundant Feature Error Values. ............ 8 4c. Roadsign

  6. An RNAi in silico approach to find an optimal shRNA cocktail against HIV-1

    PubMed Central

    2010-01-01

    Background HIV-1 can be inhibited by RNA interference in vitro through the expression of short hairpin RNAs (shRNAs) that target conserved genome sequences. In silico shRNA design for HIV has lacked a detailed study of virus variability constituting a possible breaking point in a clinical setting. We designed shRNAs against HIV-1 considering the variability observed in naïve and drug-resistant isolates available at public databases. Methods A Bioperl-based algorithm was developed to automatically scan multiple sequence alignments of HIV, while evaluating the possibility of identifying dominant and subdominant viral variants that could be used as efficient silencing molecules. Student t-test and Bonferroni Dunn correction test were used to assess statistical significance of our findings. Results Our in silico approach identified the most common viral variants within highly conserved genome regions, with a calculated free energy of ≥ -6.6 kcal/mol. This is crucial for strand loading to RISC complex and for a predicted silencing efficiency score, which could be used in combination for achieving over 90% silencing. Resistant and naïve isolate variability revealed that the most frequent shRNA per region targets a maximum of 85% of viral sequences. Adding more divergent sequences maintained this percentage. Specific sequence features that have been found to be related with higher silencing efficiency were hardly accomplished in conserved regions, even when lower entropy values correlated with better scores. We identified a conserved region among most HIV-1 genomes, which meets as many sequence features for efficient silencing. Conclusions HIV-1 variability is an obstacle to achieving absolute silencing using shRNAs designed against a consensus sequence, mainly because there are many functional viral variants. Our shRNA cocktail could be truly effective at silencing dominant and subdominant naïve viral variants. Additionally, resistant isolates might be targeted under specific antiretroviral selective pressure, but in both cases these should be tested exhaustively prior to clinical use. PMID:21172023

  7. MACSIMS : multiple alignment of complete sequences information management system

    PubMed Central

    Thompson, Julie D; Muller, Arnaud; Waterhouse, Andrew; Procter, Jim; Barton, Geoffrey J; Plewniak, Frédéric; Poch, Olivier

    2006-01-01

    Background In the post-genomic era, systems-level studies are being performed that seek to explain complex biological systems by integrating diverse resources from fields such as genomics, proteomics or transcriptomics. New information management systems are now needed for the collection, validation and analysis of the vast amount of heterogeneous data available. Multiple alignments of complete sequences provide an ideal environment for the integration of this information in the context of the protein family. Results MACSIMS is a multiple alignment-based information management program that combines the advantages of both knowledge-based and ab initio sequence analysis methods. Structural and functional information is retrieved automatically from the public databases. In the multiple alignment, homologous regions are identified and the retrieved data is evaluated and propagated from known to unknown sequences with these reliable regions. In a large-scale evaluation, the specificity of the propagated sequence features is estimated to be >99%, i.e. very few false positive predictions are made. MACSIMS is then used to characterise mutations in a test set of 100 proteins that are known to be involved in human genetic diseases. The number of sequence features associated with these proteins was increased by 60%, compared to the features available in the public databases. An XML format output file allows automatic parsing of the MACSIM results, while a graphical display using the JalView program allows manual analysis. Conclusion MACSIMS is a new information management system that incorporates detailed analyses of protein families at the structural, functional and evolutionary levels. MACSIMS thus provides a unique environment that facilitates knowledge extraction and the presentation of the most pertinent information to the biologist. A web server and the source code are available at . PMID:16792820

  8. LoopX: A Graphical User Interface-Based Database for Comprehensive Analysis and Comparative Evaluation of Loops from Protein Structures.

    PubMed

    Kadumuri, Rajashekar Varma; Vadrevu, Ramakrishna

    2017-10-01

    Due to their crucial role in function, folding, and stability, protein loops are being targeted for grafting/designing to create novel or alter existing functionality and improve stability and foldability. With a view to facilitate a thorough analysis and effectual search options for extracting and comparing loops for sequence and structural compatibility, we developed, LoopX a comprehensively compiled library of sequence and conformational features of ∼700,000 loops from protein structures. The database equipped with a graphical user interface is empowered with diverse query tools and search algorithms, with various rendering options to visualize the sequence- and structural-level information along with hydrogen bonding patterns, backbone φ, ψ dihedral angles of both the target and candidate loops. Two new features (i) conservation of the polar/nonpolar environment and (ii) conservation of sequence and conformation of specific residues within the loops have also been incorporated in the search and retrieval of compatible loops for a chosen target loop. Thus, the LoopX server not only serves as a database and visualization tool for sequence and structural analysis of protein loops but also aids in extracting and comparing candidate loops for a given target loop based on user-defined search options.

  9. Position-dependent effects of locked nucleic acid (LNA) on DNA sequencing and PCR primers

    PubMed Central

    Levin, Joshua D.; Fiala, Dean; Samala, Meinrado F.; Kahn, Jason D.; Peterson, Raymond J.

    2006-01-01

    Genomes are becoming heavily annotated with important features. Analysis of these features often employs oligonucleotides that hybridize at defined locations. When the defined location lies in a poor sequence context, traditional design strategies may fail. Locked Nucleic Acid (LNA) can enhance oligonucleotide affinity and specificity. Though LNA has been used in many applications, formal design rules are still being defined. To further this effort we have investigated the effect of LNA on the performance of sequencing and PCR primers in AT-rich regions, where short primers yield poor sequencing reads or PCR yields. LNA was used in three positional patterns: near the 5′ end (LNA-5′), near the 3′ end (LNA-3′) and distributed throughout (LNA-Even). Quantitative measures of sequencing read length (Phred Q30 count) and real-time PCR signal (cycle threshold, CT) were characterized using two-way ANOVA. LNA-5′ increased the average Phred Q30 score by 60% and it was never observed to decrease performance. LNA-5′ generated cycle thresholds in quantitative PCR that were comparable to high-yielding conventional primers. In contrast, LNA-3′ and LNA-Even did not improve read lengths or CT. ANOVA demonstrated the statistical significance of these results and identified significant interaction between the positional design rule and primer sequence. PMID:17071964

  10. TANGLE: Two-Level Support Vector Regression Approach for Protein Backbone Torsion Angle Prediction from Primary Sequences

    PubMed Central

    Song, Jiangning; Tan, Hao; Wang, Mingjun; Webb, Geoffrey I.; Akutsu, Tatsuya

    2012-01-01

    Protein backbone torsion angles (Phi) and (Psi) involve two rotation angles rotating around the Cα-N bond (Phi) and the Cα-C bond (Psi). Due to the planarity of the linked rigid peptide bonds, these two angles can essentially determine the backbone geometry of proteins. Accordingly, the accurate prediction of protein backbone torsion angle from sequence information can assist the prediction of protein structures. In this study, we develop a new approach called TANGLE (Torsion ANGLE predictor) to predict the protein backbone torsion angles from amino acid sequences. TANGLE uses a two-level support vector regression approach to perform real-value torsion angle prediction using a variety of features derived from amino acid sequences, including the evolutionary profiles in the form of position-specific scoring matrices, predicted secondary structure, solvent accessibility and natively disordered region as well as other global sequence features. When evaluated based on a large benchmark dataset of 1,526 non-homologous proteins, the mean absolute errors (MAEs) of the Phi and Psi angle prediction are 27.8° and 44.6°, respectively, which are 1% and 3% respectively lower than that using one of the state-of-the-art prediction tools ANGLOR. Moreover, the prediction of TANGLE is significantly better than a random predictor that was built on the amino acid-specific basis, with the p-value<1.46e-147 and 7.97e-150, respectively by the Wilcoxon signed rank test. As a complementary approach to the current torsion angle prediction algorithms, TANGLE should prove useful in predicting protein structural properties and assisting protein fold recognition by applying the predicted torsion angles as useful restraints. TANGLE is freely accessible at http://sunflower.kuicr.kyoto-u.ac.jp/~sjn/TANGLE/. PMID:22319565

  11. Mental Ability and Mismatch Negativity: Pre-Attentive Discrimination of Abstract Feature Conjunctions in Auditory Sequences

    ERIC Educational Resources Information Center

    Houlihan, Michael; Stelmack, Robert M.

    2012-01-01

    The relation between mental ability and the ability to detect violations of an abstract, third-order conjunction rule was examined using event-related potential measures, specifically mismatch negativity (MMN). The primary objective was to determine whether the extraction of invariant relations based on abstract conjunctions between two…

  12. A Typology for Finite Groups

    ERIC Educational Resources Information Center

    Tou, Erik R

    2013-01-01

    This project classifies groups of small order using a group's center as the key feature. Groups of a given order "n" are typed based on the order of each group's center. Students are led through a sequence of exercises that combine proof-writing, independent research, and an analysis of specific classes of finite groups…

  13. Identification of functional features of synthetic SINEUPs, antisense lncRNAs that specifically enhance protein translation

    PubMed Central

    Kozhuharova, Ana; Sharma, Harshita; Ohyama, Takako; Fasolo, Francesca; Yamazaki, Toshio; Cotella, Diego; Santoro, Claudio; Zucchelli, Silvia; Gustincich, Stefano; Carninci, Piero

    2018-01-01

    SINEUPs are antisense long noncoding RNAs, in which an embedded SINE B2 element UP-regulates translation of partially overlapping target sense mRNAs. SINEUPs contain two functional domains. First, the binding domain (BD) is located in the region antisense to the target, providing specific targeting to the overlapping mRNA. Second, the inverted SINE B2 represents the effector domain (ED) and enhances translation. To adapt SINEUP technology to a broader number of targets, we took advantage of a high-throughput, semi-automated imaging system to optimize synthetic SINEUP BD and ED design in HEK293T cell lines. Using SINEUP-GFP as a model SINEUP, we extensively screened variants of the BD to map features needed for optimal design. We found that most active SINEUPs overlap an AUG-Kozak sequence. Moreover, we report our screening of the inverted SINE B2 sequence to identify active sub-domains and map the length of the minimal active ED. Our synthetic SINEUP-GFP screening of both BDs and EDs constitutes a broad test with flexible applications to any target gene of interest. PMID:29414979

  14. Improved detection of DNA-binding proteins via compression technology on PSSM information.

    PubMed

    Wang, Yubo; Ding, Yijie; Guo, Fei; Wei, Leyi; Tang, Jijun

    2017-01-01

    Since the importance of DNA-binding proteins in multiple biomolecular functions has been recognized, an increasing number of researchers are attempting to identify DNA-binding proteins. In recent years, the machine learning methods have become more and more compelling in the case of protein sequence data soaring, because of their favorable speed and accuracy. In this paper, we extract three features from the protein sequence, namely NMBAC (Normalized Moreau-Broto Autocorrelation), PSSM-DWT (Position-specific scoring matrix-Discrete Wavelet Transform), and PSSM-DCT (Position-specific scoring matrix-Discrete Cosine Transform). We also employ feature selection algorithm on these feature vectors. Then, these features are fed into the training SVM (support vector machine) model as classifier to predict DNA-binding proteins. Our method applys three datasets, namely PDB1075, PDB594 and PDB186, to evaluate the performance of our approach. The PDB1075 and PDB594 datasets are employed for Jackknife test and the PDB186 dataset is used for the independent test. Our method achieves the best accuracy in the Jacknife test, from 79.20% to 86.23% and 80.5% to 86.20% on PDB1075 and PDB594 datasets, respectively. In the independent test, the accuracy of our method comes to 76.3%. The performance of independent test also shows that our method has a certain ability to be effectively used for DNA-binding protein prediction. The data and source code are at https://doi.org/10.6084/m9.figshare.5104084.

  15. Formin homology 2 domains occur in multiple contexts in angiosperms

    PubMed Central

    Cvrčková, Fatima; Novotný, Marian; Pícková, Denisa; Žárský, Viktor

    2004-01-01

    Background Involvement of conservative molecular modules and cellular mechanisms in the widely diversified processes of eukaryotic cell morphogenesis leads to the intriguing question: how do similar proteins contribute to dissimilar morphogenetic outputs. Formins (FH2 proteins) play a central part in the control of actin organization and dynamics, providing a good example of evolutionarily versatile use of a conserved protein domain in the context of a variety of lineage-specific structural and signalling interactions. Results In order to identify possible plant-specific sequence features within the FH2 protein family, we performed a detailed analysis of angiosperm formin-related sequences available in public databases, with particular focus on the complete Arabidopsis genome and the nearly finished rice genome sequence. This has led to revision of the current annotation of half of the 22 Arabidopsis formin-related genes. Comparative analysis of the two plant genomes revealed a good conservation of the previously described two subfamilies of plant formins (Class I and Class II), as well as several subfamilies within them that appear to predate the separation of monocot and dicot plants. Moreover, a number of plant Class II formins share an additional conserved domain, related to the protein phosphatase/tensin/auxilin fold. However, considerable inter-species variability sets limits to generalization of any functional conclusions reached on a single species such as Arabidopsis. Conclusions The plant-specific domain context of the conserved FH2 domain, as well as plant-specific features of the domain itself, may reflect distinct functional requirements in plant cells. The variability of formin structures found in plants far exceeds that known from both fungi and metazoans, suggesting a possible contribution of FH2 proteins in the evolution of the plant type of multicellularity. PMID:15256004

  16. A novel monoclonal Perkinsus chesapeaki in vitro isolate from an Australian cockle, Anadara trapezia.

    PubMed

    Reece, Kimberly S; Scott, Gail P; Dang, Cécile; Dungan, Christopher F

    2017-09-01

    A monoclonal Perkinsus chesapeaki isolate was established from 1 of 10 infected Australian Anadara trapezia cockles. Morphological features were similar to those of described P. chesapeaki isolates, and also included a unique vermiform schizont cell-type. Perkinsus olseni-specific PCR primers amplified DNAs from all 10 cockles. Perkinsus chesapeaki-specific primers also amplified DNAs from 4/10 cockles, including DNA from the isolate source cockle. Three different sets of DNA sequences from the monoclonal isolate grouped with the homologous, previously deposited, P. chesapeaki sequences in phylogenetic analyses. In situ hybridization assays detected both P. chesapeaki and P. olseni cells in histological sections from the source cockle for monoclonal isolate ATCC PRA-425. Copyright © 2017 Elsevier Inc. All rights reserved.

  17. Within-Genome Evolution of REPINs: a New Family of Miniature Mobile DNA in Bacteria

    PubMed Central

    Bertels, Frederic; Rainey, Paul B.

    2011-01-01

    Repetitive sequences are a conserved feature of many bacterial genomes. While first reported almost thirty years ago, and frequently exploited for genotyping purposes, little is known about their origin, maintenance, or processes affecting the dynamics of within-genome evolution. Here, beginning with analysis of the diversity and abundance of short oligonucleotide sequences in the genome of Pseudomonas fluorescens SBW25, we show that over-represented short sequences define three distinct groups (GI, GII, and GIII) of repetitive extragenic palindromic (REP) sequences. Patterns of REP distribution suggest that closely linked REP sequences form a functional replicative unit: REP doublets are over-represented, randomly distributed in extragenic space, and more highly conserved than singlets. In addition, doublets are organized as inverted repeats, which together with intervening spacer sequences are predicted to form hairpin structures in ssDNA or mRNA. We refer to these newly defined entities as REPINs (REP doublets forming hairpins) and identify short reads from population sequencing that reveal putative transposition intermediates. The proximal relationship between GI, GII, and GIII REPINs and specific REP-associated tyrosine transposases (RAYTs), combined with features of the putative transposition intermediate, suggests a mechanism for within-genome dissemination. Analysis of the distribution of REPs in a range of RAYT–containing bacterial genomes, including Escherichia coli K-12 and Nostoc punctiforme, show that REPINs are a widely distributed, but hitherto unrecognized, family of miniature non-autonomous mobile DNA. PMID:21698139

  18. Draft genome sequence of Halomonas lutea strain YIM 91125 T (DSM 23508 T) isolated from the alkaline Lake Ebinur in Northwest China

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Gao, Xiao-Yang; Zhi, Xiao-Yang; Li, Hong-Wei

    Species of the genus Halomonas are halophilic and their flexible adaption to changes of salinity and temperature brings considerable potential biotechnology applications, such as degradation of organic pollutants and enzyme production. The type strain Halomonas lutea YIM 91125 T was isolated from a hypersaline lake in China. The genome of strain YIM 91125 T becomes the twelfth species sequenced in Halomonas, and the thirteenth species sequenced in Halomonadaceae. We described the features of H. lutea YIM 91125 T, together with the high quality draft genome sequence and annotation of its type strain. The 4,533,090 bp long genome of strain YIMmore » 91125 T with its 4,284 protein-coding and 84 RNA genes is a part of Genomic Encyclopedia of Type Strains, Phase I: the one thousand microbial genomes (KMG-I) project. From the viewpoint of comparative genomics, H. lutea has a larger genome size and more specific genes, which indicated acquisition of function bringing better adaption to its environment. Finally, DDH analysis demonstrated that H. lutea is a distinctive species, and halophilic features and nitrogen metabolism related genes were discovered in its genome.« less

  19. Structurally detailed coarse-grained model for Sec-facilitated co-translational protein translocation and membrane integration

    PubMed Central

    Miller, Thomas F.

    2017-01-01

    We present a coarse-grained simulation model that is capable of simulating the minute-timescale dynamics of protein translocation and membrane integration via the Sec translocon, while retaining sufficient chemical and structural detail to capture many of the sequence-specific interactions that drive these processes. The model includes accurate geometric representations of the ribosome and Sec translocon, obtained directly from experimental structures, and interactions parameterized from nearly 200 μs of residue-based coarse-grained molecular dynamics simulations. A protocol for mapping amino-acid sequences to coarse-grained beads enables the direct simulation of trajectories for the co-translational insertion of arbitrary polypeptide sequences into the Sec translocon. The model reproduces experimentally observed features of membrane protein integration, including the efficiency with which polypeptide domains integrate into the membrane, the variation in integration efficiency upon single amino-acid mutations, and the orientation of transmembrane domains. The central advantage of the model is that it connects sequence-level protein features to biological observables and timescales, enabling direct simulation for the mechanistic analysis of co-translational integration and for the engineering of membrane proteins with enhanced membrane integration efficiency. PMID:28328943

  20. Analysis and Visualization of ChIP-Seq and RNA-Seq Sequence Alignments Using ngs.plot.

    PubMed

    Loh, Yong-Hwee Eddie; Shen, Li

    2016-01-01

    The continual maturation and increasing applications of next-generation sequencing technology in scientific research have yielded ever-increasing amounts of data that need to be effectively and efficiently analyzed and innovatively mined for new biological insights. We have developed ngs.plot-a quick and easy-to-use bioinformatics tool that performs visualizations of the spatial relationships between sequencing alignment enrichment and specific genomic features or regions. More importantly, ngs.plot is customizable beyond the use of standard genomic feature databases to allow the analysis and visualization of user-specified regions of interest generated by the user's own hypotheses. In this protocol, we demonstrate and explain the use of ngs.plot using command line executions, as well as a web-based workflow on the Galaxy framework. We replicate the underlying commands used in the analysis of a true biological dataset that we had reported and published earlier and demonstrate how ngs.plot can easily generate publication-ready figures. With ngs.plot, users would be able to efficiently and innovatively mine their own datasets without having to be involved in the technical aspects of sequence coverage calculations and genomic databases.

  1. Draft genome sequence of Halomonas lutea strain YIM 91125 T (DSM 23508 T) isolated from the alkaline Lake Ebinur in Northwest China

    DOE PAGES

    Gao, Xiao-Yang; Zhi, Xiao-Yang; Li, Hong-Wei; ...

    2015-01-20

    Species of the genus Halomonas are halophilic and their flexible adaption to changes of salinity and temperature brings considerable potential biotechnology applications, such as degradation of organic pollutants and enzyme production. The type strain Halomonas lutea YIM 91125 T was isolated from a hypersaline lake in China. The genome of strain YIM 91125 T becomes the twelfth species sequenced in Halomonas, and the thirteenth species sequenced in Halomonadaceae. We described the features of H. lutea YIM 91125 T, together with the high quality draft genome sequence and annotation of its type strain. The 4,533,090 bp long genome of strain YIMmore » 91125 T with its 4,284 protein-coding and 84 RNA genes is a part of Genomic Encyclopedia of Type Strains, Phase I: the one thousand microbial genomes (KMG-I) project. From the viewpoint of comparative genomics, H. lutea has a larger genome size and more specific genes, which indicated acquisition of function bringing better adaption to its environment. Finally, DDH analysis demonstrated that H. lutea is a distinctive species, and halophilic features and nitrogen metabolism related genes were discovered in its genome.« less

  2. Structural and functional partitioning of bread wheat chromosome 3B.

    PubMed

    Choulet, Frédéric; Alberti, Adriana; Theil, Sébastien; Glover, Natasha; Barbe, Valérie; Daron, Josquin; Pingault, Lise; Sourdille, Pierre; Couloux, Arnaud; Paux, Etienne; Leroy, Philippe; Mangenot, Sophie; Guilhot, Nicolas; Le Gouis, Jacques; Balfourier, Francois; Alaux, Michael; Jamilloux, Véronique; Poulain, Julie; Durand, Céline; Bellec, Arnaud; Gaspin, Christine; Safar, Jan; Dolezel, Jaroslav; Rogers, Jane; Vandepoele, Klaas; Aury, Jean-Marc; Mayer, Klaus; Berges, Hélène; Quesneville, Hadi; Wincker, Patrick; Feuillet, Catherine

    2014-07-18

    We produced a reference sequence of the 1-gigabase chromosome 3B of hexaploid bread wheat. By sequencing 8452 bacterial artificial chromosomes in pools, we assembled a sequence of 774 megabases carrying 5326 protein-coding genes, 1938 pseudogenes, and 85% of transposable elements. The distribution of structural and functional features along the chromosome revealed partitioning correlated with meiotic recombination. Comparative analyses indicated high wheat-specific inter- and intrachromosomal gene duplication activities that are potential sources of variability for adaption. In addition to providing a better understanding of the organization, function, and evolution of a large and polyploid genome, the availability of a high-quality sequence anchored to genetic maps will accelerate the identification of genes underlying important agronomic traits. Copyright © 2014, American Association for the Advancement of Science.

  3. Compartmentalization of HIV-1 within the female genital tract is due to monotypic and low-diversity variants not distinct viral populations.

    PubMed

    Bull, Marta; Learn, Gerald; Genowati, Indira; McKernan, Jennifer; Hitti, Jane; Lockhart, David; Tapia, Kenneth; Holte, Sarah; Dragavon, Joan; Coombs, Robert; Mullins, James; Frenkel, Lisa

    2009-09-22

    Compartmentalization of HIV-1 between the genital tract and blood was noted in half of 57 women included in 12 studies primarily using cell-free virus. To further understand differences between genital tract and blood viruses of women with chronic HIV-1 infection cell-free and cell-associated virus populations were sequenced from these tissues, reasoning that integrated viral DNA includes variants archived from earlier in infection, and provides a greater array of genotypes for comparisons. Multiple sequences from single-genome-amplification of HIV-1 RNA and DNA from the genital tract and blood of each woman were compared in a cross-sectional study. Maximum likelihood phylogenies were evaluated for evidence of compartmentalization using four statistical tests. Genital tract and blood HIV-1 appears compartmentalized in 7/13 women by >/=2 statistical analyses. These subjects' phylograms were characterized by low diversity genital-specific viral clades interspersed between clades containing both genital and blood sequences. Many of the genital-specific clades contained monotypic HIV-1 sequences. In 2/7 women, HIV-1 populations were significantly compartmentalized across all four statistical tests; both had low diversity genital tract-only clades. Collapsing monotypic variants into a single sequence diminished the prevalence and extent of compartmentalization. Viral sequences did not demonstrate tissue-specific signature amino acid residues, differential immune selection, or co-receptor usage. In women with chronic HIV-1 infection multiple identical sequences suggest proliferation of HIV-1-infected cells, and low diversity tissue-specific phylogenetic clades are consistent with bursts of viral replication. These monotypic and tissue-specific viruses provide statistical support for compartmentalization of HIV-1 between the female genital tract and blood. However, the intermingling of these clades with clades comprised of both genital and blood sequences and the absence of tissue-specific genetic features suggests compartmentalization between blood and genital tract may be due to viral replication and proliferation of infected cells, and questions whether HIV-1 in the female genital tract is distinct from blood.

  4. Sample sequencing of vascular plants demonstrates widespread conservation and divergence of microRNAs.

    PubMed

    Chávez Montes, Ricardo A; de Fátima Rosas-Cárdenas, Flor; De Paoli, Emanuele; Accerbi, Monica; Rymarquis, Linda A; Mahalingam, Gayathri; Marsch-Martínez, Nayelli; Meyers, Blake C; Green, Pamela J; de Folter, Stefan

    2014-04-23

    Small RNAs are pivotal regulators of gene expression that guide transcriptional and post-transcriptional silencing mechanisms in eukaryotes, including plants. Here we report a comprehensive atlas of sRNA and miRNA from 3 species of algae and 31 representative species across vascular plants, including non-model plants. We sequence and quantify sRNAs from 99 different tissues or treatments across species, resulting in a data set of over 132 million distinct sequences. Using miRBase mature sequences as a reference, we identify the miRNA sequences present in these libraries. We apply diverse profiling methods to examine critical sRNA and miRNA features, such as size distribution, tissue-specific regulation and sequence conservation between species, as well as to predict putative new miRNA sequences. We also develop database resources, computational analysis tools and a dedicated website, http://smallrna.udel.edu/. This study provides new insights on plant sRNAs and miRNAs, and a foundation for future studies.

  5. An electrostatic selection mechanism controls sequential kinase signaling downstream of the T cell receptor

    PubMed Central

    Shah, Neel H; Wang, Qi; Yan, Qingrong; Karandur, Deepti; Kadlecek, Theresa A; Fallahee, Ian R; Russ, William P; Ranganathan, Rama; Weiss, Arthur; Kuriyan, John

    2016-01-01

    The sequence of events that initiates T cell signaling is dictated by the specificities and order of activation of the tyrosine kinases that signal downstream of the T cell receptor. Using a platform that combines exhaustive point-mutagenesis of peptide substrates, bacterial surface-display, cell sorting, and deep sequencing, we have defined the specificities of the first two kinases in this pathway, Lck and ZAP-70, for the T cell receptor ζ chain and the scaffold proteins LAT and SLP-76. We find that ZAP-70 selects its substrates by utilizing an electrostatic mechanism that excludes substrates with positively-charged residues and favors LAT and SLP-76 phosphosites that are surrounded by negatively-charged residues. This mechanism prevents ZAP-70 from phosphorylating its own activation loop, thereby enforcing its strict dependence on Lck for activation. The sequence features in ZAP-70, LAT, and SLP-76 that underlie electrostatic selectivity likely contribute to the specific response of T cells to foreign antigens. DOI: http://dx.doi.org/10.7554/eLife.20105.001 PMID:27700984

  6. A "signal on" protection-displacement-hybridization-based electrochemical hepatitis B virus gene sequence sensor with high sensitivity and peculiar adjustable specificity.

    PubMed

    Li, Fengqin; Xu, Yanmei; Yu, Xiang; Yu, Zhigang; He, Xunjun; Ji, Hongrui; Dong, Jinghao; Song, Yongbin; Yan, Hong; Zhang, Guiling

    2016-08-15

    One "signal on" electrochemical sensing strategy was constructed for the detection of a specific hepatitis B virus (HBV) gene sequence based on the protection-displacement-hybridization-based (PDHB) signaling mechanism. This sensing system is composed of three probes, one capturing probe (CP) and one assistant probe (AP) which are co-immobilized on the Au electrode surface, and one 3-methylene blue (MB) modified signaling probe (SP) free in the detection solution. One duplex are formed between AP and SP with the target, a specific HBV gene sequence, hybridizing with CP. This structure can drive the MB labels close to the electrode surface, thereby producing a large detection current. Two electrochemical testing techniques, alternating current voltammetry (ACV) and cyclic voltammetry (CV), were used for characterizing the sensor. Under the optimized conditions, the proposed sensor exhibits a high sensitivity with the detection limit of ∼5fM for the target. When used for the discrimination of point mutation, the sensor also features an outstanding ability and its peculiar high adjustability. Copyright © 2016 Elsevier B.V. All rights reserved.

  7. Genomic insights into strategies used by Xanthomonas albilineans with its reduced artillery to spread within sugarcane xylem vessels.

    PubMed

    Pieretti, Isabelle; Royer, Monique; Barbe, Valérie; Carrere, Sébastien; Koebnik, Ralf; Couloux, Arnaud; Darrasse, Armelle; Gouzy, Jérôme; Jacques, Marie-Agnès; Lauber, Emmanuelle; Manceau, Charles; Mangenot, Sophie; Poussier, Stéphane; Segurens, Béatrice; Szurek, Boris; Verdier, Valérie; Arlat, Matthieu; Gabriel, Dean W; Rott, Philippe; Cociancich, Stéphane

    2012-11-21

    Xanthomonas albilineans causes leaf scald, a lethal disease of sugarcane. X. albilineans exhibits distinctive pathogenic mechanisms, ecology and taxonomy compared to other species of Xanthomonas. For example, this species produces a potent DNA gyrase inhibitor called albicidin that is largely responsible for inducing disease symptoms; its habitat is limited to xylem; and the species exhibits large variability. A first manuscript on the complete genome sequence of the highly pathogenic X. albilineans strain GPE PC73 focused exclusively on distinctive genomic features shared with Xylella fastidiosa-another xylem-limited Xanthomonadaceae. The present manuscript on the same genome sequence aims to describe all other pathogenicity-related genomic features of X. albilineans, and to compare, using suppression subtractive hybridization (SSH), genomic features of two strains differing in pathogenicity. Comparative genomic analyses showed that most of the known pathogenicity factors from other Xanthomonas species are conserved in X. albilineans, with the notable absence of two major determinants of the "artillery" of other plant pathogenic species of Xanthomonas: the xanthan gum biosynthesis gene cluster, and the type III secretion system Hrp (hypersensitive response and pathogenicity). Genomic features specific to X. albilineans that may contribute to specific adaptation of this pathogen to sugarcane xylem vessels were also revealed. SSH experiments led to the identification of 20 genes common to three highly pathogenic strains but missing in a less pathogenic strain. These 20 genes, which include four ABC transporter genes, a methyl-accepting chemotaxis protein gene and an oxidoreductase gene, could play a key role in pathogenicity. With the exception of hypothetical proteins revealed by our comparative genomic analyses and SSH experiments, no genes potentially involved in any offensive or counter-defensive mechanism specific to X. albilineans were identified, supposing that X. albilineans has a reduced artillery compared to other pathogenic Xanthomonas species. Particular attention has therefore been given to genomic features specific to X. albilineans making it more capable of evading sugarcane surveillance systems or resisting sugarcane defense systems. This study confirms that X. albilineans is a highly distinctive species within the genus Xanthomonas, and opens new perpectives towards a greater understanding of the pathogenicity of this destructive sugarcane pathogen.

  8. Terminal Duplex Stability and Nucleotide Identity Differentially Control siRNA Loading and Activity in RNA Interference

    PubMed Central

    Angart, Phillip A.; Carlson, Rebecca J.; Adu-Berchie, Kwasi

    2016-01-01

    Efficient short interfering RNA (siRNA)-mediated gene silencing requires selection of a sequence that is complementary to the intended target and possesses sequence and structural features that encourage favorable functional interactions with the RNA interference (RNAi) pathway proteins. In this study, we investigated how terminal sequence and structural characteristics of siRNAs contribute to siRNA strand loading and silencing activity and how these characteristics ultimately result in a functionally asymmetric duplex in cultured HeLa cells. Our results reiterate that the most important characteristic in determining siRNA activity is the 5′ terminal nucleotide identity. Our findings further suggest that siRNA loading is controlled principally by the hybridization stability of the 5′ terminus (Nucleotides: 1–2) of each siRNA strand, independent of the opposing terminus. Postloading, RNA-induced silencing complex (RISC)–specific activity was found to be improved by lower hybridization stability in the 5′ terminus (Nucleotides: 3–4) of the loaded siRNA strand and greater hybridization stability toward the 3′ terminus (Nucleotides: 17–18). Concomitantly, specific recognition of the 5′ terminal nucleotide sequence by human Argonaute 2 (Ago2) improves RISC half-life. These findings indicate that careful selection of siRNA sequences can maximize both the loading and the specific activity of the intended guide strand. PMID:27399870

  9. Noonan syndrome - a new survey.

    PubMed

    Tafazoli, Alireza; Eshraghi, Peyman; Koleti, Zahra Kamel; Abbaszadegan, Mohammadreza

    2017-02-01

    Noonan syndrome (NS) is an autosomal dominant disorder with vast heterogeneity in clinical and genetic features. Various symptoms have been reported for this abnormality such as short stature, unusual facial characteristics, congenital heart abnormalities, developmental complications, and an elevated tumor incidence rate. Noonan syndrome shares clinical features with other rare conditions, including LEOPARD syndrome, cardio-facio-cutaneous syndrome, Noonan-like syndrome with loose anagen hair, and Costello syndrome. Germline mutations in the RAS-MAPK (mitogen-activated protein kinase) signal transduction pathway are responsible for NS and other related disorders. Noonan syndrome diagnosis is primarily based on clinical features, but molecular testing should be performed to confirm it in patients. Due to the high number of genes associated with NS and other RASopathy disorders, next-generation sequencing is the best choice for diagnostic testing. Patients with NS also have higher risk for leukemia and specific solid tumors. Age-specific guidelines for the management of NS are available.

  10. Noonan syndrome – a new survey

    PubMed Central

    Tafazoli, Alireza; Eshraghi, Peyman; Koleti, Zahra Kamel

    2016-01-01

    Noonan syndrome (NS) is an autosomal dominant disorder with vast heterogeneity in clinical and genetic features. Various symptoms have been reported for this abnormality such as short stature, unusual facial characteristics, congenital heart abnormalities, developmental complications, and an elevated tumor incidence rate. Noonan syndrome shares clinical features with other rare conditions, including LEOPARD syndrome, cardio-facio-cutaneous syndrome, Noonan-like syndrome with loose anagen hair, and Costello syndrome. Germline mutations in the RAS-MAPK (mitogen-activated protein kinase) signal transduction pathway are responsible for NS and other related disorders. Noonan syndrome diagnosis is primarily based on clinical features, but molecular testing should be performed to confirm it in patients. Due to the high number of genes associated with NS and other RASopathy disorders, next-generation sequencing is the best choice for diagnostic testing. Patients with NS also have higher risk for leukemia and specific solid tumors. Age-specific guidelines for the management of NS are available. PMID:28144274

  11. Protein functional features are reflected in the patterns of mRNA translation speed.

    PubMed

    López, Daniel; Pazos, Florencio

    2015-07-09

    The degeneracy of the genetic code makes it possible for the same amino acid string to be coded by different messenger RNA (mRNA) sequences. These "synonymous mRNAs" may differ largely in a number of aspects related to their overall translational efficiency, such as secondary structure content and availability of the encoded transfer RNAs (tRNAs). Consequently, they may render different yields of the translated polypeptides. These mRNA features related to translation efficiency are also playing a role locally, resulting in a non-uniform translation speed along the mRNA, which has been previously related to some protein structural features and also used to explain some dramatic effects of "silent" single-nucleotide-polymorphisms (SNPs). In this work we perform the first large scale analysis of the relationship between three experimental proxies of mRNA local translation efficiency and the local features of the corresponding encoded proteins. We found that a number of protein functional and structural features are reflected in the patterns of ribosome occupancy, secondary structure and tRNA availability along the mRNA. One or more of these proxies of translation speed have distinctive patterns around the mRNA regions coding for certain protein local features. In some cases the three patterns follow a similar trend. We also show specific examples where these patterns of translation speed point to the protein's important structural and functional features. This support the idea that the genome not only codes the protein functional features as sequences of amino acids, but also as subtle patterns of mRNA properties which, probably through local effects on the translation speed, have some consequence on the final polypeptide. These results open the possibility of predicting a protein's functional regions based on a single genomic sequence, and have implications for heterologous protein expression and fine-tuning protein function.

  12. Fast H-DROP: A thirty times accelerated version of H-DROP for interactive SVM-based prediction of helical domain linkers

    NASA Astrophysics Data System (ADS)

    Richa, Tambi; Ide, Soichiro; Suzuki, Ryosuke; Ebina, Teppei; Kuroda, Yutaka

    2017-02-01

    Efficient and rapid prediction of domain regions from amino acid sequence information alone is often required for swift structural and functional characterization of large multi-domain proteins. Here we introduce Fast H-DROP, a thirty times accelerated version of our previously reported H-DROP (Helical Domain linker pRediction using OPtimal features), which is unique in specifically predicting helical domain linkers (boundaries). Fast H-DROP, analogously to H-DROP, uses optimum features selected from a set of 3000 ones by combining a random forest and a stepwise feature selection protocol. We reduced the computational time from 8.5 min per sequence in H-DROP to 14 s per sequence in Fast H-DROP on an 8 Xeon processor Linux server by using SWISS-PROT instead of Genbank non-redundant (nr) database for generating the PSSMs. The sensitivity and precision of Fast H-DROP assessed by cross-validation were 33.7 and 36.2%, which were merely 2% lower than that of H-DROP. The reduced computational time of Fast H-DROP, without affecting prediction performances, makes it more interactive and user-friendly. Fast H-DROP and H-DROP are freely available from http://domserv.lab.tuat.ac.jp/.

  13. Japanese Wolves are Genetically Divided into Two Groups Based on an 8-Nucleotide Insertion/Deletion within the mtDNA Control Region.

    PubMed

    Ishiguro, Naotaka; Inoshima, Yasuo; Yanai, Tokuma; Sasaki, Motoki; Matsui, Akira; Kikuchi, Hiroki; Maruyama, Masashi; Hongo, Hitomi; Vostretsov, Yuri E; Gasilin, Viatcheslav; Kosintsev, Pavel A; Quanjia, Chen; Chunxue, Wang

    2016-02-01

    The mitochondrial DNA (mtDNA) control region (198- to 598-bp) of four ancient Canis specimens (two Canis mandibles, a cranium, and a first phalanx) was examined, and each specimen was genetically identified as Japanese wolf. Two unique nucleotide substitutions, the 78-C insertion and the 482-G deletion, both of which are specific for Japanese wolf, were observed in each sample. Based on the mtDNA sequences analyzed, these four specimens and 10 additional Japanese wolf samples could be classified into two groups- Group A (10 samples) and Group B (4 samples)-which contain or lack an 8-bp insertion/deletion (indel), respectively. Interestingly, three dogs (Akita-b, Kishu 25, and S-husky 102) that each contained Japanese wolf-specific features were also classified into Group A or B based on the 8-bp indel. To determine the origin or ancestor of the Japanese wolf, mtDNA control regions of ancient continental Canis specimens were examined; 84 specimens were from Russia, and 29 were from China. However, none of these 113 specimens contained Japanese wolf-specific sequences. Moreover, none of 426 Japanese modern hunting dogs examined contained these Japanese wolf-specific mtDNA sequences. The mtDNA control region sequences of Groups A and B appeared to be unique to grey wolf and dog populations.

  14. Human Y chromosome copy number variation in the next generation sequencing era and beyond.

    PubMed

    Massaia, Andrea; Xue, Yali

    2017-05-01

    The human Y chromosome provides a fertile ground for structural rearrangements owing to its haploidy and high content of repeated sequences. The methodologies used for copy number variation (CNV) studies have developed over the years. Low-throughput techniques based on direct observation of rearrangements were developed early on, and are still used, often to complement array-based or sequencing approaches which have limited power in regions with high repeat content and specifically in the presence of long, identical repeats, such as those found in human sex chromosomes. Some specific rearrangements have been investigated for decades; because of their effects on fertility, or their outstanding evolutionary features, the interest in these has not diminished. However, following the flourishing of large-scale genomics, several studies have investigated CNVs across the whole chromosome. These studies sometimes employ data generated within large genomic projects such as the DDD study or the 1000 Genomes Project, and often survey large samples of healthy individuals without any prior selection. Novel technologies based on sequencing long molecules and combinations of technologies, promise to stimulate the study of Y-CNVs in the immediate future.

  15. Sequence-Specific Recognition of DNA by Proteins: Binding Motifs Discovered Using a Novel Statistical/Computational Analysis

    PubMed Central

    Jakubec, David; Laskowski, Roman A.; Vondrasek, Jiri

    2016-01-01

    Decades of intensive experimental studies of the recognition of DNA sequences by proteins have provided us with a view of a diverse and complicated world in which few to no features are shared between individual DNA-binding protein families. The originally conceived direct readout of DNA residue sequences by amino acid side chains offers very limited capacity for sequence recognition, while the effects of the dynamic properties of the interacting partners remain difficult to quantify and almost impossible to generalise. In this work we investigated the energetic characteristics of all DNA residue—amino acid side chain combinations in the conformations found at the interaction interface in a very large set of protein—DNA complexes by the means of empirical potential-based calculations. General specificity-defining criteria were derived and utilised to look beyond the binding motifs considered in previous studies. Linking energetic favourability to the observed geometrical preferences, our approach reveals several additional amino acid motifs which can distinguish between individual DNA bases. Our results remained valid in environments with various dielectric properties. PMID:27384774

  16. Learning multiple variable-speed sequences in striatum via cortical tutoring.

    PubMed

    Murray, James M; Escola, G Sean

    2017-05-08

    Sparse, sequential patterns of neural activity have been observed in numerous brain areas during timekeeping and motor sequence tasks. Inspired by such observations, we construct a model of the striatum, an all-inhibitory circuit where sequential activity patterns are prominent, addressing the following key challenges: (i) obtaining control over temporal rescaling of the sequence speed, with the ability to generalize to new speeds; (ii) facilitating flexible expression of distinct sequences via selective activation, concatenation, and recycling of specific subsequences; and (iii) enabling the biologically plausible learning of sequences, consistent with the decoupling of learning and execution suggested by lesion studies showing that cortical circuits are necessary for learning, but that subcortical circuits are sufficient to drive learned behaviors. The same mechanisms that we describe can also be applied to circuits with both excitatory and inhibitory populations, and hence may underlie general features of sequential neural activity pattern generation in the brain.

  17. Diverse RNA-binding proteins interact with functionally related sets of RNAs, suggesting an extensive regulatory system.

    PubMed

    Hogan, Daniel J; Riordan, Daniel P; Gerber, André P; Herschlag, Daniel; Brown, Patrick O

    2008-10-28

    RNA-binding proteins (RBPs) have roles in the regulation of many post-transcriptional steps in gene expression, but relatively few RBPs have been systematically studied. We searched for the RNA targets of 40 proteins in the yeast Saccharomyces cerevisiae: a selective sample of the approximately 600 annotated and predicted RBPs, as well as several proteins not annotated as RBPs. At least 33 of these 40 proteins, including three of the four proteins that were not previously known or predicted to be RBPs, were reproducibly associated with specific sets of a few to several hundred RNAs. Remarkably, many of the RBPs we studied bound mRNAs whose protein products share identifiable functional or cytotopic features. We identified specific sequences or predicted structures significantly enriched in target mRNAs of 16 RBPs. These potential RNA-recognition elements were diverse in sequence, structure, and location: some were found predominantly in 3'-untranslated regions, others in 5'-untranslated regions, some in coding sequences, and many in two or more of these features. Although this study only examined a small fraction of the universe of yeast RBPs, 70% of the mRNA transcriptome had significant associations with at least one of these RBPs, and on average, each distinct yeast mRNA interacted with three of the RBPs, suggesting the potential for a rich, multidimensional network of regulation. These results strongly suggest that combinatorial binding of RBPs to specific recognition elements in mRNAs is a pervasive mechanism for multi-dimensional regulation of their post-transcriptional fate.

  18. Transcription Factor Information System (TFIS): A Tool for Detection of Transcription Factor Binding Sites.

    PubMed

    Narad, Priyanka; Kumar, Abhishek; Chakraborty, Amlan; Patni, Pranav; Sengupta, Abhishek; Wadhwa, Gulshan; Upadhyaya, K C

    2017-09-01

    Transcription factors are trans-acting proteins that interact with specific nucleotide sequences known as transcription factor binding site (TFBS), and these interactions are implicated in regulation of the gene expression. Regulation of transcriptional activation of a gene often involves multiple interactions of transcription factors with various sequence elements. Identification of these sequence elements is the first step in understanding the underlying molecular mechanism(s) that regulate the gene expression. For in silico identification of these sequence elements, we have developed an online computational tool named transcription factor information system (TFIS) for detecting TFBS for the first time using a collection of JAVA programs and is mainly based on TFBS detection using position weight matrix (PWM). The database used for obtaining position frequency matrices (PFM) is JASPAR and HOCOMOCO, which is an open-access database of transcription factor binding profiles. Pseudo-counts are used while converting PFM to PWM, and TFBS detection is carried out on the basis of percent score taken as threshold value. TFIS is equipped with advanced features such as direct sequence retrieving from NCBI database using gene identification number and accession number, detecting binding site for common TF in a batch of gene sequences, and TFBS detection after generating PWM from known raw binding sequences in addition to general detection methods. TFIS can detect the presence of potential TFBSs in both the directions at the same time. This feature increases its efficiency. And the results for this dual detection are presented in different colors specific to the orientation of the binding site. Results obtained by the TFIS are more detailed and specific to the detected TFs as integration of more informative links from various related web servers are added in the result pages like Gene Ontology, PAZAR database and Transcription Factor Encyclopedia in addition to NCBI and UniProt. Common TFs like SP1, AP1 and NF-KB of the Amyloid beta precursor gene is easily detected using TFIS along with multiple binding sites. In another scenario of embryonic developmental process, TFs of the FOX family (FOXL1 and FOXC1) were also identified. TFIS is platform-independent which is publicly available along with its support and documentation at http://tfistool.appspot.com and http://www.bioinfoplus.com/tfis/ . TFIS is licensed under the GNU General Public License, version 3 (GPL-3.0).

  19. CRISPRDetect: A flexible algorithm to define CRISPR arrays.

    PubMed

    Biswas, Ambarish; Staals, Raymond H J; Morales, Sergio E; Fineran, Peter C; Brown, Chris M

    2016-05-17

    CRISPR (clustered regularly interspaced short palindromic repeats) RNAs provide the specificity for noncoding RNA-guided adaptive immune defence systems in prokaryotes. CRISPR arrays consist of repeat sequences separated by specific spacer sequences. CRISPR arrays have previously been identified in a large proportion of prokaryotic genomes. However, currently available detection algorithms do not utilise recently discovered features regarding CRISPR loci. We have developed a new approach to automatically detect, predict and interactively refine CRISPR arrays. It is available as a web program and command line from bioanalysis.otago.ac.nz/CRISPRDetect. CRISPRDetect discovers putative arrays, extends the array by detecting additional variant repeats, corrects the direction of arrays, refines the repeat/spacer boundaries, and annotates different types of sequence variations (e.g. insertion/deletion) in near identical repeats. Due to these features, CRISPRDetect has significant advantages when compared to existing identification tools. As well as further support for small medium and large repeats, CRISPRDetect identified a class of arrays with 'extra-large' repeats in bacteria (repeats 44-50 nt). The CRISPRDetect output is integrated with other analysis tools. Notably, the predicted spacers can be directly utilised by CRISPRTarget to predict targets. CRISPRDetect enables more accurate detection of arrays and spacers and its gff output is suitable for inclusion in genome annotation pipelines and visualisation. It has been used to analyse all complete bacterial and archaeal reference genomes.

  20. Liquid-gas phase transition in asymmetric nuclear matter at finite temperature

    NASA Astrophysics Data System (ADS)

    Maruyama, Toshiki; Tatsumi, Toshitaka; Chiba, Satoshi

    2010-03-01

    Liquid-gas phase transition is discussed in warm asymmetric nuclear matter. Some peculiar features are figured out from the viewpoint of the basic thermodynamics about the phase equilibrium. We treat the mixed phase of the binary system based on the Gibbs conditions. When the Coulomb interaction is included, the mixed phase is no more uniform and the sequence of the pasta structures appears. Comparing the results with those given by the simple bulk calculation without the Coulomb interaction, we extract specific features of the pasta structures at finite temperature.

  1. Feature selection and classification of protein-protein complexes based on their binding affinities using machine learning approaches.

    PubMed

    Yugandhar, K; Gromiha, M Michael

    2014-09-01

    Protein-protein interactions are intrinsic to virtually every cellular process. Predicting the binding affinity of protein-protein complexes is one of the challenging problems in computational and molecular biology. In this work, we related sequence features of protein-protein complexes with their binding affinities using machine learning approaches. We set up a database of 185 protein-protein complexes for which the interacting pairs are heterodimers and their experimental binding affinities are available. On the other hand, we have developed a set of 610 features from the sequences of protein complexes and utilized Ranker search method, which is the combination of Attribute evaluator and Ranker method for selecting specific features. We have analyzed several machine learning algorithms to discriminate protein-protein complexes into high and low affinity groups based on their Kd values. Our results showed a 10-fold cross-validation accuracy of 76.1% with the combination of nine features using support vector machines. Further, we observed accuracy of 83.3% on an independent test set of 30 complexes. We suggest that our method would serve as an effective tool for identifying the interacting partners in protein-protein interaction networks and human-pathogen interactions based on the strength of interactions. © 2014 Wiley Periodicals, Inc.

  2. Salient Features of Endonuclease Platforms for Therapeutic Genome Editing.

    PubMed

    Certo, Michael T; Morgan, Richard A

    2016-03-01

    Emerging gene-editing technologies are nearing a revolutionary phase in genetic medicine: precisely modifying or repairing causal genetic defects. This may include any number of DNA sequence manipulations, such as knocking out a deleterious gene, introducing a particular mutation, or directly repairing a defective sequence by site-specific recombination. All of these edits can currently be achieved via programmable rare-cutting endonucleases to create targeted DNA breaks that can engage and exploit endogenous DNA repair pathways to impart site-specific genetic changes. Over the past decade, several distinct technologies for introducing site-specific DNA breaks have been developed, yet the different biological origins of these gene-editing technologies bring along inherent differences in parameters that impact clinical implementation. This review aims to provide an accessible overview of the various endonuclease-based gene-editing platforms, highlighting the strengths and weakness of each with respect to therapeutic applications.

  3. Salient Features of Endonuclease Platforms for Therapeutic Genome Editing

    PubMed Central

    Certo, Michael T; Morgan, Richard A

    2016-01-01

    Emerging gene-editing technologies are nearing a revolutionary phase in genetic medicine: precisely modifying or repairing causal genetic defects. This may include any number of DNA sequence manipulations, such as knocking out a deleterious gene, introducing a particular mutation, or directly repairing a defective sequence by site-specific recombination. All of these edits can currently be achieved via programmable rare-cutting endonucleases to create targeted DNA breaks that can engage and exploit endogenous DNA repair pathways to impart site-specific genetic changes. Over the past decade, several distinct technologies for introducing site-specific DNA breaks have been developed, yet the different biological origins of these gene-editing technologies bring along inherent differences in parameters that impact clinical implementation. This review aims to provide an accessible overview of the various endonuclease-based gene-editing platforms, highlighting the strengths and weakness of each with respect to therapeutic applications. PMID:26796671

  4. Development of Genome Engineering Tools from Plant-Specific PPR Proteins Using Animal Cultured Cells.

    PubMed

    Kobayashi, Takehito; Yagi, Yusuke; Nakamura, Takahiro

    2016-01-01

    The pentatricopeptide repeat (PPR) motif is a sequence-specific RNA/DNA-binding module. Elucidation of the RNA/DNA recognition mechanism has enabled engineering of PPR motifs as new RNA/DNA manipulation tools in living cells, including for genome editing. However, the biochemical characteristics of PPR proteins remain unknown, mostly due to the instability and/or unfolding propensities of PPR proteins in heterologous expression systems such as bacteria and yeast. To overcome this issue, we constructed reporter systems using animal cultured cells. The cell-based system has highly attractive features for PPR engineering: robust eukaryotic gene expression; availability of various vectors, reagents, and antibodies; highly efficient DNA delivery ratio (>80 %); and rapid, high-throughput data production. In this chapter, we introduce an example of such reporter systems: a PPR-based sequence-specific translational activation system. The cell-based reporter system can be applied to characterize plant genes of interested and to PPR engineering.

  5. An enhanceosome containing the Jun B/Fra-2 heterodimer and the HMG-I(Y) architectural protein controls HPV 18 transcription.

    PubMed

    Bouallaga, I; Massicard, S; Yaniv, M; Thierry, F

    2000-11-01

    Recent studies have reported new mechanisms that mediate the transcriptional synergy of strong tissue-specific enhancers, involving the cooperative assembly of higher-order nucleoprotein complexes called enhanceosomes. Here we show that the HPV18 enhancer, which controls the epithelial-specific transcription of the E6 and E7 transforming genes, exhibits characteristic features of these structures. We used deletion experiments to show that a core enhancer element cooperates, in a specific helical phasing, with distant essential factors binding to the ends of the enhancer. This core sequence, binding a Jun B/Fra-2 heterodimer, cooperatively recruits the architectural protein HMG-I(Y) in a nucleoprotein complex, where they interact with each other. Therefore, in HeLa cells, HPV18 transcription seems to depend upon the assembly of an enhanceosome containing multiple cellular factors recruited by a core sequence interacting with AP1 and HMG-I(Y).

  6. A machine learning model to determine the accuracy of variant calls in capture-based next generation sequencing.

    PubMed

    van den Akker, Jeroen; Mishne, Gilad; Zimmer, Anjali D; Zhou, Alicia Y

    2018-04-17

    Next generation sequencing (NGS) has become a common technology for clinical genetic tests. The quality of NGS calls varies widely and is influenced by features like reference sequence characteristics, read depth, and mapping accuracy. With recent advances in NGS technology and software tools, the majority of variants called using NGS alone are in fact accurate and reliable. However, a small subset of difficult-to-call variants that still do require orthogonal confirmation exist. For this reason, many clinical laboratories confirm NGS results using orthogonal technologies such as Sanger sequencing. Here, we report the development of a deterministic machine-learning-based model to differentiate between these two types of variant calls: those that do not require confirmation using an orthogonal technology (high confidence), and those that require additional quality testing (low confidence). This approach allows reliable NGS-based calling in a clinical setting by identifying the few important variant calls that require orthogonal confirmation. We developed and tested the model using a set of 7179 variants identified by a targeted NGS panel and re-tested by Sanger sequencing. The model incorporated several signals of sequence characteristics and call quality to determine if a variant was identified at high or low confidence. The model was tuned to eliminate false positives, defined as variants that were called by NGS but not confirmed by Sanger sequencing. The model achieved very high accuracy: 99.4% (95% confidence interval: +/- 0.03%). It categorized 92.2% (6622/7179) of the variants as high confidence, and 100% of these were confirmed to be present by Sanger sequencing. Among the variants that were categorized as low confidence, defined as NGS calls of low quality that are likely to be artifacts, 92.1% (513/557) were found to be not present by Sanger sequencing. This work shows that NGS data contains sufficient characteristics for a machine-learning-based model to differentiate low from high confidence variants. Additionally, it reveals the importance of incorporating site-specific features as well as variant call features in such a model.

  7. A gradient-boosting approach for filtering de novo mutations in parent-offspring trios.

    PubMed

    Liu, Yongzhuang; Li, Bingshan; Tan, Renjie; Zhu, Xiaolin; Wang, Yadong

    2014-07-01

    Whole-genome and -exome sequencing on parent-offspring trios is a powerful approach to identifying disease-associated genes by detecting de novo mutations in patients. Accurate detection of de novo mutations from sequencing data is a critical step in trio-based genetic studies. Existing bioinformatic approaches usually yield high error rates due to sequencing artifacts and alignment issues, which may either miss true de novo mutations or call too many false ones, making downstream validation and analysis difficult. In particular, current approaches have much worse specificity than sensitivity, and developing effective filters to discriminate genuine from spurious de novo mutations remains an unsolved challenge. In this article, we curated 59 sequence features in whole genome and exome alignment context which are considered to be relevant to discriminating true de novo mutations from artifacts, and then employed a machine-learning approach to classify candidates as true or false de novo mutations. Specifically, we built a classifier, named De Novo Mutation Filter (DNMFilter), using gradient boosting as the classification algorithm. We built the training set using experimentally validated true and false de novo mutations as well as collected false de novo mutations from an in-house large-scale exome-sequencing project. We evaluated DNMFilter's theoretical performance and investigated relative importance of different sequence features on the classification accuracy. Finally, we applied DNMFilter on our in-house whole exome trios and one CEU trio from the 1000 Genomes Project and found that DNMFilter could be coupled with commonly used de novo mutation detection approaches as an effective filtering approach to significantly reduce false discovery rate without sacrificing sensitivity. The software DNMFilter implemented using a combination of Java and R is freely available from the website at http://humangenome.duke.edu/software. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  8. PrPC Governs Susceptibility to Prion Strains in Bank Vole, While Other Host Factors Modulate Strain Features

    PubMed Central

    Espinosa, J. C.; Nonno, R.; Di Bari, M.; Aguilar-Calvo, P.; Pirisinu, L.; Fernández-Borges, N.; Vanni, I.; Vaccari, G.; Marín-Moreno, A.; Frassanito, P.; Lorenzo, P.; Agrimi, U.

    2016-01-01

    ABSTRACT Bank vole is a rodent species that shows differential susceptibility to the experimental transmission of different prion strains. In this work, the transmission features of a panel of diverse prions with distinct origins were assayed both in bank vole expressing methionine at codon 109 (Bv109M) and in transgenic mice expressing physiological levels of bank vole PrPC (the BvPrP-Tg407 mouse line). This work is the first systematic comparison of the transmission features of a collection of prion isolates, representing a panel of diverse prion strains, in a transgenic-mouse model and in its natural counterpart. The results showed very similar transmission properties in both the natural species and the transgenic-mouse model, demonstrating the key role of the PrP amino acid sequence in prion transmission susceptibility. However, differences in the PrPSc types propagated by Bv109M and BvPrP-Tg407 suggest that host factors other than PrPC modulate prion strain features. IMPORTANCE The differential susceptibility of bank voles to prion strains can be modeled in transgenic mice, suggesting that this selective susceptibility is controlled by the vole PrP sequence alone rather than by other species-specific factors. Differences in the phenotypes observed after prion transmissions in bank voles and in the transgenic mice suggest that host factors other than the PrPC sequence may affect the selection of the substrain replicating in the animal model. PMID:27654300

  9. Prediction of redox-sensitive cysteines using sequential distance and other sequence-based features.

    PubMed

    Sun, Ming-An; Zhang, Qing; Wang, Yejun; Ge, Wei; Guo, Dianjing

    2016-08-24

    Reactive oxygen species can modify the structure and function of proteins and may also act as important signaling molecules in various cellular processes. Cysteine thiol groups of proteins are particularly susceptible to oxidation. Meanwhile, their reversible oxidation is of critical roles for redox regulation and signaling. Recently, several computational tools have been developed for predicting redox-sensitive cysteines; however, those methods either only focus on catalytic redox-sensitive cysteines in thiol oxidoreductases, or heavily depend on protein structural data, thus cannot be widely used. In this study, we analyzed various sequence-based features potentially related to cysteine redox-sensitivity, and identified three types of features for efficient computational prediction of redox-sensitive cysteines. These features are: sequential distance to the nearby cysteines, PSSM profile and predicted secondary structure of flanking residues. After further feature selection using SVM-RFE, we developed Redox-Sensitive Cysteine Predictor (RSCP), a SVM based classifier for redox-sensitive cysteine prediction using primary sequence only. Using 10-fold cross-validation on RSC758 dataset, the accuracy, sensitivity, specificity, MCC and AUC were estimated as 0.679, 0.602, 0.756, 0.362 and 0.727, respectively. When evaluated using 10-fold cross-validation with BALOSCTdb dataset which has structure information, the model achieved performance comparable to current structure-based method. Further validation using an independent dataset indicates it is robust and of relatively better accuracy for predicting redox-sensitive cysteines from non-enzyme proteins. In this study, we developed a sequence-based classifier for predicting redox-sensitive cysteines. The major advantage of this method is that it does not rely on protein structure data, which ensures more extensive application compared to other current implementations. Accurate prediction of redox-sensitive cysteines not only enhances our understanding about the redox sensitivity of cysteine, it may also complement the proteomics approach and facilitate further experimental investigation of important redox-sensitive cysteines.

  10. Protein structure based prediction of catalytic residues

    PubMed Central

    2013-01-01

    Background Worldwide structural genomics projects continue to release new protein structures at an unprecedented pace, so far nearly 6000, but only about 60% of these proteins have any sort of functional annotation. Results We explored a range of features that can be used for the prediction of functional residues given a known three-dimensional structure. These features include various centrality measures of nodes in graphs of interacting residues: closeness, betweenness and page-rank centrality. We also analyzed the distance of functional amino acids to the general center of mass (GCM) of the structure, relative solvent accessibility (RSA), and the use of relative entropy as a measure of sequence conservation. From the selected features, neural networks were trained to identify catalytic residues. We found that using distance to the GCM together with amino acid type provide a good discriminant function, when combined independently with sequence conservation. Using an independent test set of 29 annotated protein structures, the method returned 411 of the initial 9262 residues as the most likely to be involved in function. The output 411 residues contain 70 of the annotated 111 catalytic residues. This represents an approximately 14-fold enrichment of catalytic residues on the entire input set (corresponding to a sensitivity of 63% and a precision of 17%), a performance competitive with that of other state-of-the-art methods. Conclusions We found that several of the graph based measures utilize the same underlying feature of protein structures, which can be simply and more effectively captured with the distance to GCM definition. This also has the added the advantage of simplicity and easy implementation. Meanwhile sequence conservation remains by far the most influential feature in identifying functional residues. We also found that due the rapid changes in size and composition of sequence databases, conservation calculations must be recalibrated for specific reference databases. PMID:23433045

  11. The oligonucleotide frequency derived error gradient and its application to the binning of metagenome fragments

    PubMed Central

    2009-01-01

    Background The characterisation, or binning, of metagenome fragments is an important first step to further downstream analysis of microbial consortia. Here, we propose a one-dimensional signature, OFDEG, derived from the oligonucleotide frequency profile of a DNA sequence, and show that it is possible to obtain a meaningful phylogenetic signal for relatively short DNA sequences. The one-dimensional signal is essentially a compact representation of higher dimensional feature spaces of greater complexity and is intended to improve on the tetranucleotide frequency feature space preferred by current compositional binning methods. Results We compare the fidelity of OFDEG against tetranucleotide frequency in both an unsupervised and semi-supervised setting on simulated metagenome benchmark data. Four tests were conducted using assembler output of Arachne and phrap, and for each, performance was evaluated on contigs which are greater than or equal to 8 kbp in length and contigs which are composed of at least 10 reads. Using both G-C content in conjunction with OFDEG gave an average accuracy of 96.75% (semi-supervised) and 95.19% (unsupervised), versus 94.25% (semi-supervised) and 82.35% (unsupervised) for tetranucleotide frequency. Conclusion We have presented an observation of an alternative characteristic of DNA sequences. The proposed feature representation has proven to be more beneficial than the existing tetranucleotide frequency space to the metagenome binning problem. We do note, however, that our observation of OFDEG deserves further anlaysis and investigation. Unsupervised clustering revealed OFDEG related features performed better than standard tetranucleotide frequency in representing a relevant organism specific signal. Further improvement in binning accuracy is given by semi-supervised classification using OFDEG. The emphasis on a feature-driven, bottom-up approach to the problem of binning reveals promising avenues for future development of techniques to characterise short environmental sequences without bias toward cultivable organisms. PMID:19958473

  12. Detection of cystic fibrosis mutations in a GeneChip{trademark} assay format

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Miyada, C.G.; Cronin, M.T.; Kim, S.M.

    1994-09-01

    We are developing assays for the detection of cystic fibrosis mutations based on DNA hybridization. A DNA sample is amplified by PCR, labeled by incorporating a fluorescein-tagged dNTP, enzymatically treated to produce smaller fragments and hybridized to a series of short (13-16 bases) oligonucleotides synthesized on a glass surface via photolithography. The hybrids are detected by eqifluorescence and mutations are identified by the specific pattern of hybridization. In a GeneChip assay, the chip surface is composed of a series of subarrays, each being specific for a particular mutation. Each subarray is further subdivided into a series of probes (40 total),more » half based on the mutant sequence and the remainder based on the wild-type sequence. For each of the subarrays, there is a redundancy in the number of probes that should hybridize to either a wild-type or a mutant target. The multiple probe strategy provides sequence information for a short five base region overlapping the mutation site. In addition, homozygous wild-type and mutant as well as heterozygous samples are each identified by a specific pattern of hybridization. The small size of each probe feature (250 x 250 {mu}m{sup 2}) permits the inclusion of additional probes required to generate sequence information by hybridization.« less

  13. Optimizing multiple sequence alignments using a genetic algorithm based on three objectives: structural information, non-gaps percentage and totally conserved columns.

    PubMed

    Ortuño, Francisco M; Valenzuela, Olga; Rojas, Fernando; Pomares, Hector; Florido, Javier P; Urquiza, Jose M; Rojas, Ignacio

    2013-09-01

    Multiple sequence alignments (MSAs) are widely used approaches in bioinformatics to carry out other tasks such as structure predictions, biological function analyses or phylogenetic modeling. However, current tools usually provide partially optimal alignments, as each one is focused on specific biological features. Thus, the same set of sequences can produce different alignments, above all when sequences are less similar. Consequently, researchers and biologists do not agree about which is the most suitable way to evaluate MSAs. Recent evaluations tend to use more complex scores including further biological features. Among them, 3D structures are increasingly being used to evaluate alignments. Because structures are more conserved in proteins than sequences, scores with structural information are better suited to evaluate more distant relationships between sequences. The proposed multiobjective algorithm, based on the non-dominated sorting genetic algorithm, aims to jointly optimize three objectives: STRIKE score, non-gaps percentage and totally conserved columns. It was significantly assessed on the BAliBASE benchmark according to the Kruskal-Wallis test (P < 0.01). This algorithm also outperforms other aligners, such as ClustalW, Multiple Sequence Alignment Genetic Algorithm (MSA-GA), PRRP, DIALIGN, Hidden Markov Model Training (HMMT), Pattern-Induced Multi-sequence Alignment (PIMA), MULTIALIGN, Sequence Alignment Genetic Algorithm (SAGA), PILEUP, Rubber Band Technique Genetic Algorithm (RBT-GA) and Vertical Decomposition Genetic Algorithm (VDGA), according to the Wilcoxon signed-rank test (P < 0.05), whereas it shows results not significantly different to 3D-COFFEE (P > 0.05) with the advantage of being able to use less structures. Structural information is included within the objective function to evaluate more accurately the obtained alignments. The source code is available at http://www.ugr.es/~fortuno/MOSAStrE/MO-SAStrE.zip.

  14. Structure based re-design of the binding specificity of anti-apoptotic Bcl-xL

    PubMed Central

    Chen, T. Scott; Palacios, Hector; Keating, Amy E.

    2012-01-01

    Many native proteins are multi-specific and interact with numerous partners, which can confound analysis of their functions. Protein design provides a potential route to generating synthetic variants of native proteins with more selective binding profiles. Re-designed proteins could be used as research tools, diagnostics or therapeutics. In this work, we used a library screening approach to re-engineer the multi-specific anti-apoptotic protein Bcl-xL to remove its interactions with many of its binding partners, making it a high affinity and selective binder of the BH3 region of pro-apoptotic protein Bad. To overcome the enormity of the potential Bcl-xL sequence space, we developed and applied a computational/experimental framework that used protein structure information to generate focused combinatorial libraries. Sequence features were identified using structure-based modeling, and an optimization algorithm based on integer programming was used to select degenerate codons that maximally covered these features. A constraint on library size was used to ensure thorough sampling. Using yeast surface display to screen a designed library of Bcl-xL variants, we successfully identified a protein with ~1,000-fold improvement in binding specificity for the BH3 region of Bad over the BH3 region of Bim. Although negative design was targeted only against the BH3 region of Bim, the best re-designed protein was globally specific against binding to 10 other peptides corresponding to native BH3 motifs. Our design framework demonstrates an efficient route to highly specific protein binders and may readily be adapted for application to other design problems. PMID:23154169

  15. FALDO: a semantic standard for describing the location of nucleotide and protein feature annotation.

    PubMed

    Bolleman, Jerven T; Mungall, Christopher J; Strozzi, Francesco; Baran, Joachim; Dumontier, Michel; Bonnal, Raoul J P; Buels, Robert; Hoehndorf, Robert; Fujisawa, Takatomo; Katayama, Toshiaki; Cock, Peter J A

    2016-06-13

    Nucleotide and protein sequence feature annotations are essential to understand biology on the genomic, transcriptomic, and proteomic level. Using Semantic Web technologies to query biological annotations, there was no standard that described this potentially complex location information as subject-predicate-object triples. We have developed an ontology, the Feature Annotation Location Description Ontology (FALDO), to describe the positions of annotated features on linear and circular sequences. FALDO can be used to describe nucleotide features in sequence records, protein annotations, and glycan binding sites, among other features in coordinate systems of the aforementioned "omics" areas. Using the same data format to represent sequence positions that are independent of file formats allows us to integrate sequence data from multiple sources and data types. The genome browser JBrowse is used to demonstrate accessing multiple SPARQL endpoints to display genomic feature annotations, as well as protein annotations from UniProt mapped to genomic locations. Our ontology allows users to uniformly describe - and potentially merge - sequence annotations from multiple sources. Data sources using FALDO can prospectively be retrieved using federalised SPARQL queries against public SPARQL endpoints and/or local private triple stores.

  16. FALDO: a semantic standard for describing the location of nucleotide and protein feature annotation

    DOE PAGES

    Bolleman, Jerven T.; Mungall, Christopher J.; Strozzi, Francesco; ...

    2016-06-13

    Nucleotide and protein sequence feature annotations are essential to understand biology on the genomic, transcriptomic, and proteomic level. Using Semantic Web technologies to query biological annotations, there was no standard that described this potentially complex location information as subject-predicate-object triples. In this paper, we have developed an ontology, the Feature Annotation Location Description Ontology (FALDO), to describe the positions of annotated features on linear and circular sequences. FALDO can be used to describe nucleotide features in sequence records, protein annotations, and glycan binding sites, among other features in coordinate systems of the aforementioned “omics” areas. Using the same data formatmore » to represent sequence positions that are independent of file formats allows us to integrate sequence data from multiple sources and data types. The genome browser JBrowse is used to demonstrate accessing multiple SPARQL endpoints to display genomic feature annotations, as well as protein annotations from UniProt mapped to genomic locations. Our ontology allows users to uniformly describe – and potentially merge – sequence annotations from multiple sources. Finally, data sources using FALDO can prospectively be retrieved using federalised SPARQL queries against public SPARQL endpoints and/or local private triple stores.« less

  17. FALDO: a semantic standard for describing the location of nucleotide and protein feature annotation

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bolleman, Jerven T.; Mungall, Christopher J.; Strozzi, Francesco

    Nucleotide and protein sequence feature annotations are essential to understand biology on the genomic, transcriptomic, and proteomic level. Using Semantic Web technologies to query biological annotations, there was no standard that described this potentially complex location information as subject-predicate-object triples. In this paper, we have developed an ontology, the Feature Annotation Location Description Ontology (FALDO), to describe the positions of annotated features on linear and circular sequences. FALDO can be used to describe nucleotide features in sequence records, protein annotations, and glycan binding sites, among other features in coordinate systems of the aforementioned “omics” areas. Using the same data formatmore » to represent sequence positions that are independent of file formats allows us to integrate sequence data from multiple sources and data types. The genome browser JBrowse is used to demonstrate accessing multiple SPARQL endpoints to display genomic feature annotations, as well as protein annotations from UniProt mapped to genomic locations. Our ontology allows users to uniformly describe – and potentially merge – sequence annotations from multiple sources. Finally, data sources using FALDO can prospectively be retrieved using federalised SPARQL queries against public SPARQL endpoints and/or local private triple stores.« less

  18. Temporality of Features in Near-Death Experience Narratives

    PubMed Central

    Martial, Charlotte; Cassol, Héléna; Antonopoulos, Georgios; Charlier, Thomas; Heros, Julien; Donneau, Anne-Françoise; Charland-Verville, Vanessa; Laureys, Steven

    2017-01-01

    Background: After an occurrence of a Near-Death Experience (NDE), Near-Death Experiencers (NDErs) usually report extremely rich and detailed narratives. Phenomenologically, a NDE can be described as a set of distinguishable features. Some authors have proposed regular patterns of NDEs, however, the actual temporality sequence of NDE core features remains a little explored area. Objectives: The aim of the present study was to investigate the frequency distribution of these features (globally and according to the position of features in narratives) as well as the most frequently reported temporality sequences of features. Methods: We collected 154 French freely expressed written NDE narratives (i.e., Greyson NDE scale total score ≥ 7/32). A text analysis was conducted on all narratives in order to infer temporal ordering and frequency distribution of NDE features. Results: Our analyses highlighted the following most frequently reported sequence of consecutive NDE features: Out-of-Body Experience, Experiencing a tunnel, Seeing a bright light, Feeling of peace. Yet, this sequence was encountered in a very limited number of NDErs. Conclusion: These findings may suggest that NDEs temporality sequences can vary across NDErs. Exploring associations and relationships among features encountered during NDEs may complete the rigorous definition and scientific comprehension of the phenomenon. PMID:28659779

  19. Temporality of Features in Near-Death Experience Narratives.

    PubMed

    Martial, Charlotte; Cassol, Héléna; Antonopoulos, Georgios; Charlier, Thomas; Heros, Julien; Donneau, Anne-Françoise; Charland-Verville, Vanessa; Laureys, Steven

    2017-01-01

    Background: After an occurrence of a Near-Death Experience (NDE), Near-Death Experiencers (NDErs) usually report extremely rich and detailed narratives. Phenomenologically, a NDE can be described as a set of distinguishable features. Some authors have proposed regular patterns of NDEs, however, the actual temporality sequence of NDE core features remains a little explored area. Objectives: The aim of the present study was to investigate the frequency distribution of these features (globally and according to the position of features in narratives) as well as the most frequently reported temporality sequences of features. Methods: We collected 154 French freely expressed written NDE narratives (i.e., Greyson NDE scale total score ≥ 7/32). A text analysis was conducted on all narratives in order to infer temporal ordering and frequency distribution of NDE features. Results: Our analyses highlighted the following most frequently reported sequence of consecutive NDE features: Out-of-Body Experience, Experiencing a tunnel, Seeing a bright light, Feeling of peace. Yet, this sequence was encountered in a very limited number of NDErs. Conclusion: These findings may suggest that NDEs temporality sequences can vary across NDErs. Exploring associations and relationships among features encountered during NDEs may complete the rigorous definition and scientific comprehension of the phenomenon.

  20. Context influences on TALE–DNA binding revealed by quantitative profiling

    PubMed Central

    Rogers, Julia M.; Barrera, Luis A.; Reyon, Deepak; Sander, Jeffry D.; Kellis, Manolis; Joung, J Keith; Bulyk, Martha L.

    2015-01-01

    Transcription activator-like effector (TALE) proteins recognize DNA using a seemingly simple DNA-binding code, which makes them attractive for use in genome engineering technologies that require precise targeting. Although this code is used successfully to design TALEs to target specific sequences, off-target binding has been observed and is difficult to predict. Here we explore TALE–DNA interactions comprehensively by quantitatively assaying the DNA-binding specificities of 21 representative TALEs to ∼5,000–20,000 unique DNA sequences per protein using custom-designed protein-binding microarrays (PBMs). We find that protein context features exert significant influences on binding. Thus, the canonical recognition code does not fully capture the complexity of TALE–DNA binding. We used the PBM data to develop a computational model, Specificity Inference For TAL-Effector Design (SIFTED), to predict the DNA-binding specificity of any TALE. We provide SIFTED as a publicly available web tool that predicts potential genomic off-target sites for improved TALE design. PMID:26067805

  1. Context influences on TALE-DNA binding revealed by quantitative profiling.

    PubMed

    Rogers, Julia M; Barrera, Luis A; Reyon, Deepak; Sander, Jeffry D; Kellis, Manolis; Joung, J Keith; Bulyk, Martha L

    2015-06-11

    Transcription activator-like effector (TALE) proteins recognize DNA using a seemingly simple DNA-binding code, which makes them attractive for use in genome engineering technologies that require precise targeting. Although this code is used successfully to design TALEs to target specific sequences, off-target binding has been observed and is difficult to predict. Here we explore TALE-DNA interactions comprehensively by quantitatively assaying the DNA-binding specificities of 21 representative TALEs to ∼5,000-20,000 unique DNA sequences per protein using custom-designed protein-binding microarrays (PBMs). We find that protein context features exert significant influences on binding. Thus, the canonical recognition code does not fully capture the complexity of TALE-DNA binding. We used the PBM data to develop a computational model, Specificity Inference For TAL-Effector Design (SIFTED), to predict the DNA-binding specificity of any TALE. We provide SIFTED as a publicly available web tool that predicts potential genomic off-target sites for improved TALE design.

  2. Knowledge-transfer learning for prediction of matrix metalloprotease substrate-cleavage sites.

    PubMed

    Wang, Yanan; Song, Jiangning; Marquez-Lago, Tatiana T; Leier, André; Li, Chen; Lithgow, Trevor; Webb, Geoffrey I; Shen, Hong-Bin

    2017-07-18

    Matrix Metalloproteases (MMPs) are an important family of proteases that play crucial roles in key cellular and disease processes. Therefore, MMPs constitute important targets for drug design, development and delivery. Advanced proteomic technologies have identified type-specific target substrates; however, the complete repertoire of MMP substrates remains uncharacterized. Indeed, computational prediction of substrate-cleavage sites associated with MMPs is a challenging problem. This holds especially true when considering MMPs with few experimentally verified cleavage sites, such as for MMP-2, -3, -7, and -8. To fill this gap, we propose a new knowledge-transfer computational framework which effectively utilizes the hidden shared knowledge from some MMP types to enhance predictions of other, distinct target substrate-cleavage sites. Our computational framework uses support vector machines combined with transfer machine learning and feature selection. To demonstrate the value of the model, we extracted a variety of substrate sequence-derived features and compared the performance of our method using both 5-fold cross-validation and independent tests. The results show that our transfer-learning-based method provides a robust performance, which is at least comparable to traditional feature-selection methods for prediction of MMP-2, -3, -7, -8, -9 and -12 substrate-cleavage sites on independent tests. The results also demonstrate that our proposed computational framework provides a useful alternative for the characterization of sequence-level determinants of MMP-substrate specificity.

  3. The hunt for origins of DNA replication in multicellular eukaryotes

    PubMed Central

    Urban, John M.; Foulk, Michael S.; Casella, Cinzia

    2015-01-01

    Origins of DNA replication (ORIs) occur at defined regions in the genome. Although DNA sequence defines the position of ORIs in budding yeast, the factors for ORI specification remain elusive in metazoa. Several methods have been used recently to map ORIs in metazoan genomes with the hope that features for ORI specification might emerge. These methods are reviewed here with analysis of their advantages and shortcomings. The various factors that may influence ORI selection for initiation of DNA replication are discussed. PMID:25926981

  4. Classification of motor intent in transradial amputees using sonomyography and spatio-temporal image analysis

    NASA Astrophysics Data System (ADS)

    Hariharan, Harishwaran; Aklaghi, Nima; Baker, Clayton A.; Rangwala, Huzefa; Kosecka, Jana; Sikdar, Siddhartha

    2016-04-01

    In spite of major advances in biomechanical design of upper extremity prosthetics, these devices continue to lack intuitive control. Conventional myoelectric control strategies typically utilize electromyography (EMG) signal amplitude sensed from forearm muscles. EMG has limited specificity in resolving deep muscle activity and poor signal-to-noise ratio. We have been investigating alternative control strategies that rely on real-time ultrasound imaging that can overcome many of the limitations of EMG. In this work, we present an ultrasound image sequence classification method that utilizes spatiotemporal features to describe muscle activity and classify motor intent. Ultrasound images of the forearm muscles were obtained from able-bodied subjects and a trans-radial amputee while they attempted different hand movements. A grid-based approach is used to test the feasibility of using spatio-temporal features by classifying hand motions performed by the subjects. Using the leave-one-out cross validation on image sequences acquired from able-bodied subjects, we observe that the grid-based approach is able to discern four hand motions with 95.31% accuracy. In case of the trans-radial amputee, we are able to discern three hand motions with 80% accuracy. In a second set of experiments, we study classification accuracy by extracting spatio-temporal sub-sequences the depict activity due to the motion of local anatomical interfaces. Short time and space limited cuboidal sequences are initially extracted and assigned an optical flow behavior label, based on a response function. The image space is clustered based on the location of cuboids and features calculated from the cuboids in each cluster. Using sequences of known motions, we extract feature vectors that describe said motion. A K-nearest neighbor classifier is designed for classification experiments. Using the leave-one-out cross validation on image sequences for an amputee subject, we demonstrate that the classifier is able to discern three important hand motions with an accuracy of 93.33% accuracy, 91-100% precision and 80-100% recall rate. We anticipate that ultrasound imaging based methods will address some limitations of conventional myoelectric sensing, while adding advantages inherent to ultrasound imaging.

  5. PlantCAZyme: a database for plant carbohydrate-active enzymes

    PubMed Central

    Ekstrom, Alexander; Taujale, Rahil; McGinn, Nathan; Yin, Yanbin

    2014-01-01

    PlantCAZyme is a database built upon dbCAN (database for automated carbohydrate active enzyme annotation), aiming to provide pre-computed sequence and annotation data of carbohydrate active enzymes (CAZymes) to plant carbohydrate and bioenergy research communities. The current version contains data of 43 790 CAZymes of 159 protein families from 35 plants (including angiosperms, gymnosperms, lycophyte and bryophyte mosses) and chlorophyte algae with fully sequenced genomes. Useful features of the database include: (i) a BLAST server and a HMMER server that allow users to search against our pre-computed sequence data for annotation purpose, (ii) a download page to allow batch downloading data of a specific CAZyme family or species and (iii) protein browse pages to provide an easy access to the most comprehensive sequence and annotation data. Database URL: http://cys.bios.niu.edu/plantcazyme/ PMID:25125445

  6. ProDeGe: A computational protocol for fully automated decontamination of genomes

    DOE PAGES

    Tennessen, Kristin; Andersen, Evan; Clingenpeel, Scott; ...

    2015-06-09

    Single amplified genomes and genomes assembled from metagenomes have enabled the exploration of uncultured microorganisms at an unprecedented scale. However, both these types of products are plagued by contamination. Since these genomes are now being generated in a high-throughput manner and sequences from them are propagating into public databases to drive novel scientific discoveries, rigorous quality controls and decontamination protocols are urgently needed. Here, we present ProDeGe (Protocol for fully automated Decontamination of Genomes), the first computational protocol for fully automated decontamination of draft genomes. ProDeGe classifies sequences into two classes—clean and contaminant—using a combination of homology and feature-based methodologies.more » On average, 84% of sequence from the non-target organism is removed from the data set (specificity) and 84% of the sequence from the target organism is retained (sensitivity). Lastly, the procedure operates successfully at a rate of ~0.30 CPU core hours per megabase of sequence and can be applied to any type of genome sequence.« less

  7. ProDeGe: A computational protocol for fully automated decontamination of genomes

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Tennessen, Kristin; Andersen, Evan; Clingenpeel, Scott

    Single amplified genomes and genomes assembled from metagenomes have enabled the exploration of uncultured microorganisms at an unprecedented scale. However, both these types of products are plagued by contamination. Since these genomes are now being generated in a high-throughput manner and sequences from them are propagating into public databases to drive novel scientific discoveries, rigorous quality controls and decontamination protocols are urgently needed. Here, we present ProDeGe (Protocol for fully automated Decontamination of Genomes), the first computational protocol for fully automated decontamination of draft genomes. ProDeGe classifies sequences into two classes—clean and contaminant—using a combination of homology and feature-based methodologies.more » On average, 84% of sequence from the non-target organism is removed from the data set (specificity) and 84% of the sequence from the target organism is retained (sensitivity). Lastly, the procedure operates successfully at a rate of ~0.30 CPU core hours per megabase of sequence and can be applied to any type of genome sequence.« less

  8. Sequence signatures of allosteric proteins towards rational design.

    PubMed

    Namboodiri, Saritha; Verma, Chandra; Dhar, Pawan K; Giuliani, Alessandro; Nair, Achuthsankar S

    2010-12-01

    Allostery is the phenomenon of changes in the structure and activity of proteins that appear as a consequence of ligand binding at sites other than the active site. Studying mechanistic basis of allostery leading to protein design with predetermined functional endpoints is an important unmet need of synthetic biology. Here, we screened the amino acid sequence landscape in search of sequence-signatures of allostery using Recurrence Quantitative Analysis (RQA) method. A characteristic vector, comprised of 10 features extracted from RQA was defined for amino acid sequences. Using Principal Component Analysis, four factors were found to be important determinants of allosteric behavior. Our sequence-based predictor method shows 82.6% accuracy, 85.7% sensitivity and 77.9% specificity with the current dataset. Further, we show that Laminarity-Mean-hydrophobicity representing repeated hydrophobic patches is the most crucial indicator of allostery. To our best knowledge this is the first report that describes sequence determinants of allostery based on hydrophobicity. As an outcome of these findings, we plan to explore possibility of inducing allostery in proteins.

  9. Carnivore-specific SINEs (Can-SINEs): distribution, evolution, and genomic impact.

    PubMed

    Walters-Conte, Kathryn B; Johnson, Diana L E; Allard, Marc W; Pecon-Slattery, Jill

    2011-01-01

    Short interspersed nuclear elements (SINEs) are a type of class 1 transposable element (retrotransposon) with features that allow investigators to resolve evolutionary relationships between populations and species while providing insight into genome composition and function. Characterization of a Carnivora-specific SINE family, Can-SINEs, has, has aided comparative genomic studies by providing rare genomic changes, and neutral sequence variants often needed to resolve difficult evolutionary questions. In addition, Can-SINEs constitute a significant source of functional diversity with Carnivora. Publication of the whole-genome sequence of domestic dog, domestic cat, and giant panda serves as a valuable resource in comparative genomic inferences gleaned from Can-SINEs. In anticipation of forthcoming studies bolstered by new genomic data, this review describes the discovery and characterization of Can-SINE motifs as well as describes composition, distribution, and effect on genome function. As the contribution of noncoding sequences to genomic diversity becomes more apparent, SINEs and other transposable elements will play an increasingly large role in mammalian comparative genomics.

  10. Carnivore-Specific SINEs (Can-SINEs): Distribution, Evolution, and Genomic Impact

    PubMed Central

    Johnson, Diana L.E.; Allard, Marc W.; Pecon-Slattery, Jill

    2011-01-01

    Short interspersed nuclear elements (SINEs) are a type of class 1 transposable element (retrotransposon) with features that allow investigators to resolve evolutionary relationships between populations and species while providing insight into genome composition and function. Characterization of a Carnivora-specific SINE family, Can-SINEs, has, has aided comparative genomic studies by providing rare genomic changes, and neutral sequence variants often needed to resolve difficult evolutionary questions. In addition, Can-SINEs constitute a significant source of functional diversity with Carnivora. Publication of the whole-genome sequence of domestic dog, domestic cat, and giant panda serves as a valuable resource in comparative genomic inferences gleaned from Can-SINEs. In anticipation of forthcoming studies bolstered by new genomic data, this review describes the discovery and characterization of Can-SINE motifs as well as describes composition, distribution, and effect on genome function. As the contribution of noncoding sequences to genomic diversity becomes more apparent, SINEs and other transposable elements will play an increasingly large role in mammalian comparative genomics. PMID:21846743

  11. Unexpected features of the dark proteome.

    PubMed

    Perdigão, Nelson; Heinrich, Julian; Stolte, Christian; Sabir, Kenneth S; Buckley, Michael J; Tabor, Bruce; Signal, Beth; Gloss, Brian S; Hammang, Christopher J; Rost, Burkhard; Schafferhans, Andrea; O'Donoghue, Seán I

    2015-12-29

    We surveyed the "dark" proteome-that is, regions of proteins never observed by experimental structure determination and inaccessible to homology modeling. For 546,000 Swiss-Prot proteins, we found that 44-54% of the proteome in eukaryotes and viruses was dark, compared with only ∼14% in archaea and bacteria. Surprisingly, most of the dark proteome could not be accounted for by conventional explanations, such as intrinsic disorder or transmembrane regions. Nearly half of the dark proteome comprised dark proteins, in which the entire sequence lacked similarity to any known structure. Dark proteins fulfill a wide variety of functions, but a subset showed distinct and largely unexpected features, such as association with secretion, specific tissues, the endoplasmic reticulum, disulfide bonding, and proteolytic cleavage. Dark proteins also had short sequence length, low evolutionary reuse, and few known interactions with other proteins. These results suggest new research directions in structural and computational biology.

  12. Single helically folded aromatic oligoamides that mimic the charge surface of double-stranded B-DNA

    NASA Astrophysics Data System (ADS)

    Ziach, Krzysztof; Chollet, Céline; Parissi, Vincent; Prabhakaran, Panchami; Marchivie, Mathieu; Corvaglia, Valentina; Bose, Partha Pratim; Laxmi-Reddy, Katta; Godde, Frédéric; Schmitter, Jean-Marie; Chaignepain, Stéphane; Pourquier, Philippe; Huc, Ivan

    2018-05-01

    Numerous essential biomolecular processes require the recognition of DNA surface features by proteins. Molecules mimicking these features could potentially act as decoys and interfere with pharmacologically or therapeutically relevant protein-DNA interactions. Although naturally occurring DNA-mimicking proteins have been described, synthetic tunable molecules that mimic the charge surface of double-stranded DNA are not known. Here, we report the design, synthesis and structural characterization of aromatic oligoamides that fold into single helical conformations and display a double helical array of negatively charged residues in positions that match the phosphate moieties in B-DNA. These molecules were able to inhibit several enzymes possessing non-sequence-selective DNA-binding properties, including topoisomerase 1 and HIV-1 integrase, presumably through specific foldamer-protein interactions, whereas sequence-selective enzymes were not inhibited. Such modular and synthetically accessible DNA mimics provide a versatile platform to design novel inhibitors of protein-DNA interactions.

  13. Unexpected features of the dark proteome

    PubMed Central

    Perdigão, Nelson; Heinrich, Julian; Stolte, Christian; Sabir, Kenneth S.; Buckley, Michael J.; Tabor, Bruce; Signal, Beth; Gloss, Brian S.; Hammang, Christopher J.; Rost, Burkhard; Schafferhans, Andrea

    2015-01-01

    We surveyed the “dark” proteome–that is, regions of proteins never observed by experimental structure determination and inaccessible to homology modeling. For 546,000 Swiss-Prot proteins, we found that 44–54% of the proteome in eukaryotes and viruses was dark, compared with only ∼14% in archaea and bacteria. Surprisingly, most of the dark proteome could not be accounted for by conventional explanations, such as intrinsic disorder or transmembrane regions. Nearly half of the dark proteome comprised dark proteins, in which the entire sequence lacked similarity to any known structure. Dark proteins fulfill a wide variety of functions, but a subset showed distinct and largely unexpected features, such as association with secretion, specific tissues, the endoplasmic reticulum, disulfide bonding, and proteolytic cleavage. Dark proteins also had short sequence length, low evolutionary reuse, and few known interactions with other proteins. These results suggest new research directions in structural and computational biology. PMID:26578815

  14. SMOQ: a tool for predicting the absolute residue-specific quality of a single protein model with support vector machines

    PubMed Central

    2014-01-01

    Background It is important to predict the quality of a protein structural model before its native structure is known. The method that can predict the absolute local quality of individual residues in a single protein model is rare, yet particularly needed for using, ranking and refining protein models. Results We developed a machine learning tool (SMOQ) that can predict the distance deviation of each residue in a single protein model. SMOQ uses support vector machines (SVM) with protein sequence and structural features (i.e. basic feature set), including amino acid sequence, secondary structures, solvent accessibilities, and residue-residue contacts to make predictions. We also trained a SVM model with two new additional features (profiles and SOV scores) on 20 CASP8 targets and found that including them can only improve the performance when real deviations between native and model are higher than 5Å. The SMOQ tool finally released uses the basic feature set trained on 85 CASP8 targets. Moreover, SMOQ implemented a way to convert predicted local quality scores into a global quality score. SMOQ was tested on the 84 CASP9 single-domain targets. The average difference between the residue-specific distance deviation predicted by our method and the actual distance deviation on the test data is 2.637Å. The global quality prediction accuracy of the tool is comparable to other good tools on the same benchmark. Conclusion SMOQ is a useful tool for protein single model quality assessment. Its source code and executable are available at: http://sysbio.rnet.missouri.edu/multicom_toolbox/. PMID:24776231

  15. SMOQ: a tool for predicting the absolute residue-specific quality of a single protein model with support vector machines.

    PubMed

    Cao, Renzhi; Wang, Zheng; Wang, Yiheng; Cheng, Jianlin

    2014-04-28

    It is important to predict the quality of a protein structural model before its native structure is known. The method that can predict the absolute local quality of individual residues in a single protein model is rare, yet particularly needed for using, ranking and refining protein models. We developed a machine learning tool (SMOQ) that can predict the distance deviation of each residue in a single protein model. SMOQ uses support vector machines (SVM) with protein sequence and structural features (i.e. basic feature set), including amino acid sequence, secondary structures, solvent accessibilities, and residue-residue contacts to make predictions. We also trained a SVM model with two new additional features (profiles and SOV scores) on 20 CASP8 targets and found that including them can only improve the performance when real deviations between native and model are higher than 5Å. The SMOQ tool finally released uses the basic feature set trained on 85 CASP8 targets. Moreover, SMOQ implemented a way to convert predicted local quality scores into a global quality score. SMOQ was tested on the 84 CASP9 single-domain targets. The average difference between the residue-specific distance deviation predicted by our method and the actual distance deviation on the test data is 2.637Å. The global quality prediction accuracy of the tool is comparable to other good tools on the same benchmark. SMOQ is a useful tool for protein single model quality assessment. Its source code and executable are available at: http://sysbio.rnet.missouri.edu/multicom_toolbox/.

  16. Biophysical and structural considerations for protein sequence evolution

    PubMed Central

    2011-01-01

    Background Protein sequence evolution is constrained by the biophysics of folding and function, causing interdependence between interacting sites in the sequence. However, current site-independent models of sequence evolutions do not take this into account. Recent attempts to integrate the influence of structure and biophysics into phylogenetic models via statistical/informational approaches have not resulted in expected improvements in model performance. This suggests that further innovations are needed for progress in this field. Results Here we develop a coarse-grained physics-based model of protein folding and binding function, and compare it to a popular informational model. We find that both models violate the assumption of the native sequence being close to a thermodynamic optimum, causing directional selection away from the native state. Sampling and simulation show that the physics-based model is more specific for fold-defining interactions that vary less among residue type. The informational model diffuses further in sequence space with fewer barriers and tends to provide less support for an invariant sites model, although amino acid substitutions are generally conservative. Both approaches produce sequences with natural features like dN/dS < 1 and gamma-distributed rates across sites. Conclusions Simple coarse-grained models of protein folding can describe some natural features of evolving proteins but are currently not accurate enough to use in evolutionary inference. This is partly due to improper packing of the hydrophobic core. We suggest possible improvements on the representation of structure, folding energy, and binding function, as regards both native and non-native conformations, and describe a large number of possible applications for such a model. PMID:22171550

  17. Conservation of the Human Integrin-Type Beta-Propeller Domain in Bacteria

    PubMed Central

    Chouhan, Bhanupratap; Denesyuk, Alexander; Heino, Jyrki; Johnson, Mark S.; Denessiouk, Konstantin

    2011-01-01

    Integrins are heterodimeric cell-surface receptors with key functions in cell-cell and cell-matrix adhesion. Integrin α and β subunits are present throughout the metazoans, but it is unclear whether the subunits predate the origin of multicellular organisms. Several component domains have been detected in bacteria, one of which, a specific 7-bladed β-propeller domain, is a unique feature of the integrin α subunits. Here, we describe a structure-derived motif, which incorporates key features of each blade from the X-ray structures of human αIIbβ3 and αVβ3, includes elements of the FG-GAP/Cage and Ca2+-binding motifs, and is specific only for the metazoan integrin domains. Separately, we searched for the metazoan integrin type β-propeller domains among all available sequences from bacteria and unicellular eukaryotic organisms, which must incorporate seven repeats, corresponding to the seven blades of the β-propeller domain, and so that the newly found structure-derived motif would exist in every repeat. As the result, among 47 available genomes of unicellular eukaryotes we could not find a single instance of seven repeats with the motif. Several sequences contained three repeats, a predicted transmembrane segment, and a short cytoplasmic motif associated with some integrins, but otherwise differ from the metazoan integrin α subunits. Among the available bacterial sequences, we found five examples containing seven sequential metazoan integrin-specific motifs within the seven repeats. The motifs differ in having one Ca2+-binding site per repeat, whereas metazoan integrins have three or four sites. The bacterial sequences are more conserved in terms of motif conservation and loop length, suggesting that the structure is more regular and compact than those example structures from human integrins. Although the bacterial examples are not full-length integrins, the full-length metazoan-type 7-bladed β-propeller domains are present, and sometimes two tandem copies are found. PMID:22022374

  18. Grading of Gliomas by Using Radiomic Features on Multiple Magnetic Resonance Imaging (MRI) Sequences.

    PubMed

    Qin, Jiang-Bo; Liu, Zhenyu; Zhang, Hui; Shen, Chen; Wang, Xiao-Chun; Tan, Yan; Wang, Shuo; Wu, Xiao-Feng; Tian, Jie

    2017-05-07

    BACKGROUND Gliomas are the most common primary brain neoplasms. Misdiagnosis occurs in glioma grading due to an overlap in conventional MRI manifestations. The aim of the present study was to evaluate the power of radiomic features based on multiple MRI sequences - T2-Weighted-Imaging-FLAIR (FLAIR), T1-Weighted-Imaging-Contrast-Enhanced (T1-CE), and Apparent Diffusion Coefficient (ADC) map - in glioma grading, and to improve the power of glioma grading by combining features. MATERIAL AND METHODS Sixty-six patients with histopathologically proven gliomas underwent T2-FLAIR and T1WI-CE sequence scanning with some patients (n=63) also undergoing DWI scanning. A total of 114 radiomic features were derived with radiomic methods by using in-house software. All radiomic features were compared between high-grade gliomas (HGGs) and low-grade gliomas (LGGs). Features with significant statistical differences were selected for receiver operating characteristic (ROC) curve analysis. The relationships between significantly different radiomic features and glial fibrillary acidic protein (GFAP) expression were evaluated. RESULTS A total of 8 radiomic features from 3 MRI sequences displayed significant differences between LGGs and HGGs. FLAIR GLCM Cluster Shade, T1-CE GLCM Entropy, and ADC GLCM Homogeneity were the best features to use in differentiating LGGs and HGGs in each MRI sequence. The combined feature was best able to differentiate LGGs and HGGs, which improved the accuracy of glioma grading compared to the above features in each MRI sequence. A significant correlation was found between GFAP and T1-CE GLCM Entropy, as well as between GFAP and ADC GLCM Homogeneity. CONCLUSIONS The combined radiomic feature had the highest efficacy in distinguishing LGGs from HGGs.

  19. Photosynthetic Mediterranean meadow orchids feature partial mycoheterotrophy and specific mycorrhizal associations.

    PubMed

    Girlanda, Mariangela; Segreto, Rossana; Cafasso, Donata; Liebel, Heiko Tobias; Rodda, Michele; Ercole, Enrico; Cozzolino, Salvatore; Gebauer, Gerhard; Perotto, Silvia

    2011-07-01

    We investigated whether four widespread, photosynthetic Mediterranean meadow orchids (Ophrys fuciflora, Anacamptis laxiflora, Orchis purpurea, and Serapias vomeracea) had either nutritional dependency on mycobionts or mycorrhizal fungal specificity. Nonphotosynthetic orchids generally engage in highly specific interactions with fungal symbionts that provide them with organic carbon. By contrast, fully photosynthetic orchids in sunny, meadow habitats have been considered to lack mycorrhizal specificity. We performed both culture-dependent and culture-independent ITS sequence analysis to identify fungi from orchid roots. By analyzing stable isotope ((13)C and (15)N) natural abundances, we also determined the degree of autotrophy and mycoheterotrophy in the four orchid species. Phylogenetic and multivariate comparisons indicated that Or. purpurea and Oph. fuciflora featured lower fungal diversity and more specific mycobiont spectra than A. laxiflora and S. vomeracea. All orchid species were significantly enriched in (15)N compared with neighboring non-orchid plants. Orchis purpurea had the most pronounced N gain from fungi and differed from the other orchids in also obtaining C from fungi. These results indicated that even in sunny Mediterranean meadows, orchids may be mycoheterotrophic, with correlated mycorrhizal fungal specificity.

  20. bpRNA: large-scale automated annotation and analysis of RNA secondary structure.

    PubMed

    Danaee, Padideh; Rouches, Mason; Wiley, Michelle; Deng, Dezhong; Huang, Liang; Hendrix, David

    2018-05-09

    While RNA secondary structure prediction from sequence data has made remarkable progress, there is a need for improved strategies for annotating the features of RNA secondary structures. Here, we present bpRNA, a novel annotation tool capable of parsing RNA structures, including complex pseudoknot-containing RNAs, to yield an objective, precise, compact, unambiguous, easily-interpretable description of all loops, stems, and pseudoknots, along with the positions, sequence, and flanking base pairs of each such structural feature. We also introduce several new informative representations of RNA structure types to improve structure visualization and interpretation. We have further used bpRNA to generate a web-accessible meta-database, 'bpRNA-1m', of over 100 000 single-molecule, known secondary structures; this is both more fully and accurately annotated and over 20-times larger than existing databases. We use a subset of the database with highly similar (≥90% identical) sequences filtered out to report on statistical trends in sequence, flanking base pairs, and length. Both the bpRNA method and the bpRNA-1m database will be valuable resources both for specific analysis of individual RNA molecules and large-scale analyses such as are useful for updating RNA energy parameters for computational thermodynamic predictions, improving machine learning models for structure prediction, and for benchmarking structure-prediction algorithms.

  1. SIGMAR1 mutation associated with autosomal recessive Silver-like syndrome

    PubMed Central

    Horga, Alejandro; Tomaselli, Pedro J.; Gonzalez, Michael A.; Laurà, Matilde; Muntoni, Francesco; Manzur, Adnan Y.; Hanna, Michael G.; Blake, Julian C.; Houlden, Henry; Züchner, Stephan

    2016-01-01

    Objective: To describe the genetic and clinical features of a simplex patient with distal hereditary motor neuropathy (dHMN) and lower limb spasticity (Silver-like syndrome) due to a mutation in the sigma nonopioid intracellular receptor–1 gene (SIGMAR1) and review the phenotypic spectrum of mutations in this gene. Methods: We used whole-exome sequencing to investigate the proband. The variants of interest were investigated for segregation in the family using Sanger sequencing. Subsequently, a larger cohort of 16 unrelated dHMN patients was specifically screened for SIGMAR1 mutations. Results: In the proband, we identified a homozygous missense variant (c.194T>A, p.Leu65Gln) in exon 2 of SIGMAR1 as the probable causative mutation. Pathogenicity is supported by evolutionary conservation, in silico analyses, and the strong phenotypic similarities with previously reported cases carrying coding sequence mutations in SIGMAR1. No other mutations were identified in 16 additional patients with dHMN. Conclusions: We suggest that coding sequence mutations in SIGMAR1 present clinically with a combination of dHMN and pyramidal tract signs, with or without spasticity, in the lower limbs. Preferential involvement of extensor muscles of the upper limbs may be a distinctive feature of the disease. These observations should be confirmed in future studies. PMID:27629094

  2. SIGMAR1 mutation associated with autosomal recessive Silver-like syndrome.

    PubMed

    Horga, Alejandro; Tomaselli, Pedro J; Gonzalez, Michael A; Laurà, Matilde; Muntoni, Francesco; Manzur, Adnan Y; Hanna, Michael G; Blake, Julian C; Houlden, Henry; Züchner, Stephan; Reilly, Mary M

    2016-10-11

    To describe the genetic and clinical features of a simplex patient with distal hereditary motor neuropathy (dHMN) and lower limb spasticity (Silver-like syndrome) due to a mutation in the sigma nonopioid intracellular receptor-1 gene (SIGMAR1) and review the phenotypic spectrum of mutations in this gene. We used whole-exome sequencing to investigate the proband. The variants of interest were investigated for segregation in the family using Sanger sequencing. Subsequently, a larger cohort of 16 unrelated dHMN patients was specifically screened for SIGMAR1 mutations. In the proband, we identified a homozygous missense variant (c.194T>A, p.Leu65Gln) in exon 2 of SIGMAR1 as the probable causative mutation. Pathogenicity is supported by evolutionary conservation, in silico analyses, and the strong phenotypic similarities with previously reported cases carrying coding sequence mutations in SIGMAR1. No other mutations were identified in 16 additional patients with dHMN. We suggest that coding sequence mutations in SIGMAR1 present clinically with a combination of dHMN and pyramidal tract signs, with or without spasticity, in the lower limbs. Preferential involvement of extensor muscles of the upper limbs may be a distinctive feature of the disease. These observations should be confirmed in future studies. © 2016 American Academy of Neurology.

  3. EffectorP: predicting fungal effector proteins from secretomes using machine learning.

    PubMed

    Sperschneider, Jana; Gardiner, Donald M; Dodds, Peter N; Tini, Francesco; Covarelli, Lorenzo; Singh, Karam B; Manners, John M; Taylor, Jennifer M

    2016-04-01

    Eukaryotic filamentous plant pathogens secrete effector proteins that modulate the host cell to facilitate infection. Computational effector candidate identification and subsequent functional characterization delivers valuable insights into plant-pathogen interactions. However, effector prediction in fungi has been challenging due to a lack of unifying sequence features such as conserved N-terminal sequence motifs. Fungal effectors are commonly predicted from secretomes based on criteria such as small size and cysteine-rich, which suffers from poor accuracy. We present EffectorP which pioneers the application of machine learning to fungal effector prediction. EffectorP improves fungal effector prediction from secretomes based on a robust signal of sequence-derived properties, achieving sensitivity and specificity of over 80%. Features that discriminate fungal effectors from secreted noneffectors are predominantly sequence length, molecular weight and protein net charge, as well as cysteine, serine and tryptophan content. We demonstrate that EffectorP is powerful when combined with in planta expression data for predicting high-priority effector candidates. EffectorP is the first prediction program for fungal effectors based on machine learning. Our findings will facilitate functional fungal effector studies and improve our understanding of effectors in plant-pathogen interactions. EffectorP is available at http://effectorp.csiro.au. © 2015 CSIRO New Phytologist © 2015 New Phytologist Trust.

  4. DNA minor groove electrostatic potential: influence of sequence-specific transitions of the torsion angle gamma and deoxyribose conformations.

    PubMed

    Zhitnikova, M Y; Shestopalova, A V

    2017-11-01

    The structural adjustments of the sugar-phosphate DNA backbone (switching of the γ angle (O5'-C5'-C4'-C3') from canonical to alternative conformations and/or C2'-endo → C3'-endo transition of deoxyribose) lead to the sequence-specific changes in accessible surface area of both polar and non-polar atoms of the grooves and the polar/hydrophobic profile of the latter ones. The distribution of the minor groove electrostatic potential is likely to be changing as a result of such conformational rearrangements in sugar-phosphate DNA backbone. Our analysis of the crystal structures of the short free DNA fragments and calculation of their electrostatic potentials allowed us to determine: (1) the number of classical and alternative γ angle conformations in the free B-DNA; (2) changes in the minor groove electrostatic potential, depending on the conformation of the sugar-phosphate DNA backbone; (3) the effect of the DNA sequence on the minor groove electrostatic potential. We have demonstrated that the structural adjustments of the DNA double helix (the conformations of the sugar-phosphate backbone and the minor groove dimensions) induce changes in the distribution of the minor groove electrostatic potential and are sequence-specific. Therefore, these features of the minor groove sizes and distribution of minor groove electrostatic potential can be used as a signal for recognition of the target DNA sequence by protein in the implementation of the indirect readout mechanism.

  5. Memory for sequences of events impaired in typical aging.

    PubMed

    Allen, Timothy A; Morris, Andrea M; Stark, Shauna M; Fortin, Norbert J; Stark, Craig E L

    2015-03-01

    Typical aging is associated with diminished episodic memory performance. To improve our understanding of the fundamental mechanisms underlying this age-related memory deficit, we previously developed an integrated, cross-species approach to link converging evidence from human and animal research. This novel approach focuses on the ability to remember sequences of events, an important feature of episodic memory. Unlike existing paradigms, this task is nonspatial, nonverbal, and can be used to isolate different cognitive processes that may be differentially affected in aging. Here, we used this task to make a comprehensive comparison of sequence memory performance between younger (18-22 yr) and older adults (62-86 yr). Specifically, participants viewed repeated sequences of six colored, fractal images and indicated whether each item was presented "in sequence" or "out of sequence." Several out of sequence probe trials were used to provide a detailed assessment of sequence memory, including: (i) repeating an item from earlier in the sequence ("Repeats"; e.g., AB A: DEF), (ii) skipping ahead in the sequence ("Skips"; e.g., AB D: DEF), and (iii) inserting an item from a different sequence into the same ordinal position ("Ordinal Transfers"; e.g., AB 3: DEF). We found that older adults performed as well as younger controls when tested on well-known and predictable sequences, but were severely impaired when tested using novel sequences. Importantly, overall sequence memory performance in older adults steadily declined with age, a decline not detected with other measures (RAVLT or BPS-O). We further characterized this deficit by showing that performance of older adults was severely impaired on specific probe trials that required detailed knowledge of the sequence (Skips and Ordinal Transfers), and was associated with a shift in their underlying mnemonic representation of the sequences. Collectively, these findings provide unambiguous evidence that the capacity to remember sequences of events is fundamentally affected by typical aging. © 2015 Allen et al.; Published by Cold Spring Harbor Laboratory Press.

  6. Classification of G-protein coupled receptors based on a rich generation of convolutional neural network, N-gram transformation and multiple sequence alignments.

    PubMed

    Li, Man; Ling, Cheng; Xu, Qi; Gao, Jingyang

    2018-02-01

    Sequence classification is crucial in predicting the function of newly discovered sequences. In recent years, the prediction of the incremental large-scale and diversity of sequences has heavily relied on the involvement of machine-learning algorithms. To improve prediction accuracy, these algorithms must confront the key challenge of extracting valuable features. In this work, we propose a feature-enhanced protein classification approach, considering the rich generation of multiple sequence alignment algorithms, N-gram probabilistic language model and the deep learning technique. The essence behind the proposed method is that if each group of sequences can be represented by one feature sequence, composed of homologous sites, there should be less loss when the sequence is rebuilt, when a more relevant sequence is added to the group. On the basis of this consideration, the prediction becomes whether a query sequence belonging to a group of sequences can be transferred to calculate the probability that the new feature sequence evolves from the original one. The proposed work focuses on the hierarchical classification of G-protein Coupled Receptors (GPCRs), which begins by extracting the feature sequences from the multiple sequence alignment results of the GPCRs sub-subfamilies. The N-gram model is then applied to construct the input vectors. Finally, these vectors are imported into a convolutional neural network to make a prediction. The experimental results elucidate that the proposed method provides significant performance improvements. The classification error rate of the proposed method is reduced by at least 4.67% (family level I) and 5.75% (family Level II), in comparison with the current state-of-the-art methods. The implementation program of the proposed work is freely available at: https://github.com/alanFchina/CNN .

  7. Identification, characterization and expression analysis of lineage-specific genes within sweet orange (Citrus sinensis).

    PubMed

    Xu, Yuantao; Wu, Guizhi; Hao, Baohai; Chen, Lingling; Deng, Xiuxin; Xu, Qiang

    2015-11-23

    With the availability of rapidly increasing number of genome and transcriptome sequences, lineage-specific genes (LSGs) can be identified and characterized. Like other conserved functional genes, LSGs play important roles in biological evolution and functions. Two set of citrus LSGs, 296 citrus-specific genes (CSGs) and 1039 orphan genes specific to sweet orange, were identified by comparative analysis between the sweet orange genome sequences and 41 genomes and 273 transcriptomes. With the two sets of genes, gene structure and gene expression pattern were investigated. On average, both the CSGs and orphan genes have fewer exons, shorter gene length and higher GC content when compared with those evolutionarily conserved genes (ECs). Expression profiling indicated that most of the LSGs expressed in various tissues of sweet orange and some of them exhibited distinct temporal and spatial expression patterns. Particularly, the orphan genes were preferentially expressed in callus, which is an important pluripotent tissue of citrus. Besides, part of the CSGs and orphan genes expressed responsive to abiotic stress, indicating their potential functions during interaction with environment. This study identified and characterized two sets of LSGs in citrus, dissected their sequence features and expression patterns, and provided valuable clues for future functional analysis of the LSGs in sweet orange.

  8. Role of the chromatin landscape and sequence in determining cell type-specific genomic glucocorticoid receptor binding and gene regulation

    PubMed Central

    Huska, Matthew R.; Jurk, Marcel; Schöpflin, Robert; Starick, Stephan R.; Schwahn, Kevin; Cooper, Samantha B.; Yamamoto, Keith R.; Thomas-Chollier, Morgane; Vingron, Martin

    2017-01-01

    Abstract The genomic loci bound by the glucocorticoid receptor (GR), a hormone-activated transcription factor, show little overlap between cell types. To study the role of chromatin and sequence in specifying where GR binds, we used Bayesian modeling within the universe of accessible chromatin. Taken together, our results uncovered that although GR preferentially binds accessible chromatin, its binding is biased against accessible chromatin located at promoter regions. This bias can only be explained partially by the presence of fewer GR recognition sequences, arguing for the existence of additional mechanisms that interfere with GR binding at promoters. Therefore, we tested the role of H3K9ac, the chromatin feature with the strongest negative association with GR binding, but found that this correlation does not reflect a causative link. Finally, we find a higher percentage of promoter–proximal GR binding for genes regulated by GR across cell types than for cell type-specific target genes. Given that GR almost exclusively binds accessible chromatin, we propose that cell type-specific regulation by GR preferentially occurs via distal enhancers, whose chromatin accessibility is typically cell type-specific, whereas ubiquitous target gene regulation is more likely to result from binding to promoter regions, which are often accessible regardless of cell type examined. PMID:27903902

  9. A brief introduction to web-based genome browsers.

    PubMed

    Wang, Jun; Kong, Lei; Gao, Ge; Luo, Jingchu

    2013-03-01

    Genome browser provides a graphical interface for users to browse, search, retrieve and analyze genomic sequence and annotation data. Web-based genome browsers can be classified into general genome browsers with multiple species and species-specific genome browsers. In this review, we attempt to give an overview for the main functions and features of web-based genome browsers, covering data visualization, retrieval, analysis and customization. To give a brief introduction to the multiple-species genome browser, we describe the user interface and main functions of the Ensembl and UCSC genome browsers using the human alpha-globin gene cluster as an example. We further use the MSU and the Rice-Map genome browsers to show some special features of species-specific genome browser, taking a rice transcription factor gene OsSPL14 as an example.

  10. htsint: a Python library for sequencing pipelines that combines data through gene set generation.

    PubMed

    Richards, Adam J; Herrel, Anthony; Bonneaud, Camille

    2015-09-24

    Sequencing technologies provide a wealth of details in terms of genes, expression, splice variants, polymorphisms, and other features. A standard for sequencing analysis pipelines is to put genomic or transcriptomic features into a context of known functional information, but the relationships between ontology terms are often ignored. For RNA-Seq, considering genes and their genetic variants at the group level enables a convenient way to both integrate annotation data and detect small coordinated changes between experimental conditions, a known caveat of gene level analyses. We introduce the high throughput data integration tool, htsint, as an extension to the commonly used gene set enrichment frameworks. The central aim of htsint is to compile annotation information from one or more taxa in order to calculate functional distances among all genes in a specified gene space. Spectral clustering is then used to partition the genes, thereby generating functional modules. The gene space can range from a targeted list of genes, like a specific pathway, all the way to an ensemble of genomes. Given a collection of gene sets and a count matrix of transcriptomic features (e.g. expression, polymorphisms), the gene sets produced by htsint can be tested for 'enrichment' or conditional differences using one of a number of commonly available packages. The database and bundled tools to generate functional modules were designed with sequencing pipelines in mind, but the toolkit nature of htsint allows it to also be used in other areas of genomics. The software is freely available as a Python library through GitHub at https://github.com/ajrichards/htsint.

  11. Complete sequence of HLA-B27 cDNA identified through the characterization of structural markers unique to the HLA-A, -B, and -C allelic series

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Szoets, H.; Reithmueller, G.; Weiss, E.

    1986-03-01

    Antigen HLA-B27 is a high-risk genetic factor with respect to a group of rheumatoid disorders, especially ankylosing spondylitis. A cDNA library was constructed from an autozygous B-cell line expressing HLA-B27, HLA-Cw1, and the previously cloned HLA-A2 antigen. Clones detected with an HLA probe were isolated and sorted into homology groups by differential hybridization and restriction maps. Nucleotide sequencing allowed the unambiguous assignment of cDNAs to HLA-A, -B, and -C loci. The HLA-B27 mRNA has the structure features and the codon variability typical of an HLA class I transcript but it specifies two uncommon amino acid replacements: a cysteine in positionmore » 67 and a serine in position 131. The latter substitution may have functional consequences, because it occurs in a conserved region and at a position invariably occupied by a species-specific arginine in humans and lysine in mice. The availability of the complete sequence of HLA-B27 and of the partial sequence of HLA-Cw1 allows the recognition of locus-specific sequence markers, particularly, but not exclusively, in the transmembrane and cytoplasmic domains.« less

  12. PubDNA Finder: a web database linking full-text articles to sequences of nucleic acids.

    PubMed

    García-Remesal, Miguel; Cuevas, Alejandro; Pérez-Rey, David; Martín, Luis; Anguita, Alberto; de la Iglesia, Diana; de la Calle, Guillermo; Crespo, José; Maojo, Víctor

    2010-11-01

    PubDNA Finder is an online repository that we have created to link PubMed Central manuscripts to the sequences of nucleic acids appearing in them. It extends the search capabilities provided by PubMed Central by enabling researchers to perform advanced searches involving sequences of nucleic acids. This includes, among other features (i) searching for papers mentioning one or more specific sequences of nucleic acids and (ii) retrieving the genetic sequences appearing in different articles. These additional query capabilities are provided by a searchable index that we created by using the full text of the 176 672 papers available at PubMed Central at the time of writing and the sequences of nucleic acids appearing in them. To automatically extract the genetic sequences occurring in each paper, we used an original method we have developed. The database is updated monthly by automatically connecting to the PubMed Central FTP site to retrieve and index new manuscripts. Users can query the database via the web interface provided. PubDNA Finder can be freely accessed at http://servet.dia.fi.upm.es:8080/pubdnafinder

  13. Extracting features from protein sequences to improve deep extreme learning machine for protein fold recognition.

    PubMed

    Ibrahim, Wisam; Abadeh, Mohammad Saniee

    2017-05-21

    Protein fold recognition is an important problem in bioinformatics to predict three-dimensional structure of a protein. One of the most challenging tasks in protein fold recognition problem is the extraction of efficient features from the amino-acid sequences to obtain better classifiers. In this paper, we have proposed six descriptors to extract features from protein sequences. These descriptors are applied in the first stage of a three-stage framework PCA-DELM-LDA to extract feature vectors from the amino-acid sequences. Principal Component Analysis PCA has been implemented to reduce the number of extracted features. The extracted feature vectors have been used with original features to improve the performance of the Deep Extreme Learning Machine DELM in the second stage. Four new features have been extracted from the second stage and used in the third stage by Linear Discriminant Analysis LDA to classify the instances into 27 folds. The proposed framework is implemented on the independent and combined feature sets in SCOP datasets. The experimental results show that extracted feature vectors in the first stage could improve the performance of DELM in extracting new useful features in second stage. Copyright © 2017 Elsevier Ltd. All rights reserved.

  14. Constraints on the evolution of a doublesex target gene arising from doublesex’s pleiotropic deployment

    PubMed Central

    Luo, Shengzhan D.; Baker, Bruce S.

    2015-01-01

    “Regulatory evolution,” that is, changes in a gene’s expression pattern through changes at its regulatory sequence, rather than changes at the coding sequence of the gene or changes of the upstream transcription factors, has been increasingly recognized as a pervasive evolution mechanism. Many somatic sexually dimorphic features of Drosophila melanogaster are the results of gene expression regulated by the doublesex (dsx) gene, which encodes sex-specific transcription factors (DSXF in females and DSXM in males). Rapid changes in such sexually dimorphic features are likely a result of changes at the regulatory sequence of the target genes. We focused on the Flavin-containing monooxygenase-2 (Fmo-2) gene, a likely direct dsx target, to elucidate how sexually dimorphic expression and its evolution are brought about. We found that dsx is deployed to regulate the Fmo-2 transcription both in the midgut and in fat body cells of the spermatheca (a female-specific tissue), through a canonical DSX-binding site in the Fmo-2 regulatory sequence. In the melanogaster group, Fmo-2 transcription in the midgut has evolved rapidly, in contrast to the conserved spermathecal transcription. We identified two cis-regulatory modules (CRM-p and CRM-d) that direct sexually monomorphic or dimorphic Fmo-2 transcription, respectively, in the midguts of these species. Changes of Fmo-2 transcription in the midgut from sexually dimorphic to sexually monomorphic in some species are caused by the loss of CRM-d function, but not the loss of the canonical DSX-binding site. Thus, conferring transcriptional regulation on a CRM level allows the regulation to evolve rapidly in one tissue while evading evolutionary constraints posed by other tissues. PMID:25675536

  15. Accurate prediction of secreted substrates and identification of a conserved putative secretion signal for type III secretion systems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Samudrala, Ram; Heffron, Fred; McDermott, Jason E.

    2009-04-24

    The type III secretion system is an essential component for virulence in many Gram-negative bacteria. Though components of the secretion system apparatus are conserved, its substrates, effector proteins, are not. We have used a machine learning approach to identify new secreted effectors. The method integrates evolutionary measures, such as the pattern of homologs in a range of other organisms, and sequence-based features, such as G+C content, amino acid composition and the N-terminal 30 residues of the protein sequence. The method was trained on known effectors from Salmonella typhimurium and validated on a corresponding set of effectors from Pseudomonas syringae, aftermore » eliminating effectors with detectable sequence similarity. The method was able to identify all of the known effectors in P. syringae with a specificity of 84% and sensitivity of 82%. The reciprocal validation, training on P. syringae and validating on S. typhimurium, gave similar results with a specificity of 86% when the sensitivity level was 87%. These results show that type III effectors in disparate organisms share common features. We found that maximal performance is attained by including an N-terminal sequence of only 30 residues, which agrees with previous studies indicating that this region contains the secretion signal. We then used the method to define the most important residues in this putative secretion signal. Finally, we present novel predictions of secreted effectors in S. typhimurium, some of which have been experimentally validated, and apply the method to predict secreted effectors in the genetically intractable human pathogen Chlamydia trachomatis. This approach is a novel and effective way to identify secreted effectors in a broad range of pathogenic bacteria for further experimental characterization and provides insight into the nature of the type III secretion signal.« less

  16. Memory for sequences of events impaired in typical aging

    PubMed Central

    Allen, Timothy A.; Morris, Andrea M.; Stark, Shauna M.; Fortin, Norbert J.

    2015-01-01

    Typical aging is associated with diminished episodic memory performance. To improve our understanding of the fundamental mechanisms underlying this age-related memory deficit, we previously developed an integrated, cross-species approach to link converging evidence from human and animal research. This novel approach focuses on the ability to remember sequences of events, an important feature of episodic memory. Unlike existing paradigms, this task is nonspatial, nonverbal, and can be used to isolate different cognitive processes that may be differentially affected in aging. Here, we used this task to make a comprehensive comparison of sequence memory performance between younger (18–22 yr) and older adults (62–86 yr). Specifically, participants viewed repeated sequences of six colored, fractal images and indicated whether each item was presented “in sequence” or “out of sequence.” Several out of sequence probe trials were used to provide a detailed assessment of sequence memory, including: (i) repeating an item from earlier in the sequence (“Repeats”; e.g., ABADEF), (ii) skipping ahead in the sequence (“Skips”; e.g., ABDDEF), and (iii) inserting an item from a different sequence into the same ordinal position (“Ordinal Transfers”; e.g., AB3DEF). We found that older adults performed as well as younger controls when tested on well-known and predictable sequences, but were severely impaired when tested using novel sequences. Importantly, overall sequence memory performance in older adults steadily declined with age, a decline not detected with other measures (RAVLT or BPS-O). We further characterized this deficit by showing that performance of older adults was severely impaired on specific probe trials that required detailed knowledge of the sequence (Skips and Ordinal Transfers), and was associated with a shift in their underlying mnemonic representation of the sequences. Collectively, these findings provide unambiguous evidence that the capacity to remember sequences of events is fundamentally affected by typical aging. PMID:25691514

  17. Electromyographic Patterns during Golf Swing: Activation Sequence Profiling and Prediction of Shot Effectiveness.

    PubMed

    Verikas, Antanas; Vaiciukynas, Evaldas; Gelzinis, Adas; Parker, James; Olsson, M Charlotte

    2016-04-23

    This study analyzes muscle activity, recorded in an eight-channel electromyographic (EMG) signal stream, during the golf swing using a 7-iron club and exploits information extracted from EMG dynamics to predict the success of the resulting shot. Muscles of the arm and shoulder on both the left and right sides, namely flexor carpi radialis, extensor digitorum communis, rhomboideus and trapezius, are considered for 15 golf players (∼5 shots each). The method using Gaussian filtering is outlined for EMG onset time estimation in each channel and activation sequence profiling. Shots of each player revealed a persistent pattern of muscle activation. Profiles were plotted and insights with respect to player effectiveness were provided. Inspection of EMG dynamics revealed a pair of highest peaks in each channel as the hallmark of golf swing, and a custom application of peak detection for automatic extraction of swing segment was introduced. Various EMG features, encompassing 22 feature sets, were constructed. Feature sets were used individually and also in decision-level fusion for the prediction of shot effectiveness. The prediction of the target attribute, such as club head speed or ball carry distance, was investigated using random forest as the learner in detection and regression tasks. Detection evaluates the personal effectiveness of a shot with respect to the player-specific average, whereas regression estimates the value of target attribute, using EMG features as predictors. Fusion after decision optimization provided the best results: the equal error rate in detection was 24.3% for the speed and 31.7% for the distance; the mean absolute percentage error in regression was 3.2% for the speed and 6.4% for the distance. Proposed EMG feature sets were found to be useful, especially when used in combination. Rankings of feature sets indicated statistics for muscle activity in both the left and right body sides, correlation-based analysis of EMG dynamics and features derived from the properties of two highest peaks as important predictors of personal shot effectiveness. Activation sequence profiles helped in analyzing muscle orchestration during golf shot, exposing a specific avalanche pattern, but data from more players are needed for stronger conclusions. Results demonstrate that information arising from an EMG signal stream is useful for predicting golf shot success, in terms of club head speed and ball carry distance, with acceptable accuracy. Surface EMG data, collected with a goal to automatically evaluate golf player's performance, enables wearable computing in the field of ambient intelligence and has potential to enhance exercising of a long carry distance drive.

  18. Electromyographic Patterns during Golf Swing: Activation Sequence Profiling and Prediction of Shot Effectiveness

    PubMed Central

    Verikas, Antanas; Vaiciukynas, Evaldas; Gelzinis, Adas; Parker, James; Olsson, M. Charlotte

    2016-01-01

    This study analyzes muscle activity, recorded in an eight-channel electromyographic (EMG) signal stream, during the golf swing using a 7-iron club and exploits information extracted from EMG dynamics to predict the success of the resulting shot. Muscles of the arm and shoulder on both the left and right sides, namely flexor carpi radialis, extensor digitorum communis, rhomboideus and trapezius, are considered for 15 golf players (∼5 shots each). The method using Gaussian filtering is outlined for EMG onset time estimation in each channel and activation sequence profiling. Shots of each player revealed a persistent pattern of muscle activation. Profiles were plotted and insights with respect to player effectiveness were provided. Inspection of EMG dynamics revealed a pair of highest peaks in each channel as the hallmark of golf swing, and a custom application of peak detection for automatic extraction of swing segment was introduced. Various EMG features, encompassing 22 feature sets, were constructed. Feature sets were used individually and also in decision-level fusion for the prediction of shot effectiveness. The prediction of the target attribute, such as club head speed or ball carry distance, was investigated using random forest as the learner in detection and regression tasks. Detection evaluates the personal effectiveness of a shot with respect to the player-specific average, whereas regression estimates the value of target attribute, using EMG features as predictors. Fusion after decision optimization provided the best results: the equal error rate in detection was 24.3% for the speed and 31.7% for the distance; the mean absolute percentage error in regression was 3.2% for the speed and 6.4% for the distance. Proposed EMG feature sets were found to be useful, especially when used in combination. Rankings of feature sets indicated statistics for muscle activity in both the left and right body sides, correlation-based analysis of EMG dynamics and features derived from the properties of two highest peaks as important predictors of personal shot effectiveness. Activation sequence profiles helped in analyzing muscle orchestration during golf shot, exposing a specific avalanche pattern, but data from more players are needed for stronger conclusions. Results demonstrate that information arising from an EMG signal stream is useful for predicting golf shot success, in terms of club head speed and ball carry distance, with acceptable accuracy. Surface EMG data, collected with a goal to automatically evaluate golf player’s performance, enables wearable computing in the field of ambient intelligence and has potential to enhance exercising of a long carry distance drive. PMID:27120604

  19. PrPC Governs Susceptibility to Prion Strains in Bank Vole, While Other Host Factors Modulate Strain Features.

    PubMed

    Espinosa, J C; Nonno, R; Di Bari, M; Aguilar-Calvo, P; Pirisinu, L; Fernández-Borges, N; Vanni, I; Vaccari, G; Marín-Moreno, A; Frassanito, P; Lorenzo, P; Agrimi, U; Torres, J M

    2016-12-01

    Bank vole is a rodent species that shows differential susceptibility to the experimental transmission of different prion strains. In this work, the transmission features of a panel of diverse prions with distinct origins were assayed both in bank vole expressing methionine at codon 109 (Bv109M) and in transgenic mice expressing physiological levels of bank vole PrP C (the BvPrP-Tg407 mouse line). This work is the first systematic comparison of the transmission features of a collection of prion isolates, representing a panel of diverse prion strains, in a transgenic-mouse model and in its natural counterpart. The results showed very similar transmission properties in both the natural species and the transgenic-mouse model, demonstrating the key role of the PrP amino acid sequence in prion transmission susceptibility. However, differences in the PrP Sc types propagated by Bv109M and BvPrP-Tg407 suggest that host factors other than PrP C modulate prion strain features. The differential susceptibility of bank voles to prion strains can be modeled in transgenic mice, suggesting that this selective susceptibility is controlled by the vole PrP sequence alone rather than by other species-specific factors. Differences in the phenotypes observed after prion transmissions in bank voles and in the transgenic mice suggest that host factors other than the PrP C sequence may affect the selection of the substrain replicating in the animal model. Copyright © 2016, American Society for Microbiology. All Rights Reserved.

  20. A TALE-inspired computational screen for proteins that contain approximate tandem repeats.

    PubMed

    Perycz, Malgorzata; Krwawicz, Joanna; Bochtler, Matthias

    2017-01-01

    TAL (transcription activator-like) effectors (TALEs) are bacterial proteins that are secreted from bacteria to plant cells to act as transcriptional activators. TALEs and related proteins (RipTALs, BurrH, MOrTL1 and MOrTL2) contain approximate tandem repeats that differ in conserved positions that define specificity. Using PERL, we screened ~47 million protein sequences for TALE-like architecture characterized by approximate tandem repeats (between 30 and 43 amino acids in length) and sequence variability in conserved positions, without requiring sequence similarity to TALEs. Candidate proteins were scored according to their propensity for nuclear localization, secondary structure, repeat sequence complexity, as well as covariation and predicted structural proximity of variable residues. Biological context was tentatively inferred from co-occurrence of other domains and interactome predictions. Approximate repeats with TALE-like features that merit experimental characterization were found in a protein of chestnut blight fungus, a eukaryotic plant pathogen.

  1. A TALE-inspired computational screen for proteins that contain approximate tandem repeats

    PubMed Central

    Krwawicz, Joanna

    2017-01-01

    TAL (transcription activator-like) effectors (TALEs) are bacterial proteins that are secreted from bacteria to plant cells to act as transcriptional activators. TALEs and related proteins (RipTALs, BurrH, MOrTL1 and MOrTL2) contain approximate tandem repeats that differ in conserved positions that define specificity. Using PERL, we screened ~47 million protein sequences for TALE-like architecture characterized by approximate tandem repeats (between 30 and 43 amino acids in length) and sequence variability in conserved positions, without requiring sequence similarity to TALEs. Candidate proteins were scored according to their propensity for nuclear localization, secondary structure, repeat sequence complexity, as well as covariation and predicted structural proximity of variable residues. Biological context was tentatively inferred from co-occurrence of other domains and interactome predictions. Approximate repeats with TALE-like features that merit experimental characterization were found in a protein of chestnut blight fungus, a eukaryotic plant pathogen. PMID:28617832

  2. Genomic sequences of murine gamma B- and gamma C-crystallin-encoding genes: promoter analysis and complete evolutionary pattern of mouse, rat and human gamma-crystallins.

    PubMed

    Graw, J; Liebstein, A; Pietrowski, D; Schmitt-John, T; Werner, T

    1993-12-22

    The murine genes, gamma B-cry and gamma C-cry, encoding the gamma B- and gamma C-crystallins, were isolated from a genomic DNA library. The complete nucleotide (nt) sequences of both genes were determined from 661 and 711 bp, respectively, upstream from the first exon to the corresponding polyadenylation sites, comprising more than 2650 and 2890 bp, respectively. The new sequences were compared to the partial cDNA sequences available for the murine gamma B-cry and gamma C-cry, as well as to the corresponding genomic sequences from rat and man, at both the nt and predicted amino acid (aa) sequence levels. In the gamma B-cry promoter region, a canonical CCAAT-box, a TATA-box, putative NF-I and C/EBP sites were detected. An R-repeat is inserted 366 bp upstream from the transcription start point. In contrast, the gamma C-cry promoter does not contain a CCAAT-box, but some other putative binding sites for transcription factors (AP-2, UBP-1, LBP-1) were located by computer analysis. The promoter regions of all six gamma-cry from mouse, rat and human, except human psi gamma F-cry, were analyzed for common sequence elements. A complex sequence element of about 70-80 bp was found in the proximal promoter, which contains a gamma-cry-specific and almost invariant sequence (crygpel) of 14 nt, and ends with the also invariant TATA-box. Within the complex sequence element, a minimum of three further features specific for the gamma A-, gamma B- and gamma D/E/F-cry genes can be defined, at least two of which were recently shown to be functional. In addition to these four sequence elements, a subtype-specific structure of inverted repeats with different-sized spacers can be deduced from the multiple sequence alignment. A phylogenetic analysis based on the promoter region, as well as the complete exon 3 of all gamma-cry from mouse, rat and man, suggests separation of only five gamma-cry subtypes (gamma A-, gamma B-, gamma C-, gamma D- and gamma E/F-cry) prior to species separation.

  3. REDIdb: the RNA editing database.

    PubMed

    Picardi, Ernesto; Regina, Teresa Maria Rosaria; Brennicke, Axel; Quagliariello, Carla

    2007-01-01

    The RNA Editing Database (REDIdb) is an interactive, web-based database created and designed with the aim to allocate RNA editing events such as substitutions, insertions and deletions occurring in a wide range of organisms. The database contains both fully and partially sequenced DNA molecules for which editing information is available either by experimental inspection (in vitro) or by computational detection (in silico). Each record of REDIdb is organized in a specific flat-file containing a description of the main characteristics of the entry, a feature table with the editing events and related details and a sequence zone with both the genomic sequence and the corresponding edited transcript. REDIdb is a relational database in which the browsing and identification of editing sites has been simplified by means of two facilities to either graphically display genomic or cDNA sequences or to show the corresponding alignment. In both cases, all editing sites are highlighted in colour and their relative positions are detailed by mousing over. New editing positions can be directly submitted to REDIdb after a user-specific registration to obtain authorized secure access. This first version of REDIdb database stores 9964 editing events and can be freely queried at http://biologia.unical.it/py_script/search.html.

  4. Identification and removal of low-complexity sites in allele-specific analysis of ChIP-seq data.

    PubMed

    Waszak, Sebastian M; Kilpinen, Helena; Gschwind, Andreas R; Orioli, Andrea; Raghav, Sunil K; Witwicki, Robert M; Migliavacca, Eugenia; Yurovsky, Alisa; Lappalainen, Tuuli; Hernandez, Nouria; Reymond, Alexandre; Dermitzakis, Emmanouil T; Deplancke, Bart

    2014-01-15

    High-throughput sequencing technologies enable the genome-wide analysis of the impact of genetic variation on molecular phenotypes at unprecedented resolution. However, although powerful, these technologies can also introduce unexpected artifacts. We investigated the impact of library amplification bias on the identification of allele-specific (AS) molecular events from high-throughput sequencing data derived from chromatin immunoprecipitation assays (ChIP-seq). Putative AS DNA binding activity for RNA polymerase II was determined using ChIP-seq data derived from lymphoblastoid cell lines of two parent-daughter trios. We found that, at high-sequencing depth, many significant AS binding sites suffered from an amplification bias, as evidenced by a larger number of clonal reads representing one of the two alleles. To alleviate this bias, we devised an amplification bias detection strategy, which filters out sites with low read complexity and sites featuring a significant excess of clonal reads. This method will be useful for AS analyses involving ChIP-seq and other functional sequencing assays. The R package abs filter for library clonality simulations and detection of amplification-biased sites is available from http://updepla1srv1.epfl.ch/waszaks/absfilter

  5. Programmable DNA-binding proteins from Burkholderia provide a fresh perspective on the TALE-like repeat domain

    PubMed Central

    de Lange, Orlando; Wolf, Christina; Dietze, Jörn; Elsaesser, Janett; Morbitzer, Robert; Lahaye, Thomas

    2014-01-01

    The tandem repeats of transcription activator like effectors (TALEs) mediate sequence-specific DNA binding using a simple code. Naturally, TALEs are injected by Xanthomonas bacteria into plant cells to manipulate the host transcriptome. In the laboratory TALE DNA binding domains are reprogrammed and used to target a fused functional domain to a genomic locus of choice. Research into the natural diversity of TALE-like proteins may provide resources for the further improvement of current TALE technology. Here we describe TALE-like proteins from the endosymbiotic bacterium Burkholderia rhizoxinica, termed Bat proteins. Bat repeat domains mediate sequence-specific DNA binding with the same code as TALEs, despite less than 40% sequence identity. We show that Bat proteins can be adapted for use as transcription factors and nucleases and that sequence preferences can be reprogrammed. Unlike TALEs, the core repeats of each Bat protein are highly polymorphic. This feature allowed us to explore alternative strategies for the design of custom Bat repeat arrays, providing novel insights into the functional relevance of non-RVD residues. The Bat proteins offer fertile grounds for research into the creation of improved programmable DNA-binding proteins and comparative insights into TALE-like evolution. PMID:24792163

  6. Molecular and bioinformatic analysis of the FB-NOF transposable element.

    PubMed

    Badal, Martí; Portela, Anna; Xamena, Noel; Cabré, Oriol

    2006-04-12

    The Drosophila melanogaster transposable element FB-NOF is known to play a role in genome plasticity through the generation of all sort of genomic rearrangements. Moreover, several insertional mutants due to FB mobilizations have been reported. Its structure and sequence, however, have been poorly studied mainly as a consequence of the long, complex and repetitive sequence of FB inverted repeats. This repetitive region is composed of several 154 bp blocks, each with five almost identical repeats. In this paper, we report the sequencing process of 2 kb long FB inverted repeats of a complete FB-NOF element, with high precision and reliability. This achievement has been possible using a new map of the FB repetitive region, which identifies unambiguously each repeat with new features that can be used as landmarks. With this new vision of the element, a list of FB-NOF in the D. melanogaster genomic clones has been done, improving previous works that used only bioinformatic algorithms. The availability of many FB and FB-NOF sequences allowed an analysis of the FB insertion sequences that showed no sequence specificity, but a preference for A/T rich sequences. The position of NOF into FB is also studied, revealing that it is always located after a second repeat in a random block. With the results of this analysis, we propose a model of transposition in which NOF jumps from FB to FB, using an unidentified transposase enzyme that should specifically recognize the second repeat end of the FB blocks.

  7. Robot acting on moving bodies (RAMBO): Preliminary results

    NASA Technical Reports Server (NTRS)

    Davis, Larry S.; Dementhon, Daniel; Bestul, Thor; Ziavras, Sotirios; Srinivasan, H. V.; Siddalingaiah, Madju; Harwood, David

    1989-01-01

    A robot system called RAMBO is being developed. It is equipped with a camera, which, given a sequence of simple tasks, can perform these tasks on a moving object. RAMBO is given a complete geometric model of the object. A low level vision module extracts and groups characteristic features in images of the object. The positions of the object are determined in a sequence of images, and a motion estimate of the object is obtained. This motion estimate is used to plan trajectories of the robot tool to relative locations nearby the object sufficient for achieving the tasks. More specifically, low level vision uses parallel algorithms for image enchancement by symmetric nearest neighbor filtering, edge detection by local gradient operators, and corner extraction by sector filtering. The object pose estimation is a Hough transform method accumulating position hypotheses obtained by matching triples of image features (corners) to triples of model features. To maximize computing speed, the estimate of the position in space of a triple of features is obtained by decomposing its perspective view into a product of rotations and a scaled orthographic projection. This allows the use of 2-D lookup tables at each stage of the decomposition. The position hypotheses for each possible match of model feature triples and image feature triples are calculated in parallel. Trajectory planning combines heuristic and dynamic programming techniques. Then trajectories are created using parametric cubic splines between initial and goal trajectories. All the parallel algorithms run on a Connection Machine CM-2 with 16K processors.

  8. PSSP-RFE: accurate prediction of protein structural class by recursive feature extraction from PSI-BLAST profile, physical-chemical property and functional annotations.

    PubMed

    Li, Liqi; Cui, Xiang; Yu, Sanjiu; Zhang, Yuan; Luo, Zhong; Yang, Hua; Zhou, Yue; Zheng, Xiaoqi

    2014-01-01

    Protein structure prediction is critical to functional annotation of the massively accumulated biological sequences, which prompts an imperative need for the development of high-throughput technologies. As a first and key step in protein structure prediction, protein structural class prediction becomes an increasingly challenging task. Amongst most homological-based approaches, the accuracies of protein structural class prediction are sufficiently high for high similarity datasets, but still far from being satisfactory for low similarity datasets, i.e., below 40% in pairwise sequence similarity. Therefore, we present a novel method for accurate and reliable protein structural class prediction for both high and low similarity datasets. This method is based on Support Vector Machine (SVM) in conjunction with integrated features from position-specific score matrix (PSSM), PROFEAT and Gene Ontology (GO). A feature selection approach, SVM-RFE, is also used to rank the integrated feature vectors through recursively removing the feature with the lowest ranking score. The definitive top features selected by SVM-RFE are input into the SVM engines to predict the structural class of a query protein. To validate our method, jackknife tests were applied to seven widely used benchmark datasets, reaching overall accuracies between 84.61% and 99.79%, which are significantly higher than those achieved by state-of-the-art tools. These results suggest that our method could serve as an accurate and cost-effective alternative to existing methods in protein structural classification, especially for low similarity datasets.

  9. Pms2 Suppresses Large Expansions of the (GAA·TTC)n Sequence in Neuronal Tissues

    PubMed Central

    Bourn, Rebecka L.; De Biase, Irene; Pinto, Ricardo Mouro; Sandi, Chiranjeevi; Al-Mahdawi, Sahar; Pook, Mark A.; Bidichandani, Sanjay I.

    2012-01-01

    Expanded trinucleotide repeat sequences are the cause of several inherited neurodegenerative diseases. Disease pathogenesis is correlated with several features of somatic instability of these sequences, including further large expansions in postmitotic tissues. The presence of somatic expansions in postmitotic tissues is consistent with DNA repair being a major determinant of somatic instability. Indeed, proteins in the mismatch repair (MMR) pathway are required for instability of the expanded (CAG·CTG)n sequence, likely via recognition of intrastrand hairpins by MutSβ. It is not clear if or how MMR would affect instability of disease-causing expanded trinucleotide repeat sequences that adopt secondary structures other than hairpins, such as the triplex/R-loop forming (GAA·TTC)n sequence that causes Friedreich ataxia. We analyzed somatic instability in transgenic mice that carry an expanded (GAA·TTC)n sequence in the context of the human FXN locus and lack the individual MMR proteins Msh2, Msh6 or Pms2. The absence of Msh2 or Msh6 resulted in a dramatic reduction in somatic mutations, indicating that mammalian MMR promotes instability of the (GAA·TTC)n sequence via MutSα. The absence of Pms2 resulted in increased accumulation of large expansions in the nervous system (cerebellum, cerebrum, and dorsal root ganglia) but not in non-neuronal tissues (heart and kidney), without affecting the prevalence of contractions. Pms2 suppressed large expansions specifically in tissues showing MutSα-dependent somatic instability, suggesting that they may act on the same lesion or structure associated with the expanded (GAA·TTC)n sequence. We conclude that Pms2 specifically suppresses large expansions of a pathogenic trinucleotide repeat sequence in neuronal tissues, possibly acting independently of the canonical MMR pathway. PMID:23071719

  10. Pms2 suppresses large expansions of the (GAA·TTC)n sequence in neuronal tissues.

    PubMed

    Bourn, Rebecka L; De Biase, Irene; Pinto, Ricardo Mouro; Sandi, Chiranjeevi; Al-Mahdawi, Sahar; Pook, Mark A; Bidichandani, Sanjay I

    2012-01-01

    Expanded trinucleotide repeat sequences are the cause of several inherited neurodegenerative diseases. Disease pathogenesis is correlated with several features of somatic instability of these sequences, including further large expansions in postmitotic tissues. The presence of somatic expansions in postmitotic tissues is consistent with DNA repair being a major determinant of somatic instability. Indeed, proteins in the mismatch repair (MMR) pathway are required for instability of the expanded (CAG·CTG)(n) sequence, likely via recognition of intrastrand hairpins by MutSβ. It is not clear if or how MMR would affect instability of disease-causing expanded trinucleotide repeat sequences that adopt secondary structures other than hairpins, such as the triplex/R-loop forming (GAA·TTC)(n) sequence that causes Friedreich ataxia. We analyzed somatic instability in transgenic mice that carry an expanded (GAA·TTC)(n) sequence in the context of the human FXN locus and lack the individual MMR proteins Msh2, Msh6 or Pms2. The absence of Msh2 or Msh6 resulted in a dramatic reduction in somatic mutations, indicating that mammalian MMR promotes instability of the (GAA·TTC)(n) sequence via MutSα. The absence of Pms2 resulted in increased accumulation of large expansions in the nervous system (cerebellum, cerebrum, and dorsal root ganglia) but not in non-neuronal tissues (heart and kidney), without affecting the prevalence of contractions. Pms2 suppressed large expansions specifically in tissues showing MutSα-dependent somatic instability, suggesting that they may act on the same lesion or structure associated with the expanded (GAA·TTC)(n) sequence. We conclude that Pms2 specifically suppresses large expansions of a pathogenic trinucleotide repeat sequence in neuronal tissues, possibly acting independently of the canonical MMR pathway.

  11. Noise-gating to Clean Astrophysical Image Data

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    DeForest, C. E.

    I present a family of algorithms to reduce noise in astrophysical images and image sequences, preserving more information from the original data than is retained by conventional techniques. The family uses locally adaptive filters (“noise gates”) in the Fourier domain to separate coherent image structure from background noise based on the statistics of local neighborhoods in the image. Processing of solar data limited by simple shot noise or by additive noise reveals image structure not easily visible in the originals, preserves photometry of observable features, and reduces shot noise by a factor of 10 or more with little to nomore » apparent loss of resolution. This reveals faint features that were either not directly discernible or not sufficiently strongly detected for quantitative analysis. The method works best on image sequences containing related subjects, for example movies of solar evolution, but is also applicable to single images provided that there are enough pixels. The adaptive filter uses the statistical properties of noise and of local neighborhoods in the data to discriminate between coherent features and incoherent noise without reference to the specific shape or evolution of those features. The technique can potentially be modified in a straightforward way to exploit additional a priori knowledge about the functional form of the noise.« less

  12. Noise-gating to Clean Astrophysical Image Data

    NASA Astrophysics Data System (ADS)

    DeForest, C. E.

    2017-04-01

    I present a family of algorithms to reduce noise in astrophysical images and image sequences, preserving more information from the original data than is retained by conventional techniques. The family uses locally adaptive filters (“noise gates”) in the Fourier domain to separate coherent image structure from background noise based on the statistics of local neighborhoods in the image. Processing of solar data limited by simple shot noise or by additive noise reveals image structure not easily visible in the originals, preserves photometry of observable features, and reduces shot noise by a factor of 10 or more with little to no apparent loss of resolution. This reveals faint features that were either not directly discernible or not sufficiently strongly detected for quantitative analysis. The method works best on image sequences containing related subjects, for example movies of solar evolution, but is also applicable to single images provided that there are enough pixels. The adaptive filter uses the statistical properties of noise and of local neighborhoods in the data to discriminate between coherent features and incoherent noise without reference to the specific shape or evolution of those features. The technique can potentially be modified in a straightforward way to exploit additional a priori knowledge about the functional form of the noise.

  13. Multiple conformations of the cytidine repressor DNA-binding domain coalesce to one upon recognition of a specific DNA surface.

    PubMed

    Moody, Colleen L; Tretyachenko-Ladokhina, Vira; Laue, Thomas M; Senear, Donald F; Cocco, Melanie J

    2011-08-09

    The cytidine repressor (CytR) is a member of the LacR family of bacterial repressors with distinct functional features. The Escherichia coli CytR regulon comprises nine operons whose palindromic operators vary in both sequence and, most significantly, spacing between the recognition half-sites. This suggests a strong likelihood that protein folding would be coupled to DNA binding as a mechanism to accommodate the variety of different operator architectures to which CytR is targeted. Such coupling is a common feature of sequence-specific DNA-binding proteins, including the LacR family repressors; however, there are no significant structural rearrangements upon DNA binding within the three-helix DNA-binding domains (DBDs) studied to date. We used nuclear magnetic resonance (NMR) spectroscopy to characterize the CytR DBD free in solution and to determine the high-resolution structure of a CytR DBD monomer bound specifically to one DNA half-site of the uridine phosphorylase (udp) operator. We find that the free DBD populates multiple distinct conformations distinguished by up to four sets of NMR peaks per residue. This structural heterogeneity is previously unknown in the LacR family. These stable structures coalesce into a single, more stable udp-bound form that features a three-helix bundle containing a canonical helix-turn-helix motif. However, this structure differs from all other LacR family members whose structures are known with regard to the packing of the helices and consequently their relative orientations. Aspects of CytR activity are unique among repressors; we identify here structural properties that are also distinct and that might underlie the different functional properties. © 2011 American Chemical Society

  14. Pseudomonas syringae pv. actinidiae Draft Genomes Comparison Reveal Strain-Specific Features Involved in Adaptation and Virulence to Actinidia Species

    PubMed Central

    Marcelletti, Simone; Ferrante, Patrizia; Petriccione, Milena; Firrao, Giuseppe; Scortichini, Marco

    2011-01-01

    A recent re-emerging bacterial canker disease incited by Pseudomonas syringae pv. actinidiae (Psa) is causing severe economic losses to Actinidia chinensis and A. deliciosa cultivations in southern Europe, New Zealand, Chile and South Korea. Little is known about the genetic features of this pathovar. We generated genome-wide Illumina sequence data from two Psa strains causing outbreaks of bacterial canker on the A. deliciosa cv. Hayward in Japan (J-Psa, type-strain of the pathovar) and in Italy (I-Psa) in 1984 and 1992, respectively as well as from a Psa strain (I2-Psa) isolated at the beginning of the recent epidemic on A. chinensis cv. Hort16A in Italy. All strains were isolated from typical leaf spot symptoms. The phylogenetic relationships revealed that Psa is more closely related to P. s. pv. theae than to P. avellanae within genomospecies 8. Comparative genomic analyses revealed both relevant intrapathovar variations and putative pathovar-specific genomic regions in Psa. The genomic sequences of J-Psa and I-Psa were very similar. Conversely, the I2-Psa genome encodes four additional effector protein genes, lacks a 50 kb plasmid and the phaseolotoxin gene cluster, argK-tox but has acquired a 160 kb plasmid and putative prophage sequences. Several lines of evidence from the analysis of the genome sequences support the hypothesis that this strain did not evolve from the Psa population that caused the epidemics in 1984–1992 in Japan and Italy but rather is the product of a recent independent evolution of the pathovar actinidiae for infecting Actinidia spp. All Psa strains share the genetic potential for copper resistance, antibiotic detoxification, high affinity iron acquisition and detoxification of nitric oxide of plant origin. Similar to other sequenced phytopathogenic pseudomonads associated with woody plant species, the Psa strains isolated from leaves also display a set of genes involved in the catabolism of plant-derived aromatic compounds. PMID:22132095

  15. Identification of mediator complex 26 (Crsp7) gametologs on platypus X1 and Y5 sex chromosomes: a candidate testis-determining gene in monotremes?

    PubMed

    Tsend-Ayush, Enkhjargal; Kortschak, R Daniel; Bernard, Pascal; Lim, Shu Ly; Ryan, Janelle; Rosenkranz, Ruben; Borodina, Tatiana; Dohm, Juliane C; Himmelbauer, Heinz; Harley, Vincent R; Grützner, Frank

    2012-01-01

    The basal lineage of monotremes features an extraordinarily complex sex chromosome system which has provided novel insights into the evolution of mammalian sex chromosomes. Recently, sequence information from autosomes, X chromosomes, and XY-shared pseudoautosomal regions has become available. However, no gene has so far been described on any of the Y chromosome-specific regions. We analyzed sequences derived from Y-specific BAC clones to identify genes with potentially male-specific function. Here, we report the identification and characterization of the mediator complex protein gametologs on platypus Y5 (Crspy). We also identified the X-chromosomal copy which unexpectedly maps to X1 (Crspx). Sequence comparison shows extensive divergence between the X and Y copy, but we found no significant positive selection on either gametolog. Expression analysis shows widespread expression of Crspx. Crspy is expressed exclusively in males with particularly strong expression in testis and kidney. Reporter gene assays to investigate whether Crspx/y can act on the recently discovered mouse Sox9 testis-specific enhancer element did reveal a modest effect together with mouse Sox9 + Sf1, but showed overall no significant upregulation of the reporter gene. This is the first report of a differentiated functional male-specific gene on platypus Y chromosomes, providing new insights into sex chromosome evolution and a candidate gene for male-specific function in monotremes.

  16. Pharyngeal mis-sequencing in dysphagia: characteristics, rehabilitative response, and etiological speculation.

    PubMed

    Huckabee, Maggie-Lee; Lamvik, Kristin; Jones, Richard

    2014-08-15

    Clinical data are submitted as documentation of a pathophysiologic feature of dysphagia termed pharyngeal mis-sequencing and to encourage clinicians and researchers to adopt more critical approaches to diagnosis and treatment planning. Recent clinical experience has identified a cohort of patients who present with an atypical dysphagia not specifically described in the literature: mis-sequenced constriction of the pharynx when swallowing. As a result, they are unable to coordinate streamlined bolus transfer from the pharynx into the esophagus. This mis-sequencing contributes to nasal redirection, aspiration, and, for some, the inability to safely tolerate an oral diet. Sixteen patients (8 females, 8 males), with a mean age of 44 years (range=25-78), had an average time post-onset of 23 months (range=2-72) at initiation of intensive rehabilitation. A 3-channel manometric catheter was used to measure pharyngeal pressure. The average peak-to-peak latency between nadir pressures at sensor-1 and sensor-2 was 15 ms (95% CI, -2 to 33 ms), compared to normative mean latency of 239 ms (95% CI, 215 to 263 ms). Rehabilitative responses are summarized, along with a single detailed case report. It is unclear from these data if pharyngeal mis-sequencing is (i) a pathological feature of impaired motor planning from brainstem damage or (ii) a maladaptive compensation developed in response to chronic dysphagia. Future investigation is needed to provide a full report of pharyngeal mis-sequencing, and the implications on our understanding of underlying neural control of swallowing. Copyright © 2014 Elsevier B.V. All rights reserved.

  17. A reduced amino acid alphabet for understanding and designing protein adaptation to mutation.

    PubMed

    Etchebest, C; Benros, C; Bornot, A; Camproux, A-C; de Brevern, A G

    2007-11-01

    Protein sequence world is considerably larger than structure world. In consequence, numerous non-related sequences may adopt similar 3D folds and different kinds of amino acids may thus be found in similar 3D structures. By grouping together the 20 amino acids into a smaller number of representative residues with similar features, sequence world simplification may be achieved. This clustering hence defines a reduced amino acid alphabet (reduced AAA). Numerous works have shown that protein 3D structures are composed of a limited number of building blocks, defining a structural alphabet. We previously identified such an alphabet composed of 16 representative structural motifs (5-residues length) called Protein Blocks (PBs). This alphabet permits to translate the structure (3D) in sequence of PBs (1D). Based on these two concepts, reduced AAA and PBs, we analyzed the distributions of the different kinds of amino acids and their equivalences in the structural context. Different reduced sets were considered. Recurrent amino acid associations were found in all the local structures while other were specific of some local structures (PBs) (e.g Cysteine, Histidine, Threonine and Serine for the alpha-helix Ncap). Some similar associations are found in other reduced AAAs, e.g Ile with Val, or hydrophobic aromatic residues Trp with Phe and Tyr. We put into evidence interesting alternative associations. This highlights the dependence on the information considered (sequence or structure). This approach, equivalent to a substitution matrix, could be useful for designing protein sequence with different features (for instance adaptation to environment) while preserving mainly the 3D fold.

  18. Landscape of Insertion Polymorphisms in the Human Genome

    PubMed Central

    Onozawa, Masahiro; Goldberg, Liat; Aplan, Peter D.

    2015-01-01

    Nucleotide substitutions, small (<50 bp) insertions or deletions (indels), and large (>50 bp) deletions are well-known causes of genetic variation within the human genome. We recently reported a previously unrecognized form of polymorphic insertions, termed templated sequence insertion polymorphism (TSIP), in which the inserted sequence was templated from a distant genomic region, and was inserted in the genome through reverse transcription of an RNA intermediate. TSIPs can be grouped into two classes based on nucleotide sequence features at the insertion junctions; class 1 TSIPs show target site duplication, polyadenylation, and preference for insertion at a 5′-TTTT/A-3′ sequence, suggesting a LINE-1 based insertion mechanism, whereas class 2 TSIPs show features consistent with repair of a DNA double strand break by nonhomologous end joining. To gain a more complete picture of TSIPs throughout the human population, we evaluated whole-genome sequence from 52 individuals, and identified 171 TSIPs. Most individuals had 25–30 TSIPs, and common (present in >20% of individuals) TSIPs were found in individuals throughout the world, whereas rare TSIPs tended to cluster in specific geographic regions. The number of rare TSIPs was greater than the number of common TSIPs, suggesting that TSIP generation is an ongoing process. Intriguingly, mitochondrial sequences were a frequent template for class 2 insertions, used more commonly than any nuclear chromosome. Similar to single nucleotide polymorphisms and indels, we suspect that these TSIPs may be important for the generation of human diversity and genetic diseases, and can be useful in tracking historical migration of populations. PMID:25745018

  19. A computational analysis of the three isoforms of glutamate dehydrogenase reveals structural features of the isoform EC 1.4.1.4 supporting a key role in ammonium assimilation by plants

    PubMed Central

    Jaspard, Emmanuel

    2006-01-01

    Background There are three isoforms of glutamate dehydrogenase. The isoform EC 1.4.1.4 (GDH4) catalyses glutamate synthesis from 2-oxoglutarate and ammonium, using NAD(P)H. Ammonium assimilation is critical for plant growth. Although GDH4 from animals and prokaryotes are well characterized, there are few data concerning plant GDH4, even from those whose genomes are well annotated. Results A large set of the three GDH isoforms was built resulting in 116 non-redundant full polypeptide sequences. A computational analysis was made to gain more information concerning the structure – function relationship of GDH4 from plants (Eukaryota, Viridiplantae). The tested plant GDH4 sequences were the two ones known to date, those of Chlorella sorokiniana. This analysis revealed several structural features specific of plant GDH4: (i) the lack of a structure called "antenna"; (ii) the NAD(P)-binding motif GAGNVA; and (iii) a second putative coenzyme-binding motif GVLTGKG together with four residues involved in the binding of the reduced form of NADP. Conclusion A number of structural features specific of plant GDH4 have been found. The results reinforce the probable key role of GDH4 in ammonium assimilation by plants. Reviewers This article was reviewed by Tina Bakolitsa (nominated by Eugene Koonin), Martin Jambon (nominated by Laura Landweber), Sandor Pangor and Franck Eisenhaber. PMID:17173671

  20. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sikorski, Johannes; Lapidus, Alla L.; Copeland, A

    Segniliparus rotundus Butler 2005 is the type species of the genus Segniliparus, which is cur-rently the only genus in the corynebacterial family Segniliparaceae. This family is of large in-terest because of a novel late-emerging genus-specific mycolate pattern. The type strain has been isolated from human sputum and is probably an opportunistic pathogen. Here we de-scribe the features of this organism, together with the complete genome sequence and anno-tation. This is the first completed genome sequence of the family Segniliparaceae. The 3,157,527 bp long genome with its 3,081 protein-coding and 52 RNA genes is part of the Genomic Encyclopedia of Bacteriamore » and Archaea project.« less

  1. Image sequence analysis workstation for multipoint motion analysis

    NASA Astrophysics Data System (ADS)

    Mostafavi, Hassan

    1990-08-01

    This paper describes an application-specific engineering workstation designed and developed to analyze motion of objects from video sequences. The system combines the software and hardware environment of a modem graphic-oriented workstation with the digital image acquisition, processing and display techniques. In addition to automation and Increase In throughput of data reduction tasks, the objective of the system Is to provide less invasive methods of measurement by offering the ability to track objects that are more complex than reflective markers. Grey level Image processing and spatial/temporal adaptation of the processing parameters is used for location and tracking of more complex features of objects under uncontrolled lighting and background conditions. The applications of such an automated and noninvasive measurement tool include analysis of the trajectory and attitude of rigid bodies such as human limbs, robots, aircraft in flight, etc. The system's key features are: 1) Acquisition and storage of Image sequences by digitizing and storing real-time video; 2) computer-controlled movie loop playback, freeze frame display, and digital Image enhancement; 3) multiple leading edge tracking in addition to object centroids at up to 60 fields per second from both live input video or a stored Image sequence; 4) model-based estimation and tracking of the six degrees of freedom of a rigid body: 5) field-of-view and spatial calibration: 6) Image sequence and measurement data base management; and 7) offline analysis software for trajectory plotting and statistical analysis.

  2. Nonspatial Sequence Coding in CA1 Neurons

    PubMed Central

    Allen, Timothy A.; Salz, Daniel M.; McKenzie, Sam

    2016-01-01

    The hippocampus is critical to the memory for sequences of events, a defining feature of episodic memory. However, the fundamental neuronal mechanisms underlying this capacity remain elusive. While considerable research indicates hippocampal neurons can represent sequences of locations, direct evidence of coding for the memory of sequential relationships among nonspatial events remains lacking. To address this important issue, we recorded neural activity in CA1 as rats performed a hippocampus-dependent sequence-memory task. Briefly, the task involves the presentation of repeated sequences of odors at a single port and requires rats to identify each item as “in sequence” or “out of sequence”. We report that, while the animals' location and behavior remained constant, hippocampal activity differed depending on the temporal context of items—in this case, whether they were presented in or out of sequence. Some neurons showed this effect across items or sequence positions (general sequence cells), while others exhibited selectivity for specific conjunctions of item and sequence position information (conjunctive sequence cells) or for specific probe types (probe-specific sequence cells). We also found that the temporal context of individual trials could be accurately decoded from the activity of neuronal ensembles, that sequence coding at the single-cell and ensemble level was linked to sequence memory performance, and that slow-gamma oscillations (20–40 Hz) were more strongly modulated by temporal context and performance than theta oscillations (4–12 Hz). These findings provide compelling evidence that sequence coding extends beyond the domain of spatial trajectories and is thus a fundamental function of the hippocampus. SIGNIFICANCE STATEMENT The ability to remember the order of life events depends on the hippocampus, but the underlying neural mechanisms remain poorly understood. Here we addressed this issue by recording neural activity in hippocampal region CA1 while rats performed a nonspatial sequence memory task. We found that hippocampal neurons code for the temporal context of items (whether odors were presented in the correct or incorrect sequential position) and that this activity is linked with memory performance. The discovery of this novel form of temporal coding in hippocampal neurons advances our fundamental understanding of the neurobiology of episodic memory and will serve as a foundation for our cross-species, multitechnique approach aimed at elucidating the neural mechanisms underlying memory impairments in aging and dementia. PMID:26843637

  3. Distinct position-specific sequence features of hexa-peptides that form amyloid-fibrils: application to discriminate between amyloid fibril and amorphous β-aggregate forming peptide sequences

    PubMed Central

    2013-01-01

    Background Comparison of short peptides which form amyloid-fibrils with their homologues that may form amorphous β-aggregates but not fibrils, can aid development of novel amyloid-containing nanomaterials with well defined morphologies and characteristics. The knowledge gained from the comparative analysis could also be applied towards identifying potential aggregation prone regions in proteins, which are important for biotechnology applications or have been implicated in neurodegenerative diseases. In this work we have systematically analyzed a set of 139 amyloid-fibril hexa-peptides along with a highly homologous set of 168 hexa-peptides that do not form amyloid fibrils for their position-wise as well as overall amino acid compositions and averages of 49 selected amino acid properties. Results Amyloid-fibril forming peptides show distinct preferences and avoidances for amino acid residues to occur at each of the six positions. As expected, the amyloid fibril peptides are also more hydrophobic than non-amyloid peptides. We have used the results of this analysis to develop statistical potential energy values for the 20 amino acid residues to occur at each of the six different positions in the hexa-peptides. The distribution of the potential energy values in 139 amyloid and 168 non-amyloid fibrils are distinct and the amyloid-fibril peptides tend to be more stable (lower total potential energy values) than non-amyloid peptides. The average frequency of occurrence of these peptides with lower than specific cutoff energies at different positions is 72% and 50%, respectively. The potential energy values were used to devise a statistical discriminator to distinguish between amyloid-fibril and non-amyloid peptides. Our method could identify the amyloid-fibril forming hexa-peptides to an accuracy of 89%. On the other hand, the accuracy of identifying non-amyloid peptides was only 54%. Further attempts were made to improve the prediction accuracy via machine learning. This resulted in an overall accuracy of 82.7% with the sensitivity and specificity of 81.3% and 83.9%, respectively, in 10-fold cross-validation method. Conclusions Amyloid-fibril forming hexa-peptides show position specific sequence features that are different from those which may form amorphous β-aggregates. These positional preferences are found to be important features for discriminating amyloid-fibril forming peptides from their homologues that don't form amyloid-fibrils. PMID:23815227

  4. First rapid eye movement sleep periods and sleep-onset rapid eye movement periods in sleep-stage sequencing of hypersomnias.

    PubMed

    Drakatos, Panagis; Kosky, Christopher A; Higgins, Sean E; Muza, Rexford T; Williams, Adrian J; Leschziner, Guy D

    2013-09-01

    Discrimination between narcolepsy, idiopathic hypersomnia, and behavior-induced inadequate sleep syndrome (BIISS) is based on clinical features and on specific nocturnal polysomnography (NPSG) and multiple sleep latency test (MSLT) results. However, previous studies have cast doubt on the specificity and sensitivity of these diagnostic tools. Eleven variables of the NPSG were analyzed in 101 patients who were retrospectively diagnosed with narcolepsy with cataplexy (N+C) (n=24), narcolepsy without cataplexy (N-C) (n=38), idiopathic hypersomnia with long sleep period (IHL) (n=21), and BIISS (n=18). Fifteen out of 24 N+C and 8 out of 38 N-C entered the first rapid eye movement (REM) sleep period (FREMP) from sleep stage 1 (N1) or wake (W), though this sleep-stage sequence did not arise in the other patient groups. FREMP stage sequence was a function of REM sleep latency (REML) for both N+C and N-C groups. FREMP stage sequence was not associated with mean sleep latency (MSL) in N+C but was associated in N-C, which implies heterogeneity within the N-C group. REML also was a useful discriminator. Depending on the cutoff period, REML had a sensitivity and specificity of up to 85.5% and 97.4%, respectively. The FREMP stage sequence may be a useful tool in the diagnosis of narcolepsy, particularly in conjunction with sleep-stage sequence analysis of sleep-onset REM periods (SOREMPs) in the MSLT; it also may provide a helpful intermediate phenotype in the clarification of heterogeneity in the N-C diagnostic group. However, larger prospective studies are necessary to confirm these findings. Copyright © 2013 Elsevier B.V. All rights reserved.

  5. Precision of working memory for visual motion sequences and transparent motion surfaces

    PubMed Central

    Zokaei, Nahid; Gorgoraptis, Nikos; Bahrami, Bahador; Bays, Paul M; Husain, Masud

    2012-01-01

    Recent studies investigating working memory for location, colour and orientation support a dynamic resource model. We examined whether this might also apply to motion, using random dot kinematograms (RDKs) presented sequentially or simultaneously. Mean precision for motion direction declined as sequence length increased, with precision being lower for earlier RDKs. Two alternative models of working memory were compared specifically to distinguish between the contributions of different sources of error that corrupt memory (Zhang & Luck (2008) vs. Bays et al (2009)). The latter provided a significantly better fit for the data, revealing that decrease in memory precision for earlier items is explained by an increase in interference from other items in a sequence, rather than random guessing or a temporal decay of information. Misbinding feature attributes is an important source of error in working memory. Precision of memory for motion direction decreased when two RDKs were presented simultaneously as transparent surfaces, compared to sequential RDKs. However, precision was enhanced when one motion surface was prioritized, demonstrating that selective attention can improve recall precision. These results are consistent with a resource model that can be used as a general conceptual framework for understanding working memory across a range of visual features. PMID:22135378

  6. The Paragon Algorithm, a next generation search engine that uses sequence temperature values and feature probabilities to identify peptides from tandem mass spectra.

    PubMed

    Shilov, Ignat V; Seymour, Sean L; Patel, Alpesh A; Loboda, Alex; Tang, Wilfred H; Keating, Sean P; Hunter, Christie L; Nuwaysir, Lydia M; Schaeffer, Daniel A

    2007-09-01

    The Paragon Algorithm, a novel database search engine for the identification of peptides from tandem mass spectrometry data, is presented. Sequence Temperature Values are computed using a sequence tag algorithm, allowing the degree of implication by an MS/MS spectrum of each region of a database to be determined on a continuum. Counter to conventional approaches, features such as modifications, substitutions, and cleavage events are modeled with probabilities rather than by discrete user-controlled settings to consider or not consider a feature. The use of feature probabilities in conjunction with Sequence Temperature Values allows for a very large increase in the effective search space with only a very small increase in the actual number of hypotheses that must be scored. The algorithm has a new kind of user interface that removes the user expertise requirement, presenting control settings in the language of the laboratory that are translated to optimal algorithmic settings. To validate this new algorithm, a comparison with Mascot is presented for a series of analogous searches to explore the relative impact of increasing search space probed with Mascot by relaxing the tryptic digestion conformance requirements from trypsin to semitrypsin to no enzyme and with the Paragon Algorithm using its Rapid mode and Thorough mode with and without tryptic specificity. Although they performed similarly for small search space, dramatic differences were observed in large search space. With the Paragon Algorithm, hundreds of biological and artifact modifications, all possible substitutions, and all levels of conformance to the expected digestion pattern can be searched in a single search step, yet the typical cost in search time is only 2-5 times that of conventional small search space. Despite this large increase in effective search space, there is no drastic loss of discrimination that typically accompanies the exploration of large search space.

  7. IDM-PhyChm-Ens: intelligent decision-making ensemble methodology for classification of human breast cancer using physicochemical properties of amino acids.

    PubMed

    Ali, Safdar; Majid, Abdul; Khan, Asifullah

    2014-04-01

    Development of an accurate and reliable intelligent decision-making method for the construction of cancer diagnosis system is one of the fast growing research areas of health sciences. Such decision-making system can provide adequate information for cancer diagnosis and drug discovery. Descriptors derived from physicochemical properties of protein sequences are very useful for classifying cancerous proteins. Recently, several interesting research studies have been reported on breast cancer classification. To this end, we propose the exploitation of the physicochemical properties of amino acids in protein primary sequences such as hydrophobicity (Hd) and hydrophilicity (Hb) for breast cancer classification. Hd and Hb properties of amino acids, in recent literature, are reported to be quite effective in characterizing the constituent amino acids and are used to study protein foldings, interactions, structures, and sequence-order effects. Especially, using these physicochemical properties, we observed that proline, serine, tyrosine, cysteine, arginine, and asparagine amino acids offer high discrimination between cancerous and healthy proteins. In addition, unlike traditional ensemble classification approaches, the proposed 'IDM-PhyChm-Ens' method was developed by combining the decision spaces of a specific classifier trained on different feature spaces. The different feature spaces used were amino acid composition, split amino acid composition, and pseudo amino acid composition. Consequently, we have exploited different feature spaces using Hd and Hb properties of amino acids to develop an accurate method for classification of cancerous protein sequences. We developed ensemble classifiers using diverse learning algorithms such as random forest (RF), support vector machines (SVM), and K-nearest neighbor (KNN) trained on different feature spaces. We observed that ensemble-RF, in case of cancer classification, performed better than ensemble-SVM and ensemble-KNN. Our analysis demonstrates that ensemble-RF, ensemble-SVM and ensemble-KNN are more effective than their individual counterparts. The proposed 'IDM-PhyChm-Ens' method has shown improved performance compared to existing techniques.

  8. ChIP-nexus: a novel ChIP-exo protocol for improved detection of in vivo transcription factor binding footprints

    PubMed Central

    He, Qiye; Johnston, Jeff; Zeitlinger, Julia

    2014-01-01

    Understanding how eukaryotic enhancers are bound and regulated by specific combinations of transcription factors is still a major challenge. To better map transcription factor binding genome-wide at nucleotide resolution in vivo, we have developed a robust ChIP-exo protocol called ChIP experiments with nucleotide resolution through exonuclease, unique barcode and single ligation (ChIP-nexus), which utilizes an efficient DNA self-circularization step during library preparation. Application of ChIP-nexus to four proteins—human TBP and Drosophila NFkB, Twist and Max— demonstrates that it outperforms existing ChIP protocols in resolution and specificity, pinpoints relevant binding sites within enhancers containing multiple binding motifs and allows the analysis of in vivo binding specificities. Notably, we show that Max frequently interacts with DNA sequences next to its motif, and that this binding pattern correlates with local DNA sequence features such as DNA shape. ChIP-nexus will be broadly applicable to studying in vivo transcription factor binding specificity and its relationship to cis-regulatory changes in humans and model organisms. PMID:25751057

  9. Mitochondrial Telomeres as Molecular Markers for Identification of the Opportunistic Yeast Pathogen Candida parapsilosis

    PubMed Central

    Nosek, Jozef; Tomáška, L'ubomír; Ryčovská, Adriana; Fukuhara, Hiroshi

    2002-01-01

    Recent studies have demonstrated that a large number of organisms carry linear mitochondrial DNA molecules possessing specialized telomeric structures at their ends. Based on this specific structural feature of linear mitochondrial genomes, we have developed an approach for identification of the opportunistic yeast pathogen Candida parapsilosis. The strategy for identification of C. parapsilosis strains is based on PCR amplification of specific DNA sequences derived from the mitochondrial telomere region. This assay is complemented by immunodetection of a protein component of mitochondrial telomeres. The results demonstrate that mitochondrial telomeres represent specific molecular markers with potential applications in yeast diagnostics and taxonomy. PMID:11923346

  10. Prediction of active sites of enzymes by maximum relevance minimum redundancy (mRMR) feature selection.

    PubMed

    Gao, Yu-Fei; Li, Bi-Qing; Cai, Yu-Dong; Feng, Kai-Yan; Li, Zhan-Dong; Jiang, Yang

    2013-01-27

    Identification of catalytic residues plays a key role in understanding how enzymes work. Although numerous computational methods have been developed to predict catalytic residues and active sites, the prediction accuracy remains relatively low with high false positives. In this work, we developed a novel predictor based on the Random Forest algorithm (RF) aided by the maximum relevance minimum redundancy (mRMR) method and incremental feature selection (IFS). We incorporated features of physicochemical/biochemical properties, sequence conservation, residual disorder, secondary structure and solvent accessibility to predict active sites of enzymes and achieved an overall accuracy of 0.885687 and MCC of 0.689226 on an independent test dataset. Feature analysis showed that every category of the features except disorder contributed to the identification of active sites. It was also shown via the site-specific feature analysis that the features derived from the active site itself contributed most to the active site determination. Our prediction method may become a useful tool for identifying the active sites and the key features identified by the paper may provide valuable insights into the mechanism of catalysis.

  11. A better sequence-read simulator program for metagenomics.

    PubMed

    Johnson, Stephen; Trost, Brett; Long, Jeffrey R; Pittet, Vanessa; Kusalik, Anthony

    2014-01-01

    There are many programs available for generating simulated whole-genome shotgun sequence reads. The data generated by many of these programs follow predefined models, which limits their use to the authors' original intentions. For example, many models assume that read lengths follow a uniform or normal distribution. Other programs generate models from actual sequencing data, but are limited to reads from single-genome studies. To our knowledge, there are no programs that allow a user to generate simulated data following non-parametric read-length distributions and quality profiles based on empirically-derived information from metagenomics sequencing data. We present BEAR (Better Emulation for Artificial Reads), a program that uses a machine-learning approach to generate reads with lengths and quality values that closely match empirically-derived distributions. BEAR can emulate reads from various sequencing platforms, including Illumina, 454, and Ion Torrent. BEAR requires minimal user input, as it automatically determines appropriate parameter settings from user-supplied data. BEAR also uses a unique method for deriving run-specific error rates, and extracts useful statistics from the metagenomic data itself, such as quality-error models. Many existing simulators are specific to a particular sequencing technology; however, BEAR is not restricted in this way. Because of its flexibility, BEAR is particularly useful for emulating the behaviour of technologies like Ion Torrent, for which no dedicated sequencing simulators are currently available. BEAR is also the first metagenomic sequencing simulator program that automates the process of generating abundances, which can be an arduous task. BEAR is useful for evaluating data processing tools in genomics. It has many advantages over existing comparable software, such as generating more realistic reads and being independent of sequencing technology, and has features particularly useful for metagenomics work.

  12. Bivalve-specific gene expansion in the pearl oyster genome: implications of adaptation to a sessile lifestyle.

    PubMed

    Takeuchi, Takeshi; Koyanagi, Ryo; Gyoja, Fuki; Kanda, Miyuki; Hisata, Kanako; Fujie, Manabu; Goto, Hiroki; Yamasaki, Shinichi; Nagai, Kiyohito; Morino, Yoshiaki; Miyamoto, Hiroshi; Endo, Kazuyoshi; Endo, Hirotoshi; Nagasawa, Hiromichi; Kinoshita, Shigeharu; Asakawa, Shuichi; Watabe, Shugo; Satoh, Noriyuki; Kawashima, Takeshi

    2016-01-01

    Bivalve molluscs have flourished in marine environments, and many species constitute important aquatic resources. Recently, whole genome sequences from two bivalves, the pearl oyster, Pinctada fucata, and the Pacific oyster, Crassostrea gigas, have been decoded, making it possible to compare genomic sequences among molluscs, and to explore general and lineage-specific genetic features and trends in bivalves. In order to improve the quality of sequence data for these purposes, we have updated the entire P. fucata genome assembly. We present a new genome assembly of the pearl oyster, Pinctada fucata (version 2.0). To update the assembly, we conducted additional sequencing, obtaining accumulated sequence data amounting to 193× the P. fucata genome. Sequence redundancy in contigs that was caused by heterozygosity was removed in silico, which significantly improved subsequent scaffolding. Gene model version 2.0 was generated with the aid of manual gene annotations supplied by the P. fucata research community. Comparison of mollusc and other bilaterian genomes shows that gene arrangements of Hox, ParaHox, and Wnt clusters in the P. fucata genome are similar to those of other molluscs. Like the Pacific oyster, P. fucata possesses many genes involved in environmental responses and in immune defense. Phylogenetic analyses of heat shock protein70 and C1q domain-containing protein families indicate that extensive expansion of genes occurred independently in each lineage. Several gene duplication events prior to the split between the pearl oyster and the Pacific oyster are also evident. In addition, a number of tandem duplications of genes that encode shell matrix proteins are also well characterized in the P. fucata genome. Both the Pinctada and Crassostrea lineages have expanded specific gene families in a lineage-specific manner. Frequent duplication of genes responsible for shell formation in the P. fucata genome explains the diversity of mollusc shell structures. These duplications reveal dynamic genome evolution to forge the complex physiology that enables bivalves to employ a sessile lifestyle in the intertidal zone.

  13. repRNA: a web server for generating various feature vectors of RNA sequences.

    PubMed

    Liu, Bin; Liu, Fule; Fang, Longyun; Wang, Xiaolong; Chou, Kuo-Chen

    2016-02-01

    With the rapid growth of RNA sequences generated in the postgenomic age, it is highly desired to develop a flexible method that can generate various kinds of vectors to represent these sequences by focusing on their different features. This is because nearly all the existing machine-learning methods, such as SVM (support vector machine) and KNN (k-nearest neighbor), can only handle vectors but not sequences. To meet the increasing demands and speed up the genome analyses, we have developed a new web server, called "representations of RNA sequences" (repRNA). Compared with the existing methods, repRNA is much more comprehensive, flexible and powerful, as reflected by the following facts: (1) it can generate 11 different modes of feature vectors for users to choose according to their investigation purposes; (2) it allows users to select the features from 22 built-in physicochemical properties and even those defined by users' own; (3) the resultant feature vectors and the secondary structures of the corresponding RNA sequences can be visualized. The repRNA web server is freely accessible to the public at http://bioinformatics.hitsz.edu.cn/repRNA/ .

  14. Phase-Specific Vocalizations of Male Mice at the Initial Encounter during the Courtship Sequence

    PubMed Central

    Matsumoto, Yui K.; Okanoya, Kazuo

    2016-01-01

    Mice produce ultrasonic vocalizations featuring a variety of syllables. Vocalizations are observed during social interactions. In particular, males produce numerous syllables during courtship. Previous studies have shown that vocalizations change according to sexual behavior, suggesting that males vary their vocalizations depending on the phase of the courtship sequence. To examine this process, we recorded large sets of mouse vocalizations during male–female interactions and acoustically categorized these sounds into 12 vocal types. We found that males emitted predominantly short syllables during the first minute of interaction, more long syllables in the later phases, and mainly harmonic sounds during mounting. These context- and time-dependent changes in vocalization indicate that vocal communication during courtship in mice consists of at least three stages and imply that each vocalization type has a specific role in a phase of the courtship sequence. Our findings suggest that recording for a sufficiently long time and taking the phase of courtship into consideration could provide more insights into the role of vocalization in mouse courtship behavior in future study. PMID:26841117

  15. The Thiamine-Pyrophosphate-Motif

    NASA Technical Reports Server (NTRS)

    Ciszak, Ewa; Dominiak, Paulina

    2004-01-01

    Thiamin pyrophosphate (TPP), a derivative of vitamin B1, is a cofactor for enzymes performing catalysis in pathways of energy production including the well known decarboxylation of a-keto acid dehydrogenases followed by transketolation. TPP-dependent enzymes constitute a structurally and functionally diverse group exhibiting multimeric subunit organization, multiple domains and two chemically equivalent catalytic centers. Annotation of functional TPP-dependcnt enzymes, therefore, has not been trivial due to low sequence similarity related to this complex organization. Our approach to analysis of structures of known TPP-dependent enzymes reveals for the first time features common to this group, which we have termed the TPP-motif. The TPP-motif consists of specific spatial arrangements of structural elements and their specific contacts to provide for a flip-flop, or alternate site, enzymatic mechanism of action. Analysis of structural elements entrained in the flip-flop action displayed by TPP-dependent enzymes reveals a novel definition of the common amino acid sequences. These sequences allow for annotation of TPP-dependent enzymes, thus advancing functional proteomics. Further details of three-dimensional structures of TPP-dependent enzymes will be discussed.

  16. A genomic view of food-related and probiotic Enterococcus strains

    PubMed Central

    Suárez, Nadia; Hormigo, Ricardo; Fadda, Silvina; Saavedra, Lucila

    2017-01-01

    Abstract The study of enterococcal genomes has grown considerably in recent years. While special attention is paid to comparative genomic analysis among clinical relevant isolates, in this study we performed an exhaustive comparative analysis of enterococcal genomes of food origin and/or with potential to be used as probiotics. Beyond common genetic features, we especially aimed to identify those that are specific to enterococcal strains isolated from a certain food-related source as well as features present in a species-specific manner. Thus, the genome sequences of 25 Enterococcus strains, from 7 different species, were examined and compared. Their phylogenetic relationship was reconstructed based on orthologous proteins and whole genomes. Likewise, markers associated with a successful colonization (bacteriocin genes and genomic islands) and genome plasticity (phages and clustered regularly interspaced short palindromic repeats) were investigated for lifestyle specific genetic features. At the same time, a search for antibiotic resistance genes was carried out, since they are of big concern in the food industry. Finally, it was possible to locate 1617 FIGfam families as a core proteome universally present among the genera and to determine that most of the accessory genes code for hypothetical proteins, providing reasonable hints to support their functional characterization. PMID:27773878

  17. MatureP: prediction of secreted proteins with exclusive information from their mature regions.

    PubMed

    Orfanoudaki, Georgia; Markaki, Maria; Chatzi, Katerina; Tsamardinos, Ioannis; Economou, Anastassios

    2017-06-12

    More than a third of the cellular proteome is non-cytoplasmic. Most secretory proteins use the Sec system for export and are targeted to membranes using signal peptides and mature domains. To specifically analyze bacterial mature domain features, we developed MatureP, a classifier that predicts secretory sequences through features exclusively computed from their mature domains. MatureP was trained using Just Add Data Bio, an automated machine learning tool. Mature domains are predicted efficiently with ~92% success, as measured by the Area Under the Receiver Operating Characteristic Curve (AUC). Predictions were validated using experimental datasets of mutated secretory proteins. The features selected by MatureP reveal prominent differences in amino acid content between secreted and cytoplasmic proteins. Amino-terminal mature domain sequences have enhanced disorder, more hydroxyl and polar residues and less hydrophobics. Cytoplasmic proteins have prominent amino-terminal hydrophobic stretches and charged regions downstream. Presumably, secretory mature domains comprise a distinct protein class. They balance properties that promote the necessary flexibility required for the maintenance of non-folded states during targeting and secretion with the ability of post-secretion folding. These findings provide novel insight in protein trafficking, sorting and folding mechanisms and may benefit protein secretion biotechnology.

  18. Efficient moving target analysis for inverse synthetic aperture radar images via joint speeded-up robust features and regular moment

    NASA Astrophysics Data System (ADS)

    Yang, Hongxin; Su, Fulin

    2018-01-01

    We propose a moving target analysis algorithm using speeded-up robust features (SURF) and regular moment in inverse synthetic aperture radar (ISAR) image sequences. In our study, we first extract interest points from ISAR image sequences by SURF. Different from traditional feature point extraction methods, SURF-based feature points are invariant to scattering intensity, target rotation, and image size. Then, we employ a bilateral feature registering model to match these feature points. The feature registering scheme can not only search the isotropic feature points to link the image sequences but also reduce the error matching pairs. After that, the target centroid is detected by regular moment. Consequently, a cost function based on correlation coefficient is adopted to analyze the motion information. Experimental results based on simulated and real data validate the effectiveness and practicability of the proposed method.

  19. Combining peptide recognition specificity and context information for the prediction of the 14-3-3-mediated interactome in S. cerevisiae and H. sapiens.

    PubMed

    Panni, Simona; Montecchi-Palazzi, Luisa; Kiemer, Lars; Cabibbo, Andrea; Paoluzi, Serena; Santonico, Elena; Landgraf, Christiane; Volkmer-Engert, Rudolf; Bachi, Angela; Castagnoli, Luisa; Cesareni, Gianni

    2011-01-01

    Large-scale interaction studies contribute the largest fraction of protein interactions information in databases. However, co-purification of non-specific or indirect ligands, often results in data sets that are affected by a considerable number of false positives. For the fraction of interactions mediated by short linear peptides, we present here a combined experimental and computational strategy for ranking the reliability of the inferred partners. We apply this strategy to the family of 14-3-3 domains. We have first characterized the recognition specificity of this domain family, largely confirming the results of previous analyses, while revealing new features of the preferred sequence context of 14-3-3 phospho-peptide partners. Notably, a proline next to the carboxy side of the phospho-amino acid functions as a potent inhibitor of 14-3-3 binding. The position-specific information about residue preference was encoded in a scoring matrix and two regular expressions. The integration of these three features in a single predictive model outperforms publicly available prediction tools. Next we have combined, by a naïve Bayesian approach, these "peptide features" with "protein features", such as protein co-expression and co-localization. Our approach provides an orthogonal reliability assessment and maps with high confidence the 14-3-3 peptide target on the partner proteins. Copyright © 2011 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  20. Limited number of immunoglobulin VH regions expressed in the mutant rabbit "Alicia".

    PubMed

    DiPietro, L A; Short, J A; Zhai, S K; Kelus, A S; Meier, D; Knight, K L

    1990-06-01

    A unique feature of rabbit Ig is the presence of VH region allotypic specificities. In normal rabbits, more than 80% of circulating immunoglobulin molecules bear the VHa allotypic specificities, al, a2 or a3; the remaining 10% to 20% of immunoglobulin molecules lack VHa allotypic specificities and are designated VHa-. A mutant rabbit designated Alicia, in contrast, has predominantly serum immunoglobulin molecules that lack the VHa allotypic specificities (Kelus and Weiss, Proc. Natl. Acad. Sci. USA 1986. 83: 4883). To study the nature and molecular complexity of VHa- molecules, we cloned and determined the nucleotide sequence of seven cDNA prepared from splenic RNA of an Alicia rabbit. Six of the clones appeared to encode VHa- molecules; the framework regions encoded by these clones were remarkably similar to each other, each having an unusual insertion of four amino acids at position 10. This insertion of four amino acids has been seen in only 2 of 54 sequenced rabbit VH genes. The similarity of the sequences of the six VHa- clones to each other and their dissimilarity to most other VH genes leads us to suggest that the VHa- molecules in Alicia rabbits are derived predominantly from one or a small number of very similar VH genes. Such preferential utilization of a small number of VH genes may explain the allelic inheritance of VH allotypes.

  1. Role of the chromatin landscape and sequence in determining cell type-specific genomic glucocorticoid receptor binding and gene regulation.

    PubMed

    Love, Michael I; Huska, Matthew R; Jurk, Marcel; Schöpflin, Robert; Starick, Stephan R; Schwahn, Kevin; Cooper, Samantha B; Yamamoto, Keith R; Thomas-Chollier, Morgane; Vingron, Martin; Meijsing, Sebastiaan H

    2017-02-28

    The genomic loci bound by the glucocorticoid receptor (GR), a hormone-activated transcription factor, show little overlap between cell types. To study the role of chromatin and sequence in specifying where GR binds, we used Bayesian modeling within the universe of accessible chromatin. Taken together, our results uncovered that although GR preferentially binds accessible chromatin, its binding is biased against accessible chromatin located at promoter regions. This bias can only be explained partially by the presence of fewer GR recognition sequences, arguing for the existence of additional mechanisms that interfere with GR binding at promoters. Therefore, we tested the role of H3K9ac, the chromatin feature with the strongest negative association with GR binding, but found that this correlation does not reflect a causative link. Finally, we find a higher percentage of promoter-proximal GR binding for genes regulated by GR across cell types than for cell type-specific target genes. Given that GR almost exclusively binds accessible chromatin, we propose that cell type-specific regulation by GR preferentially occurs via distal enhancers, whose chromatin accessibility is typically cell type-specific, whereas ubiquitous target gene regulation is more likely to result from binding to promoter regions, which are often accessible regardless of cell type examined. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.

  2. A Method for WD40 Repeat Detection and Secondary Structure Prediction

    PubMed Central

    Wang, Yang; Jiang, Fan; Zhuo, Zhu; Wu, Xian-Hui; Wu, Yun-Dong

    2013-01-01

    WD40-repeat proteins (WD40s), as one of the largest protein families in eukaryotes, play vital roles in assembling protein-protein/DNA/RNA complexes. WD40s fold into similar β-propeller structures despite diversified sequences. A program WDSP (WD40 repeat protein Structure Predictor) has been developed to accurately identify WD40 repeats and predict their secondary structures. The method is designed specifically for WD40 proteins by incorporating both local residue information and non-local family-specific structural features. It overcomes the problem of highly diversified protein sequences and variable loops. In addition, WDSP achieves a better prediction in identifying multiple WD40-domain proteins by taking the global combination of repeats into consideration. In secondary structure prediction, the average Q3 accuracy of WDSP in jack-knife test reaches 93.7%. A disease related protein LRRK2 was used as a representive example to demonstrate the structure prediction. PMID:23776530

  3. Quantitative profiling of immune repertoires for minor lymphocyte counts using unique molecular identifiers.

    PubMed

    Egorov, Evgeny S; Merzlyak, Ekaterina M; Shelenkov, Andrew A; Britanova, Olga V; Sharonov, George V; Staroverov, Dmitriy B; Bolotin, Dmitriy A; Davydov, Alexey N; Barsova, Ekaterina; Lebedev, Yuriy B; Shugay, Mikhail; Chudakov, Dmitriy M

    2015-06-15

    Emerging high-throughput sequencing methods for the analyses of complex structure of TCR and BCR repertoires give a powerful impulse to adaptive immunity studies. However, there are still essential technical obstacles for performing a truly quantitative analysis. Specifically, it remains challenging to obtain comprehensive information on the clonal composition of small lymphocyte populations, such as Ag-specific, functional, or tissue-resident cell subsets isolated by sorting, microdissection, or fine needle aspirates. In this study, we report a robust approach based on unique molecular identifiers that allows profiling Ag receptors for several hundred to thousand lymphocytes while preserving qualitative and quantitative information on clonal composition of the sample. We also describe several general features regarding the data analysis with unique molecular identifiers that are critical for accurate counting of starting molecules in high-throughput sequencing applications. Copyright © 2015 by The American Association of Immunologists, Inc.

  4. Hidden Markov model-derived structural alphabet for proteins: the learning of protein local shapes captures sequence specificity.

    PubMed

    Camproux, A C; Tufféry, P

    2005-08-05

    Understanding and predicting protein structures depend on the complexity and the accuracy of the models used to represent them. We have recently set up a Hidden Markov Model to optimally compress protein three-dimensional conformations into a one-dimensional series of letters of a structural alphabet. Such a model learns simultaneously the shape of representative structural letters describing the local conformation and the logic of their connections, i.e. the transition matrix between the letters. Here, we move one step further and report some evidence that such a model of protein local architecture also captures some accurate amino acid features. All the letters have specific and distinct amino acid distributions. Moreover, we show that words of amino acids can have significant propensities for some letters. Perspectives point towards the prediction of the series of letters describing the structure of a protein from its amino acid sequence.

  5. Outcomes of Diagnostic Exome Sequencing in Patients With Diagnosed or Suspected Autism Spectrum Disorders.

    PubMed

    Rossi, Mari; El-Khechen, Dima; Black, Mary Helen; Farwell Hagman, Kelly D; Tang, Sha; Powis, Zöe

    2017-05-01

    Exome sequencing has recently been proved to be a successful diagnostic method for complex neurodevelopmental disorders. However, the diagnostic yield of exome sequencing for autism spectrum disorders has not been extensively evaluated in large cohorts to date. We performed diagnostic exome sequencing in a cohort of 163 individuals with autism spectrum disorder (66.3%) or autistic features (33.7%). The diagnostic yield observed in patients in our cohort was 25.8% (42 of 163) for positive or likely positive findings in characterized disease genes, while a candidate genetic etiology was reported for an additional 3.3% (4 of 120) of patients. Among the positive findings in the patients with autism spectrum disorder or autistic features, 61.9% were the result of de novo mutations. Patients presenting with psychiatric conditions or ataxia or paraplegia in addition to autism spectrum disorder or autistic features were significantly more likely to receive positive results compared with patients without these clinical features (95.6% vs 27.1%, P < 0.0001; 83.3% vs 21.2%, P < 0.0001, respectively). The majority of the positive findings were in recently identified autism spectrum disorder genes, supporting the importance of diagnostic exome sequencing for patients with autism spectrum disorder or autistic features as the causative genes might evade traditional sequential or panel testing. These results suggest that diagnostic exome sequencing would be an efficient primary diagnostic method for patients with autism spectrum disorders or autistic features. Moreover, our data may aid clinicians to better determine which subset of patients with autism spectrum disorder with additional clinical features would benefit the most from diagnostic exome sequencing. Copyright © 2017 The Authors. Published by Elsevier Inc. All rights reserved.

  6. Inferring gene expression from ribosomal promoter sequences, a crowdsourcing approach

    PubMed Central

    Meyer, Pablo; Siwo, Geoffrey; Zeevi, Danny; Sharon, Eilon; Norel, Raquel; Segal, Eran; Stolovitzky, Gustavo; Siwo, Geoffrey; Rider, Andrew K.; Tan, Asako; Pinapati, Richard S.; Emrich, Scott; Chawla, Nitesh; Ferdig, Michael T.; Tung, Yi-An; Chen, Yong-Syuan; Chen, Mei-Ju May; Chen, Chien-Yu; Knight, Jason M.; Sahraeian, Sayed Mohammad Ebrahim; Esfahani, Mohammad Shahrokh; Dreos, Rene; Bucher, Philipp; Maier, Ezekiel; Saeys, Yvan; Szczurek, Ewa; Myšičková, Alena; Vingron, Martin; Klein, Holger; Kiełbasa, Szymon M.; Knisley, Jeff; Bonnell, Jeff; Knisley, Debra; Kursa, Miron B.; Rudnicki, Witold R.; Bhattacharjee, Madhuchhanda; Sillanpää, Mikko J.; Yeung, James; Meysman, Pieter; Rodríguez, Aminael Sánchez; Engelen, Kristof; Marchal, Kathleen; Huang, Yezhou; Mordelet, Fantine; Hartemink, Alexander; Pinello, Luca; Yuan, Guo-Cheng

    2013-01-01

    The Gene Promoter Expression Prediction challenge consisted of predicting gene expression from promoter sequences in a previously unknown experimentally generated data set. The challenge was presented to the community in the framework of the sixth Dialogue for Reverse Engineering Assessments and Methods (DREAM6), a community effort to evaluate the status of systems biology modeling methodologies. Nucleotide-specific promoter activity was obtained by measuring fluorescence from promoter sequences fused upstream of a gene for yellow fluorescence protein and inserted in the same genomic site of yeast Saccharomyces cerevisiae. Twenty-one teams submitted results predicting the expression levels of 53 different promoters from yeast ribosomal protein genes. Analysis of participant predictions shows that accurate values for low-expressed and mutated promoters were difficult to obtain, although in the latter case, only when the mutation induced a large change in promoter activity compared to the wild-type sequence. As in previous DREAM challenges, we found that aggregation of participant predictions provided robust results, but did not fare better than the three best algorithms. Finally, this study not only provides a benchmark for the assessment of methods predicting activity of a specific set of promoters from their sequence, but it also shows that the top performing algorithm, which used machine-learning approaches, can be improved by the addition of biological features such as transcription factor binding sites. PMID:23950146

  7. STITCHER: A web resource for high-throughput design of primers for overlapping PCR applications.

    PubMed

    O'Halloran, Damien M

    2015-06-01

    Overlapping PCR is routinely used in a wide number of molecular applications. These include stitching PCR fragments together, generating fluorescent transcriptional and translational fusions, inserting mutations, making deletions, and PCR cloning. Overlapping PCR is also used for genotyping by traditional PCR techniques and in detection experiments using techniques such as loop-mediated isothermal amplification (LAMP). STITCHER is a web tool providing a central resource for researchers conducting all types of overlapping PCR experiments with an intuitive interface for automated primer design that's fast, easy to use, and freely available online (http://ohalloranlab.net/STITCHER.html). STITCHER can handle both single sequence and multi-sequence input, and specific features facilitate numerous other PCR applications, including assembly PCR, adapter PCR, and primer walking. Field PCR, and in particular, LAMP, offers promise as an on site tool for pathogen detection in underdeveloped areas, and STITCHER includes off-target detection features for pathogens commonly targeted using LAMP technology.

  8. Protein location prediction using atomic composition and global features of the amino acid sequence

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Cherian, Betsy Sheena, E-mail: betsy.skb@gmail.com; Nair, Achuthsankar S.

    2010-01-22

    Subcellular location of protein is constructive information in determining its function, screening for drug candidates, vaccine design, annotation of gene products and in selecting relevant proteins for further studies. Computational prediction of subcellular localization deals with predicting the location of a protein from its amino acid sequence. For a computational localization prediction method to be more accurate, it should exploit all possible relevant biological features that contribute to the subcellular localization. In this work, we extracted the biological features from the full length protein sequence to incorporate more biological information. A new biological feature, distribution of atomic composition is effectivelymore » used with, multiple physiochemical properties, amino acid composition, three part amino acid composition, and sequence similarity for predicting the subcellular location of the protein. Support Vector Machines are designed for four modules and prediction is made by a weighted voting system. Our system makes prediction with an accuracy of 100, 82.47, 88.81 for self-consistency test, jackknife test and independent data test respectively. Our results provide evidence that the prediction based on the biological features derived from the full length amino acid sequence gives better accuracy than those derived from N-terminal alone. Considering the features as a distribution within the entire sequence will bring out underlying property distribution to a greater detail to enhance the prediction accuracy.« less

  9. Subtlety of Ambient-Language Effects in Babbling: A Study of English- and Chinese-Learning Infants at 8, 10, and 12 Months

    PubMed Central

    Lee, Chia-Cheng; Jhang, Yuna; Chen, Li-mei; Relyea, George; Oller, D. Kimbrough

    2016-01-01

    Prior research on ambient-language effects in babbling has often suggested infants produce language-specific phonological features within the first year. These results have been questioned in research failing to find such effects and challenging the positive findings on methodological grounds. We studied English- and Chinese-learning infants at 8, 10, and 12 months and found listeners could not detect ambient-language effects in the vast majority of infant utterances, but only in items deemed to be words or to contain canonical syllables that may have made them sound like words with language-specific shapes. Thus, the present research suggests the earliest ambient-language effects may be found in emerging lexical items or in utterances influenced by language-specific features of lexical items. Even the ambient-language effects for infant canonical syllables and words were very small compared with ambient-language effects for meaningless but phonotactically well-formed syllable sequences spoken by adult native speakers of English and Chinese. PMID:28496393

  10. Fold assessment for comparative protein structure modeling.

    PubMed

    Melo, Francisco; Sali, Andrej

    2007-11-01

    Accurate and automated assessment of both geometrical errors and incompleteness of comparative protein structure models is necessary for an adequate use of the models. Here, we describe a composite score for discriminating between models with the correct and incorrect fold. To find an accurate composite score, we designed and applied a genetic algorithm method that searched for a most informative subset of 21 input model features as well as their optimized nonlinear transformation into the composite score. The 21 input features included various statistical potential scores, stereochemistry quality descriptors, sequence alignment scores, geometrical descriptors, and measures of protein packing. The optimized composite score was found to depend on (1) a statistical potential z-score for residue accessibilities and distances, (2) model compactness, and (3) percentage sequence identity of the alignment used to build the model. The accuracy of the composite score was compared with the accuracy of assessment by single and combined features as well as by other commonly used assessment methods. The testing set was representative of models produced by automated comparative modeling on a genomic scale. The composite score performed better than any other tested score in terms of the maximum correct classification rate (i.e., 3.3% false positives and 2.5% false negatives) as well as the sensitivity and specificity across the whole range of thresholds. The composite score was implemented in our program MODELLER-8 and was used to assess models in the MODBASE database that contains comparative models for domains in approximately 1.3 million protein sequences.

  11. Differential correlation for sequencing data.

    PubMed

    Siska, Charlotte; Kechris, Katerina

    2017-01-19

    Several methods have been developed to identify differential correlation (DC) between pairs of molecular features from -omics studies. Most DC methods have only been tested with microarrays and other platforms producing continuous and Gaussian-like data. Sequencing data is in the form of counts, often modeled with a negative binomial distribution making it difficult to apply standard correlation metrics. We have developed an R package for identifying DC called Discordant which uses mixture models for correlations between features and the Expectation Maximization (EM) algorithm for fitting parameters of the mixture model. Several correlation metrics for sequencing data are provided and tested using simulations. Other extensions in the Discordant package include additional modeling for different types of differential correlation, and faster implementation, using a subsampling routine to reduce run-time and address the assumption of independence between molecular feature pairs. With simulations and breast cancer miRNA-Seq and RNA-Seq data, we find that Spearman's correlation has the best performance among the tested correlation methods for identifying differential correlation. Application of Spearman's correlation in the Discordant method demonstrated the most power in ROC curves and sensitivity/specificity plots, and improved ability to identify experimentally validated breast cancer miRNA. We also considered including additional types of differential correlation, which showed a slight reduction in power due to the additional parameters that need to be estimated, but more versatility in applications. Finally, subsampling within the EM algorithm considerably decreased run-time with negligible effect on performance. A new method and R package called Discordant is presented for identifying differential correlation with sequencing data. Based on comparisons with different correlation metrics, this study suggests Spearman's correlation is appropriate for sequencing data, but other correlation metrics are available to the user depending on the application and data type. The Discordant method can also be extended to investigate additional DC types and subsampling with the EM algorithm is now available for reduced run-time. These extensions to the R package make Discordant more robust and versatile for multiple -omics studies.

  12. Spatiotemporal Patterns of Contact Across the Rat Vibrissal Array During Exploratory Behavior

    PubMed Central

    Hobbs, Jennifer A.; Towal, R. Blythe; Hartmann, Mitra J. Z.

    2016-01-01

    The rat vibrissal system is an important model for the study of somatosensation, but the small size and rapid speed of the vibrissae have precluded measuring precise vibrissal-object contact sequences during behavior. We used a laser light sheet to quantify, with 1 ms resolution, the spatiotemporal structure of whisker-surface contact as five naïve rats freely explored a flat, vertical glass wall. Consistent with previous work, we show that the whisk cycle cannot be uniquely defined because different whiskers often move asynchronously, but that quasi-periodic (~8 Hz) variations in head velocity represent a distinct temporal feature on which to lock analysis. Around times of minimum head velocity, whiskers protract to make contact with the surface, and then sustain contact with the surface for extended durations (~25–60 ms) before detaching. This behavior results in discrete temporal windows in which large numbers of whiskers are in contact with the surface. These “sustained collective contact intervals” (SCCIs) were observed on 100% of whisks for all five rats. The overall spatiotemporal structure of the SCCIs can be qualitatively predicted based on information about head pose and the average whisk cycle. In contrast, precise sequences of whisker-surface contact depend on detailed head and whisker kinematics. Sequences of vibrissal contact were highly variable, equally likely to propagate in all directions across the array. Somewhat more structure was found when sequences of contacts were examined on a row-wise basis. In striking contrast to the high variability associated with contact sequences, a consistent feature of each SCCI was that the contact locations of the whiskers on the glass converged and moved more slowly on the sheet. Together, these findings lead us to propose that the rat uses a strategy of “windowed sampling” to extract an object's spatial features: specifically, the rat spatially integrates quasi-static mechanical signals across whiskers during the period of sustained contact, resembling an “enclosing” haptic procedure. PMID:26778990

  13. Automatic summarization of changes in biological image sequences using algorithmic information theory.

    PubMed

    Cohen, Andrew R; Bjornsson, Christopher S; Temple, Sally; Banker, Gary; Roysam, Badrinath

    2009-08-01

    An algorithmic information-theoretic method is presented for object-level summarization of meaningful changes in image sequences. Object extraction and tracking data are represented as an attributed tracking graph (ATG). Time courses of object states are compared using an adaptive information distance measure, aided by a closed-form multidimensional quantization. The notion of meaningful summarization is captured by using the gap statistic to estimate the randomness deficiency from algorithmic statistics. The summary is the clustering result and feature subset that maximize the gap statistic. This approach was validated on four bioimaging applications: 1) It was applied to a synthetic data set containing two populations of cells differing in the rate of growth, for which it correctly identified the two populations and the single feature out of 23 that separated them; 2) it was applied to 59 movies of three types of neuroprosthetic devices being inserted in the brain tissue at three speeds each, for which it correctly identified insertion speed as the primary factor affecting tissue strain; 3) when applied to movies of cultured neural progenitor cells, it correctly distinguished neurons from progenitors without requiring the use of a fixative stain; and 4) when analyzing intracellular molecular transport in cultured neurons undergoing axon specification, it automatically confirmed the role of kinesins in axon specification.

  14. Biochemical Regulatory Features of Activation-Induced Cytidine Deaminase Remain Conserved from Lampreys to Humans

    PubMed Central

    King, Justin J.; Amemiya, Chris T.; Hsu, Ellen

    2017-01-01

    ABSTRACT Activation-induced cytidine deaminase (AID) is a genome-mutating enzyme that initiates class switch recombination and somatic hypermutation of antibodies in jawed vertebrates. We previously described the biochemical properties of human AID and found that it is an unusual enzyme in that it exhibits binding affinities for its substrate DNA and catalytic rates several orders of magnitude higher and lower, respectively, than a typical enzyme. Recently, we solved the functional structure of AID and demonstrated that these properties are due to nonspecific DNA binding on its surface, along with a catalytic pocket that predominantly assumes a closed conformation. Here we investigated the biochemical properties of AID from a sea lamprey, nurse shark, tetraodon, and coelacanth: representative species chosen because their lineages diverged at the earliest critical junctures in evolution of adaptive immunity. We found that these earliest-diverged AID orthologs are active cytidine deaminases that exhibit unique substrate specificities and thermosensitivities. Significant amino acid sequence divergence among these AID orthologs is predicted to manifest as notable structural differences. However, despite major differences in sequence specificities, thermosensitivities, and structural features, all orthologs share the unusually high DNA binding affinities and low catalytic rates. This absolute conservation is evidence for biological significance of these unique biochemical properties. PMID:28716949

  15. Predicting Cell Association of Surface-Modified Nanoparticles Using Protein Corona Structure - Activity Relationships (PCSAR).

    PubMed

    Kamath, Padmaja; Fernandez, Alberto; Giralt, Francesc; Rallo, Robert

    2015-01-01

    Nanoparticles are likely to interact in real-case application scenarios with mixtures of proteins and biomolecules that will absorb onto their surface forming the so-called protein corona. Information related to the composition of the protein corona and net cell association was collected from literature for a library of surface-modified gold and silver nanoparticles. For each protein in the corona, sequence information was extracted and used to calculate physicochemical properties and statistical descriptors. Data cleaning and preprocessing techniques including statistical analysis and feature selection methods were applied to remove highly correlated, redundant and non-significant features. A weighting technique was applied to construct specific signatures that represent the corona composition for each nanoparticle. Using this basic set of protein descriptors, a new Protein Corona Structure-Activity Relationship (PCSAR) that relates net cell association with the physicochemical descriptors of the proteins that form the corona was developed and validated. The features that resulted from the feature selection were in line with already published literature, and the computational model constructed on these features had a good accuracy (R(2)LOO=0.76 and R(2)LMO(25%)=0.72) and stability, with the advantage that the fingerprints based on physicochemical descriptors were independent of the specific proteins that form the corona.

  16. The building blocks of a 'Liveable Neighbourhood': Identifying the key performance indicators for walking of an operational planning policy in Perth, Western Australia.

    PubMed

    Hooper, Paula; Knuiman, Matthew; Foster, Sarah; Giles-Corti, Billie

    2015-11-01

    Planning policy makers are requesting clearer guidance on the key design features required to build neighbourhoods that promote active living. Using a backwards stepwise elimination procedure (logistic regression with generalised estimating equations adjusting for demographic characteristics, self-selection factors, stage of construction and scale of development) this study identified specific design features (n=16) from an operational planning policy ("Liveable Neighbourhoods") that showed the strongest associations with walking behaviours (measured using the Neighbourhood Physical Activity Questionnaire). The interacting effects of design features on walking behaviours were also investigated. The urban design features identified were grouped into the "building blocks of a Liveable Neighbourhood", reflecting the scale, importance and sequencing of the design and implementation phases required to create walkable, pedestrian friendly developments. Copyright © 2015 Elsevier Ltd. All rights reserved.

  17. Structural and phylogenetic analysis of Rhodobacter capsulatus NifF: uncovering general features of nitrogen-fixation (nif)-flavodoxins.

    PubMed

    Pérez-Dorado, Inmaculada; Bortolotti, Ana; Cortez, Néstor; Hermoso, Juan A

    2013-01-09

    Analysis of the crystal structure of NifF from Rhodobacter capsulatus and its homologues reported so far reflects the existence of unique structural features in nif flavodoxins: a leucine at the re face of the isoalloxazine, an eight-residue insertion at the C-terminus of the 50's loop and a remarkable difference in the electrostatic potential surface with respect to non-nif flavodoxins. A phylogenetic study on 64 sequences from 52 bacterial species revealed four clusters, including different functional prototypes, correlating the previously defined as "short-chain" with the firmicutes flavodoxins and the "long-chain" with gram-negative species. The comparison of Rhodobacter NifF structure with other bacterial flavodoxin prototypes discloses the concurrence of specific features of these functional electron donors to nitrogenase.

  18. Networking Biology: The Origins of Sequence-Sharing Practices in Genomics.

    PubMed

    Stevens, Hallam

    2015-10-01

    The wide sharing of biological data, especially nucleotide sequences, is now considered to be a key feature of genomics. Historians and sociologists have attempted to account for the rise of this sharing by pointing to precedents in model organism communities and in natural history. This article supplements these approaches by examining the role that electronic networking technologies played in generating the specific forms of sharing that emerged in genomics. The links between early computer users at the Stanford Artificial Intelligence Laboratory in the 1960s, biologists using local computer networks in the 1970s, and GenBank in the 1980s, show how networking technologies carried particular practices of communication, circulation, and data distribution from computing into biology. In particular, networking practices helped to transform sequences themselves into objects that had value as a community resource.

  19. Evolution of the arginase fold and functional diversity

    PubMed Central

    Dowling, Daniel P.; Costanzo, Luigi Di; Gennadios, Heather A.; Christianson, David W.

    2009-01-01

    The large number of protein structures deposited in the Protein Data Bank allows for the identification of novel structural superfamilies based on conservation of fold in addition to conservation of amino acid sequence. Since sequence diverges more rapidly than fold in protein evolution, proteins with little or no significant sequence identity are occasionally observed to adopt similar folds, thereby reflecting unanticipated evolutionary relationships. Here, we review the unique α/β fold first observed in the manganese metalloenzyme rat liver arginase, consisting of a parallel 8 stranded β-sheet surrounded by several helices, and its evolutionary relationship with the zinc-requiring and/or iron-requiring histone deacetylases and acetylpolyamine amidohydrolases. Structural comparisons reveal key features of the core α/β fold that contribute to the divergent metal ion specificity and stoichiometry required for the chemical and biological functions of these enzymes. PMID:18360740

  20. Thermodynamics-Based Models of Transcriptional Regulation by Enhancers: The Roles of Synergistic Activation, Cooperative Binding and Short-Range Repression

    PubMed Central

    He, Xin; Samee, Md. Abul Hassan; Blatti, Charles; Sinha, Saurabh

    2010-01-01

    Quantitative models of cis-regulatory activity have the potential to improve our mechanistic understanding of transcriptional regulation. However, the few models available today have been based on simplistic assumptions about the sequences being modeled, or heuristic approximations of the underlying regulatory mechanisms. We have developed a thermodynamics-based model to predict gene expression driven by any DNA sequence, as a function of transcription factor concentrations and their DNA-binding specificities. It uses statistical thermodynamics theory to model not only protein-DNA interaction, but also the effect of DNA-bound activators and repressors on gene expression. In addition, the model incorporates mechanistic features such as synergistic effect of multiple activators, short range repression, and cooperativity in transcription factor-DNA binding, allowing us to systematically evaluate the significance of these features in the context of available expression data. Using this model on segmentation-related enhancers in Drosophila, we find that transcriptional synergy due to simultaneous action of multiple activators helps explain the data beyond what can be explained by cooperative DNA-binding alone. We find clear support for the phenomenon of short-range repression, where repressors do not directly interact with the basal transcriptional machinery. We also find that the binding sites contributing to an enhancer's function may not be conserved during evolution, and a noticeable fraction of these undergo lineage-specific changes. Our implementation of the model, called GEMSTAT, is the first publicly available program for simultaneously modeling the regulatory activities of a given set of sequences. PMID:20862354

  1. Prediction of siRNA potency using sparse logistic regression.

    PubMed

    Hu, Wei; Hu, John

    2014-06-01

    RNA interference (RNAi) can modulate gene expression at post-transcriptional as well as transcriptional levels. Short interfering RNA (siRNA) serves as a trigger for the RNAi gene inhibition mechanism, and therefore is a crucial intermediate step in RNAi. There have been extensive studies to identify the sequence characteristics of potent siRNAs. One such study built a linear model using LASSO (Least Absolute Shrinkage and Selection Operator) to measure the contribution of each siRNA sequence feature. This model is simple and interpretable, but it requires a large number of nonzero weights. We have introduced a novel technique, sparse logistic regression, to build a linear model using single-position specific nucleotide compositions which has the same prediction accuracy of the linear model based on LASSO. The weights in our new model share the same general trend as those in the previous model, but have only 25 nonzero weights out of a total 84 weights, a 54% reduction compared to the previous model. Contrary to the linear model based on LASSO, our model suggests that only a few positions are influential on the efficacy of the siRNA, which are the 5' and 3' ends and the seed region of siRNA sequences. We also employed sparse logistic regression to build a linear model using dual-position specific nucleotide compositions, a task LASSO is not able to accomplish well due to its high dimensional nature. Our results demonstrate the superiority of sparse logistic regression as a technique for both feature selection and regression over LASSO in the context of siRNA design.

  2. A Comprehensive Analysis of Transcript-Supported De Novo Genes in Saccharomyces sensu stricto Yeasts

    PubMed Central

    Lu, Tzu-Chiao; Leu, Jun-Yi; Lin, Wen-Chang

    2017-01-01

    Abstract Novel genes arising from random DNA sequences (de novo genes) have been suggested to be widespread in the genomes of different organisms. However, our knowledge about the origin and evolution of de novo genes is still limited. To systematically understand the general features of de novo genes, we established a robust pipeline to analyze >20,000 transcript-supported coding sequences (CDSs) from the budding yeast Saccharomyces cerevisiae. Our analysis pipeline combined phylogeny, synteny, and sequence alignment information to identify possible orthologs across 20 Saccharomycetaceae yeasts and discovered 4,340 S. cerevisiae-specific de novo genes and 8,871 S. sensu stricto-specific de novo genes. We further combine information on CDS positions and transcript structures to show that >65% of de novo genes arose from transcript isoforms of ancient genes, especially in the upstream and internal regions of ancient genes. Fourteen identified de novo genes with high transcript levels were chosen to verify their protein expressions. Ten of them, including eight transcript isoform-associated CDSs, showed translation signals and five proteins exhibited specific cytosolic localizations. Our results suggest that de novo genes frequently arise in the S. sensu stricto complex and have the potential to be quickly integrated into ancient cellular network. PMID:28981695

  3. Sequence organization and evolutionary dynamics of Brachypodium-specific centromere retrotransposons.

    PubMed

    Qi, L L; Wu, J J; Friebe, B; Qian, C; Gu, Y Q; Fu, D L; Gill, B S

    2013-08-01

    Brachypodium distachyon is a wild annual grass belonging to the Pooideae, more closely related to wheat, barley, and forage grasses than rice and maize. As an experimental model, the completed genome sequence of B. distachyon provides a unique opportunity to study centromere evolution during the speciation of grasses. Centromeric satellite sequences have been identified in B. distachyon, but little is known about centromeric retrotransposons in this species. In the present study, bacterial artificial chromosome (BAC)-fluorescence in situ hybridization was conducted in maize, rice, barley, wheat, and rye using B. distachyon (Bd) centromere-specific BAC clones. Eight Bd centromeric BAC clones gave no detectable fluorescence in situ hybridization (FISH) signals on the chromosomes of rice and maize, and three of them also did not yield any FISH signals in barley, wheat, and rye. In addition, four of five Triticeae centromeric BAC clones did not hybridize to the B. distachyon centromeres, implying certain unique features of Brachypodium centromeres. Analysis of Brachypodium centromeric BAC sequences identified a long terminal repeat (LTR)-centromere retrotransposon of B. distachyon (CRBd1). This element was found in high copy number accounting for 1.6 % of the B. distachyon genome, and is enriched in Brachypodium centromeric regions. CRBd1 accumulated in active centromeres, but was lost from inactive ones. The LTR of CRBd1 appears to be specific to B. distachyon centromeres. These results reveal different evolutionary events of this retrotransposon family across grass species.

  4. Multi-level machine learning prediction of protein-protein interactions in Saccharomyces cerevisiae.

    PubMed

    Zubek, Julian; Tatjewski, Marcin; Boniecki, Adam; Mnich, Maciej; Basu, Subhadip; Plewczynski, Dariusz

    2015-01-01

    Accurate identification of protein-protein interactions (PPI) is the key step in understanding proteins' biological functions, which are typically context-dependent. Many existing PPI predictors rely on aggregated features from protein sequences, however only a few methods exploit local information about specific residue contacts. In this work we present a two-stage machine learning approach for prediction of protein-protein interactions. We start with the carefully filtered data on protein complexes available for Saccharomyces cerevisiae in the Protein Data Bank (PDB) database. First, we build linear descriptions of interacting and non-interacting sequence segment pairs based on their inter-residue distances. Secondly, we train machine learning classifiers to predict binary segment interactions for any two short sequence fragments. The final prediction of the protein-protein interaction is done using the 2D matrix representation of all-against-all possible interacting sequence segments of both analysed proteins. The level-I predictor achieves 0.88 AUC for micro-scale, i.e., residue-level prediction. The level-II predictor improves the results further by a more complex learning paradigm. We perform 30-fold macro-scale, i.e., protein-level cross-validation experiment. The level-II predictor using PSIPRED-predicted secondary structure reaches 0.70 precision, 0.68 recall, and 0.70 AUC, whereas other popular methods provide results below 0.6 threshold (recall, precision, AUC). Our results demonstrate that multi-scale sequence features aggregation procedure is able to improve the machine learning results by more than 10% as compared to other sequence representations. Prepared datasets and source code for our experimental pipeline are freely available for download from: http://zubekj.github.io/mlppi/ (open source Python implementation, OS independent).

  5. Programmable DNA-binding proteins from Burkholderia provide a fresh perspective on the TALE-like repeat domain.

    PubMed

    de Lange, Orlando; Wolf, Christina; Dietze, Jörn; Elsaesser, Janett; Morbitzer, Robert; Lahaye, Thomas

    2014-06-01

    The tandem repeats of transcription activator like effectors (TALEs) mediate sequence-specific DNA binding using a simple code. Naturally, TALEs are injected by Xanthomonas bacteria into plant cells to manipulate the host transcriptome. In the laboratory TALE DNA binding domains are reprogrammed and used to target a fused functional domain to a genomic locus of choice. Research into the natural diversity of TALE-like proteins may provide resources for the further improvement of current TALE technology. Here we describe TALE-like proteins from the endosymbiotic bacterium Burkholderia rhizoxinica, termed Bat proteins. Bat repeat domains mediate sequence-specific DNA binding with the same code as TALEs, despite less than 40% sequence identity. We show that Bat proteins can be adapted for use as transcription factors and nucleases and that sequence preferences can be reprogrammed. Unlike TALEs, the core repeats of each Bat protein are highly polymorphic. This feature allowed us to explore alternative strategies for the design of custom Bat repeat arrays, providing novel insights into the functional relevance of non-RVD residues. The Bat proteins offer fertile grounds for research into the creation of improved programmable DNA-binding proteins and comparative insights into TALE-like evolution. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.

  6. Grasshopper, a long terminal repeat (LTR) retroelement in the phytopathogenic fungus Magnaporthe grisea.

    PubMed

    Dobinson, K F; Harris, R E; Hamer, J E

    1993-01-01

    The fungal phytopathogen Magnaporthe grisea parasitizes a wide variety of gramineous hosts. In the course of investigating the genetic relationship between pathogen genotype and host specificity we identified a retroelement that is present in some strains of M. grisea that infect finger millet and goosegrass (members of the plant genus Eleusine). The element, designated grasshopper (grh), is present in multiple copies and dispersed throughout the genome. DNA sequence analysis showed that grasshopper contains 198 base pair direct, long terminal repeats (LTRs) with features characteristic of retroviral and retrotransposon LTRs. Within the element we identified an open reading frame with sequences homologous to the reverse transcriptase, RNaseH, and integrase domains of retroelement pol genes. Comparison of the open reading frame with sequences from other retroelements showed that grh is related to the gypsy family of retrotransposons. Comparisons of the distribution of the grasshopper element with other dispersed repeated DNA sequences in M. grisea indicated that grasshopper was present in a broadly dispersed subgroup of Eleusine pathogens, suggesting that the element was acquired subsequent to the evolution of this host-specific form. We present arguments that the amplification of different retroelements within populations of M. grisea is a consequence of the clonal organization of the fungal populations.

  7. Frequency of the first feature in action sequences influences feature binding.

    PubMed

    Mattson, Paul S; Fournier, Lisa R; Behmer, Lawrence P

    2012-10-01

    We investigated whether binding among perception and action feature codes is a preliminary step toward creating a more durable memory trace of an action event. If so, increasing the frequency of a particular event (e.g., a stimulus requiring a movement with the left or right hand in an up or down direction) should increase the strength and speed of feature binding for this event. The results from two experiments, using a partial-repetition paradigm, confirmed that feature binding increased in strength and/or occurred earlier for a high-frequency (e.g., left hand moving up) than for a low-frequency (e.g., right hand moving down) event. Moreover, increasing the frequency of the first-specified feature in the action sequence alone (e.g., "left" hand) increased the strength and/or speed of action feature binding (e.g., between the "left" hand and movement in an "up" or "down" direction). The latter finding suggests an update to the theory of event coding, as not all features in the action sequence equally determine binding strength. We conclude that action planning involves serial binding of features in the order of action feature execution (i.e., associations among features are not bidirectional but are directional), which can lead to a more durable memory trace. This is consistent with physiological evidence suggesting that serial order is preserved in an action plan executed from memory and that the first feature in the action sequence may be critical in preserving this serial order.

  8. A multi-scale analysis of bull sperm methylome revealed both species peculiarities and conserved tissue-specific features.

    PubMed

    Perrier, Jean-Philippe; Sellem, Eli; Prézelin, Audrey; Gasselin, Maxime; Jouneau, Luc; Piumi, François; Al Adhami, Hala; Weber, Michaël; Fritz, Sébastien; Boichard, Didier; Le Danvic, Chrystelle; Schibler, Laurent; Jammes, Hélène; Kiefer, Hélène

    2018-05-29

    Spermatozoa have a remarkable epigenome in line with their degree of specialization, their unique nature and different requirements for successful fertilization. Accordingly, perturbations in the establishment of DNA methylation patterns during male germ cell differentiation have been associated with infertility in several species. While bull semen is widely used in artificial insemination, the literature describing DNA methylation in bull spermatozoa is still scarce. The purpose of this study was therefore to characterize the bull sperm methylome relative to both bovine somatic cells and the sperm of other mammals through a multiscale analysis. The quantification of DNA methylation at CCGG sites using luminometric methylation assay (LUMA) highlighted the undermethylation of bull sperm compared to the sperm of rams, stallions, mice, goats and men. Total blood cells displayed a similarly high level of methylation in bulls and rams, suggesting that undermethylation of the bovine genome was specific to sperm. Annotation of CCGG sites in different species revealed no striking bias in the distribution of genome features targeted by LUMA that could explain undermethylation of bull sperm. To map DNA methylation at a genome-wide scale, bull sperm was compared with bovine liver, fibroblasts and monocytes using reduced representation bisulfite sequencing (RRBS) and immunoprecipitation of methylated DNA followed by microarray hybridization (MeDIP-chip). These two methods exhibited differences in terms of genome coverage, and consistently, two independent sets of sequences differentially methylated in sperm and somatic cells were identified for RRBS and MeDIP-chip. Remarkably, in the two sets most of the differentially methylated sequences were hypomethylated in sperm. In agreement with previous studies in other species, the sequences that were specifically hypomethylated in bull sperm targeted processes relevant to the germline differentiation program (piRNA metabolism, meiosis, spermatogenesis) and sperm functions (cell adhesion, fertilization), as well as satellites and rDNA repeats. These results highlight the undermethylation of bull spermatozoa when compared with both bovine somatic cells and the sperm of other mammals, and raise questions regarding the dynamics of DNA methylation in bovine male germline. Whether sperm undermethylation has potential interactions with structural variation in the cattle genome may deserve further attention.

  9. Protein contact prediction by integrating deep multiple sequence alignments, coevolution and machine learning.

    PubMed

    Adhikari, Badri; Hou, Jie; Cheng, Jianlin

    2018-03-01

    In this study, we report the evaluation of the residue-residue contacts predicted by our three different methods in the CASP12 experiment, focusing on studying the impact of multiple sequence alignment, residue coevolution, and machine learning on contact prediction. The first method (MULTICOM-NOVEL) uses only traditional features (sequence profile, secondary structure, and solvent accessibility) with deep learning to predict contacts and serves as a baseline. The second method (MULTICOM-CONSTRUCT) uses our new alignment algorithm to generate deep multiple sequence alignment to derive coevolution-based features, which are integrated by a neural network method to predict contacts. The third method (MULTICOM-CLUSTER) is a consensus combination of the predictions of the first two methods. We evaluated our methods on 94 CASP12 domains. On a subset of 38 free-modeling domains, our methods achieved an average precision of up to 41.7% for top L/5 long-range contact predictions. The comparison of the three methods shows that the quality and effective depth of multiple sequence alignments, coevolution-based features, and machine learning integration of coevolution-based features and traditional features drive the quality of predicted protein contacts. On the full CASP12 dataset, the coevolution-based features alone can improve the average precision from 28.4% to 41.6%, and the machine learning integration of all the features further raises the precision to 56.3%, when top L/5 predicted long-range contacts are evaluated. And the correlation between the precision of contact prediction and the logarithm of the number of effective sequences in alignments is 0.66. © 2017 Wiley Periodicals, Inc.

  10. SCPRED: accurate prediction of protein structural class for sequences of twilight-zone similarity with predicting sequences.

    PubMed

    Kurgan, Lukasz; Cios, Krzysztof; Chen, Ke

    2008-05-01

    Protein structure prediction methods provide accurate results when a homologous protein is predicted, while poorer predictions are obtained in the absence of homologous templates. However, some protein chains that share twilight-zone pairwise identity can form similar folds and thus determining structural similarity without the sequence similarity would be desirable for the structure prediction. The folding type of a protein or its domain is defined as the structural class. Current structural class prediction methods that predict the four structural classes defined in SCOP provide up to 63% accuracy for the datasets in which sequence identity of any pair of sequences belongs to the twilight-zone. We propose SCPRED method that improves prediction accuracy for sequences that share twilight-zone pairwise similarity with sequences used for the prediction. SCPRED uses a support vector machine classifier that takes several custom-designed features as its input to predict the structural classes. Based on extensive design that considers over 2300 index-, composition- and physicochemical properties-based features along with features based on the predicted secondary structure and content, the classifier's input includes 8 features based on information extracted from the secondary structure predicted with PSI-PRED and one feature computed from the sequence. Tests performed with datasets of 1673 protein chains, in which any pair of sequences shares twilight-zone similarity, show that SCPRED obtains 80.3% accuracy when predicting the four SCOP-defined structural classes, which is superior when compared with over a dozen recent competing methods that are based on support vector machine, logistic regression, and ensemble of classifiers predictors. The SCPRED can accurately find similar structures for sequences that share low identity with sequence used for the prediction. The high predictive accuracy achieved by SCPRED is attributed to the design of the features, which are capable of separating the structural classes in spite of their low dimensionality. We also demonstrate that the SCPRED's predictions can be successfully used as a post-processing filter to improve performance of modern fold classification methods.

  11. SCPRED: Accurate prediction of protein structural class for sequences of twilight-zone similarity with predicting sequences

    PubMed Central

    Kurgan, Lukasz; Cios, Krzysztof; Chen, Ke

    2008-01-01

    Background Protein structure prediction methods provide accurate results when a homologous protein is predicted, while poorer predictions are obtained in the absence of homologous templates. However, some protein chains that share twilight-zone pairwise identity can form similar folds and thus determining structural similarity without the sequence similarity would be desirable for the structure prediction. The folding type of a protein or its domain is defined as the structural class. Current structural class prediction methods that predict the four structural classes defined in SCOP provide up to 63% accuracy for the datasets in which sequence identity of any pair of sequences belongs to the twilight-zone. We propose SCPRED method that improves prediction accuracy for sequences that share twilight-zone pairwise similarity with sequences used for the prediction. Results SCPRED uses a support vector machine classifier that takes several custom-designed features as its input to predict the structural classes. Based on extensive design that considers over 2300 index-, composition- and physicochemical properties-based features along with features based on the predicted secondary structure and content, the classifier's input includes 8 features based on information extracted from the secondary structure predicted with PSI-PRED and one feature computed from the sequence. Tests performed with datasets of 1673 protein chains, in which any pair of sequences shares twilight-zone similarity, show that SCPRED obtains 80.3% accuracy when predicting the four SCOP-defined structural classes, which is superior when compared with over a dozen recent competing methods that are based on support vector machine, logistic regression, and ensemble of classifiers predictors. Conclusion The SCPRED can accurately find similar structures for sequences that share low identity with sequence used for the prediction. The high predictive accuracy achieved by SCPRED is attributed to the design of the features, which are capable of separating the structural classes in spite of their low dimensionality. We also demonstrate that the SCPRED's predictions can be successfully used as a post-processing filter to improve performance of modern fold classification methods. PMID:18452616

  12. BioSAVE: display of scored annotation within a sequence context.

    PubMed

    Pollock, Richard F; Adryan, Boris

    2008-03-20

    Visualization of sequence annotation is a common feature in many bioinformatics tools. For many applications it is desirable to restrict the display of such annotation according to a score cutoff, as biological interpretation can be difficult in the presence of the entire data. Unfortunately, many visualisation solutions are somewhat static in the way they handle such score cutoffs. We present BioSAVE, a sequence annotation viewer with on-the-fly selection of visualisation thresholds for each feature. BioSAVE is a versatile OS X program for visual display of scored features (annotation) within a sequence context. The program reads sequence and additional supplementary annotation data (e.g., position weight matrix matches, conservation scores, structural domains) from a variety of commonly used file formats and displays them graphically. Onscreen controls then allow for live customisation of these graphics, including on-the-fly selection of visualisation thresholds for each feature. Possible applications of the program include display of transcription factor binding sites in a genomic context or the visualisation of structural domain assignments in protein sequences and many more. The dynamic visualisation of these annotations is useful, e.g., for the determination of cutoff values of predicted features to match experimental data. Program, source code and exemplary files are freely available at the BioSAVE homepage.

  13. BioSAVE: Display of scored annotation within a sequence context

    PubMed Central

    Pollock, Richard F; Adryan, Boris

    2008-01-01

    Background Visualization of sequence annotation is a common feature in many bioinformatics tools. For many applications it is desirable to restrict the display of such annotation according to a score cutoff, as biological interpretation can be difficult in the presence of the entire data. Unfortunately, many visualisation solutions are somewhat static in the way they handle such score cutoffs. Results We present BioSAVE, a sequence annotation viewer with on-the-fly selection of visualisation thresholds for each feature. BioSAVE is a versatile OS X program for visual display of scored features (annotation) within a sequence context. The program reads sequence and additional supplementary annotation data (e.g., position weight matrix matches, conservation scores, structural domains) from a variety of commonly used file formats and displays them graphically. Onscreen controls then allow for live customisation of these graphics, including on-the-fly selection of visualisation thresholds for each feature. Conclusion Possible applications of the program include display of transcription factor binding sites in a genomic context or the visualisation of structural domain assignments in protein sequences and many more. The dynamic visualisation of these annotations is useful, e.g., for the determination of cutoff values of predicted features to match experimental data. Program, source code and exemplary files are freely available at the BioSAVE homepage. PMID:18366701

  14. Automatic detection of solar features in HSOS full-disk solar images using guided filter

    NASA Astrophysics Data System (ADS)

    Yuan, Fei; Lin, Jiaben; Guo, Jingjing; Wang, Gang; Tong, Liyue; Zhang, Xinwei; Wang, Bingxiang

    2018-02-01

    A procedure is introduced for the automatic detection of solar features using full-disk solar images from Huairou Solar Observing Station (HSOS), National Astronomical Observatories of China. In image preprocessing, median filter is applied to remove the noises. Guided filter is adopted to enhance the edges of solar features and restrain the solar limb darkening, which is first introduced into the astronomical target detection. Then specific features are detected by Otsu algorithm and further threshold processing technique. Compared with other automatic detection procedures, our procedure has some advantages such as real time and reliability as well as no need of local threshold. Also, it reduces the amount of computation largely, which is benefited from the efficient guided filter algorithm. The procedure has been tested on one month sequences (December 2013) of HSOS full-disk solar images and the result shows that the number of features detected by our procedure is well consistent with the manual one.

  15. Broadband excitation in nuclear magnetic resonance

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Tycko, Robert

    1984-10-01

    Theoretical methods for designing sequences of radio frequency (rf) radiation pulses for broadband excitation of spin systems in nuclear magnetic resonance (NMR) are described. The sequences excite spins uniformly over large ranges of resonant frequencies arising from static magnetic field inhomogeneity, chemical shift differences, or spin couplings, or over large ranges of rf field amplitudes. Specific sequences for creating a population inversion or transverse magnetization are derived and demonstrated experimentally in liquid and solid state NMR. One approach to broadband excitation is based on principles of coherent averaging theory. A general formalism for deriving pulse sequences is given, along withmore » computational methods for specific cases. This approach leads to sequences that produce strictly constant transformations of a spin system. The importance of this feature in NMR applications is discussed. A second approach to broadband excitation makes use of iterative schemes, i.e. sets of operations that are applied repetitively to a given initial pulse sequences, generating a series of increasingly complex sequences with increasingly desirable properties. A general mathematical framework for analyzing iterative schemes is developed. An iterative scheme is treated as a function that acts on a space of operators corresponding to the transformations produced by all possible pulse sequences. The fixed points of the function and the stability of the fixed points are shown to determine the essential behavior of the scheme. Iterative schemes for broadband population inversion are treated in detail. Algebraic and numerical methods for performing the mathematical analysis are presented. Two additional topics are treated. The first is the construction of sequences for uniform excitation of double-quantum coherence and for uniform polarization transfer over a range of spin couplings. Double-quantum excitation sequences are demonstrated in a liquid crystal system. The second additional topic is the construction of iterative schemes for narrowband population inversion. The use of sequences that invert spin populations only over a narrow range of rf field amplitudes to spatially localize NMR signals in an rf field gradient is discussed.« less

  16. Recognizing human activities using appearance metric feature and kinematics feature

    NASA Astrophysics Data System (ADS)

    Qian, Huimin; Zhou, Jun; Lu, Xinbiao; Wu, Xinye

    2017-05-01

    The problem of automatically recognizing human activities from videos through the fusion of the two most important cues, appearance metric feature and kinematics feature, is considered. And a system of two-dimensional (2-D) Poisson equations is introduced to extract the more discriminative appearance metric feature. Specifically, the moving human blobs are first detected out from the video by background subtraction technique to form a binary image sequence, from which the appearance feature designated as the motion accumulation image and the kinematics feature termed as centroid instantaneous velocity are extracted. Second, 2-D discrete Poisson equations are employed to reinterpret the motion accumulation image to produce a more differentiated Poisson silhouette image, from which the appearance feature vector is created through the dimension reduction technique called bidirectional 2-D principal component analysis, considering the balance between classification accuracy and time consumption. Finally, a cascaded classifier based on the nearest neighbor classifier and two directed acyclic graph support vector machine classifiers, integrated with the fusion of the appearance feature vector and centroid instantaneous velocity vector, is applied to recognize the human activities. Experimental results on the open databases and a homemade one confirm the recognition performance of the proposed algorithm.

  17. The third annual BRDS on research and development of nucleic acid-based nanomedicines

    PubMed Central

    Chaudhary, Amit Kumar

    2017-01-01

    The completion of human genome project, decrease in the sequencing cost, and correlation of genome sequencing data with specific diseases led to the exponential rise in the nucleic acid-based therapeutic approaches. In the third annual Biopharmaceutical Research and Development Symposium (BRDS) held at the Center for Drug Discovery and Lozier Center for Pharmacy Sciences and Education at the University of Nebraska Medical Center (UNMC), we highlighted the remarkable features of the nucleic acid-based nanomedicines, their significance, NIH funding opportunities on nanomedicines and gene therapy research, challenges and opportunities in the clinical translation of nucleic acids into therapeutics, and the role of intellectual property (IP) in drug discovery and development. PMID:27848223

  18. SWI enhances vein detection using gadolinium in multiple sclerosis

    PubMed Central

    Mazzoni, Lorenzo N; Moretti, Marco; Grammatico, Matteo; Chiti, Stefano; Massacesi, Luca

    2015-01-01

    Susceptibility weighted imaging (SWI) combined with the FLAIR sequence provides the ability to depict in vivo the perivenous location of inflammatory demyelinating lesions – one of the most specific pathologic features of multiple sclerosis (MS). In addition, in MS white matter (WM) lesions, gadolinium-based contrast media (CM) can increase vein signal loss on SWI. This report focuses on two cases of WM inflammatory lesions enhancing on SWI images after CM injection. In these lesions in fact the CM increased the contrast between the parenchyma and the central vein allowing as well, in one of the two cases, the detection of a vein not visible on the same SWI sequence acquired before CM injection. PMID:25815209

  19. Revisiting the "enigma" of musicians with dyslexia: Auditory sequencing and speech abilities.

    PubMed

    Zuk, Jennifer; Bishop-Liebler, Paula; Ozernov-Palchik, Ola; Moore, Emma; Overy, Katie; Welch, Graham; Gaab, Nadine

    2017-04-01

    Previous research has suggested a link between musical training and auditory processing skills. Musicians have shown enhanced perception of auditory features critical to both music and speech, suggesting that this link extends beyond basic auditory processing. It remains unclear to what extent musicians who also have dyslexia show these specialized abilities, considering often-observed persistent deficits that coincide with reading impairments. The present study evaluated auditory sequencing and speech discrimination in 52 adults comprised of musicians with dyslexia, nonmusicians with dyslexia, and typical musicians. An auditory sequencing task measuring perceptual acuity for tone sequences of increasing length was administered. Furthermore, subjects were asked to discriminate synthesized syllable continua varying in acoustic components of speech necessary for intraphonemic discrimination, which included spectral (formant frequency) and temporal (voice onset time [VOT] and amplitude envelope) features. Results indicate that musicians with dyslexia did not significantly differ from typical musicians and performed better than nonmusicians with dyslexia for auditory sequencing as well as discrimination of spectral and VOT cues within syllable continua. However, typical musicians demonstrated superior performance relative to both groups with dyslexia for discrimination of syllables varying in amplitude information. These findings suggest a distinct profile of speech processing abilities in musicians with dyslexia, with specific weaknesses in discerning amplitude cues within speech. Because these difficulties seem to remain persistent in adults with dyslexia despite musical training, this study only partly supports the potential for musical training to enhance the auditory processing skills known to be crucial for literacy in individuals with dyslexia. (PsycINFO Database Record (c) 2017 APA, all rights reserved).

  20. Complex multi-enhancer contacts captured by Genome Architecture Mapping (GAM)

    PubMed Central

    Beagrie, Robert A.; Scialdone, Antonio; Schueler, Markus; Kraemer, Dorothee C.A.; Chotalia, Mita; Xie, Sheila Q.; Barbieri, Mariano; de Santiago, Inês; Lavitas, Liron-Mark; Branco, Miguel R.; Fraser, James; Dostie, Josée; Game, Laurence; Dillon, Niall; Edwards, Paul A.W.; Nicodemi, Mario; Pombo, Ana

    2017-01-01

    Summary The organization of the genome in the nucleus and the interactions of genes with their regulatory elements are key features of transcriptional control and their disruption can cause disease. We developed a novel genome-wide method, Genome Architecture Mapping (GAM), for measuring chromatin contacts, and other features of three-dimensional chromatin topology, based on sequencing DNA from a large collection of thin nuclear sections. We apply GAM to mouse embryonic stem cells and identify an enrichment for specific interactions between active genes and enhancers across very large genomic distances, using a mathematical model ‘SLICE’ (Statistical Inference of Co-segregation). GAM also reveals an abundance of three-way contacts genome-wide, especially between regions that are highly transcribed or contain super-enhancers, highlighting a previously inaccessible complexity in genome architecture and a major role for gene-expression specific contacts in organizing the genome in mammalian nuclei. PMID:28273065

  1. Sequence features and phylogenetic analysis of the stress protein Hsp90α in chinook salmon Oncorhynchus tshawytscha, a poikilothermic vertebrate

    USGS Publications Warehouse

    Palmisano, Aldo N.; Winton, James R.; Dickhoff, Walton W.

    1999-01-01

    We cloned and sequenced a chinook salmon Hsp90 cDNA; sequence analysis shows it to be Hsp90??. Phylogenetic analysis supports the hypothesis that ?? and ?? paralogs of Hsp90 arose as a result of a gene duplication event and that they diverged early in the evolution of vertebrates, before tetrapods separated from the teleost lineage. Among several differences distinguishing poikilothermic Hsp90?? sequences from their bird and mammal orthologs, the teleost versions specifically lack a characteristic QTQDQP phosphorylation site near the N-terminus. We used the cDNA to develop an RNA (Northern) blot to quantify cellular Hsp90 mRNA levels. Chinook salmon embryonic (CHSE-214) cells responded to heat shock with a rapid rise in Hsp90 mRNA through 4 h, followed by a gradual decline over the next 20 h. Hsp90 mRNA level may be useful as a stress indicator, especially in a laboratory setting or in response to acute heat stress.

  2. High-throughput and site-specific identification of 2'-O-methylation sites using ribose oxidation sequencing (RibOxi-seq).

    PubMed

    Zhu, Yinzhou; Pirnie, Stephan P; Carmichael, Gordon G

    2017-08-01

    Ribose methylation (2'- O -methylation, 2'- O Me) occurs at high frequencies in rRNAs and other small RNAs and is carried out using a shared mechanism across eukaryotes and archaea. As RNA modifications are important for ribosome maturation, and alterations in these modifications are associated with cellular defects and diseases, it is important to characterize the landscape of 2'- O -methylation. Here we report the development of a highly sensitive and accurate method for ribose methylation detection using next-generation sequencing. A key feature of this method is the generation of RNA fragments with random 3'-ends, followed by periodate oxidation of all molecules terminating in 2',3'-OH groups. This allows only RNAs harboring 2'-OMe groups at their 3'-ends to be sequenced. Although currently requiring microgram amounts of starting material, this method is robust for the analysis of rRNAs even at low sequencing depth. © 2017 Zhu et al.; Published by Cold Spring Harbor Laboratory Press for the RNA Society.

  3. Uniform, optimal signal processing of mapped deep-sequencing data.

    PubMed

    Kumar, Vibhor; Muratani, Masafumi; Rayan, Nirmala Arul; Kraus, Petra; Lufkin, Thomas; Ng, Huck Hui; Prabhakar, Shyam

    2013-07-01

    Despite their apparent diversity, many problems in the analysis of high-throughput sequencing data are merely special cases of two general problems, signal detection and signal estimation. Here we adapt formally optimal solutions from signal processing theory to analyze signals of DNA sequence reads mapped to a genome. We describe DFilter, a detection algorithm that identifies regulatory features in ChIP-seq, DNase-seq and FAIRE-seq data more accurately than assay-specific algorithms. We also describe EFilter, an estimation algorithm that accurately predicts mRNA levels from as few as 1-2 histone profiles (R ∼0.9). Notably, the presence of regulatory motifs in promoters correlates more with histone modifications than with mRNA levels, suggesting that histone profiles are more predictive of cis-regulatory mechanisms. We show by applying DFilter and EFilter to embryonic forebrain ChIP-seq data that regulatory protein identification and functional annotation are feasible despite tissue heterogeneity. The mathematical formalism underlying our tools facilitates integrative analysis of data from virtually any sequencing-based functional profile.

  4. MToolBox: a highly automated pipeline for heteroplasmy annotation and prioritization analysis of human mitochondrial variants in high-throughput sequencing

    PubMed Central

    Diroma, Maria Angela; Santorsola, Mariangela; Guttà, Cristiano; Gasparre, Giuseppe; Picardi, Ernesto; Pesole, Graziano; Attimonelli, Marcella

    2014-01-01

    Motivation: The increasing availability of mitochondria-targeted and off-target sequencing data in whole-exome and whole-genome sequencing studies (WXS and WGS) has risen the demand of effective pipelines to accurately measure heteroplasmy and to easily recognize the most functionally important mitochondrial variants among a huge number of candidates. To this purpose, we developed MToolBox, a highly automated pipeline to reconstruct and analyze human mitochondrial DNA from high-throughput sequencing data. Results: MToolBox implements an effective computational strategy for mitochondrial genomes assembling and haplogroup assignment also including a prioritization analysis of detected variants. MToolBox provides a Variant Call Format file featuring, for the first time, allele-specific heteroplasmy and annotation files with prioritized variants. MToolBox was tested on simulated samples and applied on 1000 Genomes WXS datasets. Availability and implementation: MToolBox package is available at https://sourceforge.net/projects/mtoolbox/. Contact: marcella.attimonelli@uniba.it Supplementary information: Supplementary data are available at Bioinformatics online. PMID:25028726

  5. A corticostriatal deficit promotes temporal distortion of automatic action in ageing

    PubMed Central

    Matamales, Miriam; Skrbis, Zala; Bailey, Matthew R; Balsam, Peter D; Balleine, Bernard W; Götz, Jürgen

    2017-01-01

    The acquisition of motor skills involves implementing action sequences that increase task efficiency while reducing cognitive loads. This learning capacity depends on specific cortico-basal ganglia circuits that are affected by normal ageing. Here, combining a series of novel behavioural tasks with extensive neuronal mapping and targeted cell manipulations in mice, we explored how ageing of cortico-basal ganglia networks alters the microstructure of action throughout sequence learning. We found that, after extended training, aged mice produced shorter actions and displayed squeezed automatic behaviours characterised by ultrafast oligomeric action chunks that correlated with deficient reorganisation of corticostriatal activity. Chemogenetic disruption of a striatal subcircuit in young mice reproduced age-related within-sequence features, and the introduction of an action-related feedback cue temporarily restored normal sequence structure in aged mice. Our results reveal static properties of aged cortico-basal ganglia networks that introduce temporal limits to action automaticity, something that can compromise procedural learning in ageing. PMID:29058672

  6. Analysis of temporal transcription expression profiles reveal links between protein function and developmental stages of Drosophila melanogaster.

    PubMed

    Wan, Cen; Lees, Jonathan G; Minneci, Federico; Orengo, Christine A; Jones, David T

    2017-10-01

    Accurate gene or protein function prediction is a key challenge in the post-genome era. Most current methods perform well on molecular function prediction, but struggle to provide useful annotations relating to biological process functions due to the limited power of sequence-based features in that functional domain. In this work, we systematically evaluate the predictive power of temporal transcription expression profiles for protein function prediction in Drosophila melanogaster. Our results show significantly better performance on predicting protein function when transcription expression profile-based features are integrated with sequence-derived features, compared with the sequence-derived features alone. We also observe that the combination of expression-based and sequence-based features leads to further improvement of accuracy on predicting all three domains of gene function. Based on the optimal feature combinations, we then propose a novel multi-classifier-based function prediction method for Drosophila melanogaster proteins, FFPred-fly+. Interpreting our machine learning models also allows us to identify some of the underlying links between biological processes and developmental stages of Drosophila melanogaster.

  7. Protein-RNA interface residue prediction using machine learning: an assessment of the state of the art.

    PubMed

    Walia, Rasna R; Caragea, Cornelia; Lewis, Benjamin A; Towfic, Fadi; Terribilini, Michael; El-Manzalawy, Yasser; Dobbs, Drena; Honavar, Vasant

    2012-05-10

    RNA molecules play diverse functional and structural roles in cells. They function as messengers for transferring genetic information from DNA to proteins, as the primary genetic material in many viruses, as catalysts (ribozymes) important for protein synthesis and RNA processing, and as essential and ubiquitous regulators of gene expression in living organisms. Many of these functions depend on precisely orchestrated interactions between RNA molecules and specific proteins in cells. Understanding the molecular mechanisms by which proteins recognize and bind RNA is essential for comprehending the functional implications of these interactions, but the recognition 'code' that mediates interactions between proteins and RNA is not yet understood. Success in deciphering this code would dramatically impact the development of new therapeutic strategies for intervening in devastating diseases such as AIDS and cancer. Because of the high cost of experimental determination of protein-RNA interfaces, there is an increasing reliance on statistical machine learning methods for training predictors of RNA-binding residues in proteins. However, because of differences in the choice of datasets, performance measures, and data representations used, it has been difficult to obtain an accurate assessment of the current state of the art in protein-RNA interface prediction. We provide a review of published approaches for predicting RNA-binding residues in proteins and a systematic comparison and critical assessment of protein-RNA interface residue predictors trained using these approaches on three carefully curated non-redundant datasets. We directly compare two widely used machine learning algorithms (Naïve Bayes (NB) and Support Vector Machine (SVM)) using three different data representations in which features are encoded using either sequence- or structure-based windows. Our results show that (i) Sequence-based classifiers that use a position-specific scoring matrix (PSSM)-based representation (PSSMSeq) outperform those that use an amino acid identity based representation (IDSeq) or a smoothed PSSM (SmoPSSMSeq); (ii) Structure-based classifiers that use smoothed PSSM representation (SmoPSSMStr) outperform those that use PSSM (PSSMStr) as well as sequence identity based representation (IDStr). PSSMSeq classifiers, when tested on an independent test set of 44 proteins, achieve performance that is comparable to that of three state-of-the-art structure-based predictors (including those that exploit geometric features) in terms of Matthews Correlation Coefficient (MCC), although the structure-based methods achieve substantially higher Specificity (albeit at the expense of Sensitivity) compared to sequence-based methods. We also find that the expected performance of the classifiers on a residue level can be markedly different from that on a protein level. Our experiments show that the classifiers trained on three different non-redundant protein-RNA interface datasets achieve comparable cross-validation performance. However, we find that the results are significantly affected by differences in the distance threshold used to define interface residues. Our results demonstrate that protein-RNA interface residue predictors that use a PSSM-based encoding of sequence windows outperform classifiers that use other encodings of sequence windows. While structure-based methods that exploit geometric features can yield significant increases in the Specificity of protein-RNA interface residue predictions, such increases are offset by decreases in Sensitivity. These results underscore the importance of comparing alternative methods using rigorous statistical procedures, multiple performance measures, and datasets that are constructed based on several alternative definitions of interface residues and redundancy cutoffs as well as including evaluations on independent test sets into the comparisons.

  8. Skeletal development in the African elephant and ossification timing in placental mammals

    PubMed Central

    Hautier, Lionel; Stansfield, Fiona J.; Allen, W. R. Twink; Asher, Robert J.

    2012-01-01

    We provide here unique data on elephant skeletal ontogeny. We focus on the sequence of cranial and post-cranial ossification events during growth in the African elephant (Loxodonta africana). Previous analyses on ossification sequences in mammals have focused on monotremes, marsupials, boreoeutherian and xenarthran placentals. Here, we add data on ossification sequences in an afrotherian. We use two different methods to quantify sequence heterochrony: the sequence method and event-paring/Parsimov. Compared with other placentals, elephants show late ossifications of the basicranium, manual and pedal phalanges, and early ossifications of the ischium and metacarpals. Moreover, ossification in elephants starts very early and progresses rapidly. Specifically, the elephant exhibits the same percentage of bones showing an ossification centre at the end of the first third of its gestation period as the mouse and hamster have close to birth. Elephants show a number of features of their ossification patterns that differ from those of other placental mammals. The pattern of the initiation of the ossification evident in the African elephant underscores a possible correlation between the timing of ossification onset and gestation time throughout mammals. PMID:22298853

  9. Phylogenetic origins of the plant mitochondrion based on a comparative analysis of 5S ribosomal RNA sequences

    NASA Technical Reports Server (NTRS)

    Villanueva, E.; Delihas, N.; Luehrsen, K. R.; Fox, G. E.; Gibson, J.

    1985-01-01

    The complete nucleotide sequences of 5S ribosomal RNAs from Rhodocyclus gelatinosa, Rhodobacter sphaeroides, and Pseudomonas cepacia were determined. Comparisons of these 5S RNA sequences show that rather than being phylogenetically related to one another, the two photosynthetic bacterial 5S RNAs share more sequence and signature homology with the RNAs of two nonphotosynthetic strains. Rhodobacter sphaeroides is specifically related to Paracoccus denitrificans and Rc. gelatinosa is related to Ps. cepacia. These results support earlier 16S ribosomal RNA studies and add two important groups to the 5S RNA data base. Unique 5S RNA structural features previously found in P. denitrificans are present also in the 5S RNA of Rb. sphaeroides; these provide the basis for subdivisional signatures. The immediate consequence of obtaining these new sequences is that it is possible to clarify the phylogenetic origins of the plant mitochondrion. In particular, a close phylogenetic relationship is found between the plant mitochondria and members of the alpha subdivision of the purple photosynthetic bacteria, namely, Rb. sphaeroides, P. denitrificans, and Rhodospirillum rubrum.

  10. Conflict Background Triggered Congruency Sequence Effects in Graphic Judgment Task

    PubMed Central

    Zhao, Liang; Wang, Yonghui

    2013-01-01

    Congruency sequence effects refer to the reduction of congruency effects when following an incongruent trial than following a congruent trial. The conflict monitoring account, one of the most influential contributions to this effect, assumes that the sequential modulations are evoked by response conflict. The present study aimed at exploring the congruency sequence effects in the absence of response conflict. We found congruency sequence effects occurred in graphic judgment task, in which the conflict stimuli acted as irrelevant information. The findings reveal that processing task-irrelevant conflict stimulus features could also induce sequential modulations of interference. The results do not support the interpretation of conflict monitoring and favor a feature integration account that the congruency sequence effects are attributed to the repetitions of stimulus and response features. PMID:23372766

  11. MerCat: a versatile k-mer counter and diversity estimator for database-independent property analysis obtained from metagenomic and/or metatranscriptomic sequencing data

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    White, Richard A.; Panyala, Ajay R.; Glass, Kevin A.

    MerCat is a parallel, highly scalable and modular property software package for robust analysis of features in next-generation sequencing data. MerCat inputs include assembled contigs and raw sequence reads from any platform resulting in feature abundance counts tables. MerCat allows for direct analysis of data properties without reference sequence database dependency commonly used by search tools such as BLAST and/or DIAMOND for compositional analysis of whole community shotgun sequencing (e.g. metagenomes and metatranscriptomes).

  12. High-resolution characterization of sequence signatures due to non-random cleavage of cell-free DNA.

    PubMed

    Chandrananda, Dineika; Thorne, Natalie P; Bahlo, Melanie

    2015-06-17

    High-throughput sequencing of cell-free DNA fragments found in human plasma has been used to non-invasively detect fetal aneuploidy, monitor organ transplants and investigate tumor DNA. However, many biological properties of this extracellular genetic material remain unknown. Research that further characterizes circulating DNA could substantially increase its diagnostic value by allowing the application of more sophisticated bioinformatics tools that lead to an improved signal to noise ratio in the sequencing data. In this study, we investigate various features of cell-free DNA in plasma using deep-sequencing data from two pregnant women (>70X, >50X) and compare them with matched cellular DNA. We utilize a descriptive approach to examine how the biological cleavage of cell-free DNA affects different sequence signatures such as fragment lengths, sequence motifs at fragment ends and the distribution of cleavage sites along the genome. We show that the size distributions of these cell-free DNA molecules are dependent on their autosomal and mitochondrial origin as well as the genomic location within chromosomes. DNA mapping to particular microsatellites and alpha repeat elements display unique size signatures. We show how cell-free fragments occur in clusters along the genome, localizing to nucleosomal arrays and are preferentially cleaved at linker regions by correlating the mapping locations of these fragments with ENCODE annotation of chromatin organization. Our work further demonstrates that cell-free autosomal DNA cleavage is sequence dependent. The region spanning up to 10 positions on either side of the DNA cleavage site show a consistent pattern of preference for specific nucleotides. This sequence motif is present in cleavage sites localized to nucleosomal cores and linker regions but is absent in nucleosome-free mitochondrial DNA. These background signals in cell-free DNA sequencing data stem from the non-random biological cleavage of these fragments. This sequence structure can be harnessed to improve bioinformatics algorithms, in particular for CNV and structural variant detection. Descriptive measures for cell-free DNA features developed here could also be used in biomarker analysis to monitor the changes that occur during different pathological conditions.

  13. Presence of a Shared 5'-Leader Sequence in Ancestral Human and Mammalian Retroviruses and Its Transduction into Feline Leukemia Virus.

    PubMed

    Kawasaki, Junna; Kawamura, Maki; Ohsato, Yoshiharu; Ito, Jumpei; Nishigaki, Kazuo

    2017-10-15

    Recombination events induce significant genetic changes, and this process can result in virus genetic diversity or in the generation of novel pathogenicity. We discovered a new recombinant feline leukemia virus (FeLV) gag gene harboring an unrelated insertion, termed the X region, which was derived from Felis catus endogenous gammaretrovirus 4 (FcERV-gamma4). The identified FcERV-gamma4 proviruses have lost their coding capabilities, but some can express their viral RNA in feline tissues. Although the X-region-carrying recombinant FeLVs appeared to be replication-defective viruses, they were detected in 6.4% of tested FeLV-infected cats. All isolated recombinant FeLV clones commonly incorporated a middle part of the FcERV-gamma4 5'-leader region as an X region. Surprisingly, a sequence corresponding to the portion contained in all X regions is also present in at least 13 endogenous retroviruses (ERVs) observed in the cat, human, primate, and pig genomes. We termed this shared genetic feature the commonly shared (CS) sequence. Despite our phylogenetic analysis indicating that all CS-sequence-carrying ERVs are classified as gammaretroviruses, no obvious closeness was revealed among these ERVs. However, the Shannon entropy in the CS sequence was lower than that in other parts of the provirus genome. Notably, the CS sequence of human endogenous retrovirus T had 73.8% similarity with that of FcERV-gamma4, and specific signals were detected in the human genome by Southern blot analysis using a probe for the FcERV-gamma4 CS sequence. Our results provide an interesting evolutionary history for CS-sequence circulation among several distinct ancestral viruses and a novel recombined virus over a prolonged period. IMPORTANCE Recombination among ERVs or modern viral genomes causes a rapid evolution of retroviruses, and this phenomenon can result in the serious situation of viral disease reemergence. We identified a novel recombinant FeLV gag gene that contains an unrelated sequence, termed the X region. This region originated from the 5' leader of FcERV-gamma4, a replication-incompetent feline ERV. Surprisingly, a sequence corresponding to the X region is also present in the 5' portion of other ERVs, including human endogenous retroviruses. Scattered copies of the ERVs carrying the unique genetic feature, here named the commonly shared (CS) sequence, were found in each host genome, suggesting that ancestral viruses may have captured and maintained the CS sequence. More recently, a novel recombinant FeLV hijacked the CS sequence from inactivated FcERV-gamma4 as the X region. Therefore, tracing the CS sequences can provide unique models for not only the modern reservoir of new recombinant viruses but also the genetic features shared among ancient retroviruses. Copyright © 2017 American Society for Microbiology.

  14. Drinking from the Fire Hose: Why the Flight Management System Can Be Hard to Train and Difficult to Use

    NASA Technical Reports Server (NTRS)

    Sherry, Lance; Feary, Michael; Polson, Peter; Fennell, Karl

    2003-01-01

    The Flight Management Computer (FMC) and its interface, the Multi-function Control and Display Unit (MCDU) have been identified by researchers and airlines as difficult to train and use. Specifically, airline pilots have described the "drinking from the fire-hose" effect during training. Previous research has identified memorized action sequences as a major factor in a user s ability to learn and operate complex devices. This paper discusses the use of a method to examine the quantity of memorized action sequences required to perform a sample of 102 tasks, using features of the Boeing 777 Flight Management Computer Interface. The analysis identified a large number of memorized action sequences that must be learned during training and then recalled during line operations. Seventy-five percent of the tasks examined require recall of at least one memorized action sequence. Forty-five percent of the tasks require recall of a memorized action sequence and occur infrequently. The large number of memorized action sequences may provide an explanation for the difficulties in training and usage of the automation. Based on these findings, implications for training and the design of new user-interfaces are discussed.

  15. Novel insertion mutation of ABCB1 gene in an ivermectin-sensitive Border Collie.

    PubMed

    Han, Jae-Ik; Son, Hyoung-Won; Park, Seung-Cheol; Na, Ki-Jeong

    2010-12-01

    P-glycoprotein (P-gp) is encoded by the ABCB1 gene and acts as an efflux pump for xenobiotics. In the Border Collie, a nonsense mutation caused by a 4-base pair deletion in the ABCB1 gene is associated with a premature stop to P-gp synthesis. In this study, we examined the full-length coding sequence of the ABCB1 gene in an ivermectin-sensitive Border Collie that lacked the aforementioned deletion mutation. The sequence was compared to the corresponding sequences of a wild-type Beagle and seven ivermectin-tolerant family members of the Border Collie. When compared to the wild-type Beagle sequence, that of the ivermectin-sensitive Border Collie was found to have one insertion mutation and eight single nucleotide polymorphisms (SNPs) in the coding sequence of the ABCB1 gene. While the eight SNPs were also found in the family members' sequences, the insertion mutation was found only in the ivermectin-sensitive dog. These results suggest the possibility that the SNPs are species-specific features of the ABCB1 gene in Border Collies, and that the insertion mutation may be related to ivermectin intolerance.

  16. Novel insertion mutation of ABCB1 gene in an ivermectin-sensitive Border Collie

    PubMed Central

    Han, Jae-Ik; Son, Hyoung-Won; Park, Seung-Cheol

    2010-01-01

    P-glycoprotein (P-gp) is encoded by the ABCB1 gene and acts as an efflux pump for xenobiotics. In the Border Collie, a nonsense mutation caused by a 4-base pair deletion in the ABCB1 gene is associated with a premature stop to P-gp synthesis. In this study, we examined the full-length coding sequence of the ABCB1 gene in an ivermectin-sensitive Border Collie that lacked the aforementioned deletion mutation. The sequence was compared to the corresponding sequences of a wild-type Beagle and seven ivermectin-tolerant family members of the Border Collie. When compared to the wild-type Beagle sequence, that of the ivermectin-sensitive Border Collie was found to have one insertion mutation and eight single nucleotide polymorphisms (SNPs) in the coding sequence of the ABCB1 gene. While the eight SNPs were also found in the family members' sequences, the insertion mutation was found only in the ivermectin-sensitive dog. These results suggest the possibility that the SNPs are species-specific features of the ABCB1 gene in Border Collies, and that the insertion mutation may be related to ivermectin intolerance. PMID:21113104

  17. Comparative sequence analysis suggests a conserved gating mechanism for TRP channels

    PubMed Central

    Palovcak, Eugene; Delemotte, Lucie; Klein, Michael L.

    2015-01-01

    The transient receptor potential (TRP) channel superfamily plays a central role in transducing diverse sensory stimuli in eukaryotes. Although dissimilar in sequence and domain organization, all known TRP channels act as polymodal cellular sensors and form tetrameric assemblies similar to those of their distant relatives, the voltage-gated potassium (Kv) channels. Here, we investigated the related questions of whether the allosteric mechanism underlying polymodal gating is common to all TRP channels, and how this mechanism differs from that underpinning Kv channel voltage sensitivity. To provide insight into these questions, we performed comparative sequence analysis on large, comprehensive ensembles of TRP and Kv channel sequences, contextualizing the patterns of conservation and correlation observed in the TRP channel sequences in light of the well-studied Kv channels. We report sequence features that are specific to TRP channels and, based on insight from recent TRPV1 structures, we suggest a model of TRP channel gating that differs substantially from the one mediating voltage sensitivity in Kv channels. The common mechanism underlying polymodal gating involves the displacement of a defect in the H-bond network of S6 that changes the orientation of the pore-lining residues at the hydrophobic gate. PMID:26078053

  18. GeneChip{sup {trademark}} screening assay for cystic fibrosis mutations

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Cronn, M.T.; Miyada, C.G.; Fucini, R.V.

    1994-09-01

    GeneChip{sup {trademark}} assays are based on high density, carefully designed arrays of short oligonucleotide probes (13-16 bases) built directly on derivatized silica substrates. DNA target sequence analysis is achieved by hybridizing fluorescently labeled amplification products to these arrays. Fluorescent hybridization signals located within the probe array are translated into target sequence information using the known probe sequence at each array feature. The mutation screening assay for cystic fibrosis includes sets of oligonucleotide probes designed to detect numerous different mutations that have been described in 14 exons and one intron of the CFTR gene. Each mutation site is addressed by amore » sub-array of at least 40 probe sequences, half designed to detect the wild type gene sequence and half designed to detect the reported mutant sequence. Hybridization with homozygous mutant, homozygous wild type or heterozygous targets results in distinctive hybridization patterns within a sub-array, permitting specific discrimination of each mutation. The GeneChip probe arrays are very small (approximately 1 cm{sup 2}). There miniature size coupled with their high information content make GeneChip probe arrays a useful and practical means for providing CF mutation analysis in a clinical setting.« less

  19. Organization and evolution of highly repeated satellite DNA sequences in plant chromosomes.

    PubMed

    Sharma, S; Raina, S N

    2005-01-01

    A major component of the plant nuclear genome is constituted by different classes of repetitive DNA sequences. The structural, functional and evolutionary aspects of the satellite repetitive DNA families, and their organization in the chromosomes is reviewed. The tandem satellite DNA sequences exhibit characteristic chromosomal locations, usually at subtelomeric and centromeric regions. The repetitive DNA family(ies) may be widely distributed in a taxonomic family or a genus, or may be specific for a species, genome or even a chromosome. They may acquire large-scale variations in their sequence and copy number over an evolutionary time-scale. These features have formed the basis of extensive utilization of repetitive sequences for taxonomic and phylogenetic studies. Hybrid polyploids have especially proven to be excellent models for studying the evolution of repetitive DNA sequences. Recent studies explicitly show that some repetitive DNA families localized at the telomeres and centromeres have acquired important structural and functional significance. The repetitive elements are under different evolutionary constraints as compared to the genes. Satellite DNA families are thought to arise de novo as a consequence of molecular mechanisms such as unequal crossing over, rolling circle amplification, replication slippage and mutation that constitute "molecular drive". Copyright 2005 S. Karger AG, Basel.

  20. Prediction of Protein Modification Sites of Pyrrolidone Carboxylic Acid Using mRMR Feature Selection and Analysis

    PubMed Central

    Zheng, Lu-Lu; Niu, Shen; Hao, Pei; Feng, KaiYan; Cai, Yu-Dong; Li, Yixue

    2011-01-01

    Pyrrolidone carboxylic acid (PCA) is formed during a common post-translational modification (PTM) of extracellular and multi-pass membrane proteins. In this study, we developed a new predictor to predict the modification sites of PCA based on maximum relevance minimum redundancy (mRMR) and incremental feature selection (IFS). We incorporated 727 features that belonged to 7 kinds of protein properties to predict the modification sites, including sequence conservation, residual disorder, amino acid factor, secondary structure and solvent accessibility, gain/loss of amino acid during evolution, propensity of amino acid to be conserved at protein-protein interface and protein surface, and deviation of side chain carbon atom number. Among these 727 features, 244 features were selected by mRMR and IFS as the optimized features for the prediction, with which the prediction model achieved a maximum of MCC of 0.7812. Feature analysis showed that all feature types contributed to the modification process. Further site-specific feature analysis showed that the features derived from PCA's surrounding sites contributed more to the determination of PCA sites than other sites. The detailed feature analysis in this paper might provide important clues for understanding the mechanism of the PCA formation and guide relevant experimental validations. PMID:22174779

  1. Multilevel analysis of sports video sequences

    NASA Astrophysics Data System (ADS)

    Han, Jungong; Farin, Dirk; de With, Peter H. N.

    2006-01-01

    We propose a fully automatic and flexible framework for analysis and summarization of tennis broadcast video sequences, using visual features and specific game-context knowledge. Our framework can analyze a tennis video sequence at three levels, which provides a broad range of different analysis results. The proposed framework includes novel pixel-level and object-level tennis video processing algorithms, such as a moving-player detection taking both the color and the court (playing-field) information into account, and a player-position tracking algorithm based on a 3-D camera model. Additionally, we employ scene-level models for detecting events, like service, base-line rally and net-approach, based on a number real-world visual features. The system can summarize three forms of information: (1) all court-view playing frames in a game, (2) the moving trajectory and real-speed of each player, as well as relative position between the player and the court, (3) the semantic event segments in a game. The proposed framework is flexible in choosing the level of analysis that is desired. It is effective because the framework makes use of several visual cues obtained from the real-world domain to model important events like service, thereby increasing the accuracy of the scene-level analysis. The paper presents attractive experimental results highlighting the system efficiency and analysis capabilities.

  2. Precision of working memory for visual motion sequences and transparent motion surfaces.

    PubMed

    Zokaei, Nahid; Gorgoraptis, Nikos; Bahrami, Bahador; Bays, Paul M; Husain, Masud

    2011-12-01

    Recent studies investigating working memory for location, color, and orientation support a dynamic resource model. We examined whether this might also apply to motion, using random dot kinematograms (RDKs) presented sequentially or simultaneously. Mean precision for motion direction declined as sequence length increased, with precision being lower for earlier RDKs. Two alternative models of working memory were compared specifically to distinguish between the contributions of different sources of error that corrupt memory (W. Zhang & S. J. Luck, 2008 vs. P. M. Bays, R. F. G. Catalao, & M. Husain, 2009). The latter provided a significantly better fit for the data, revealing that decrease in memory precision for earlier items is explained by an increase in interference from other items in a sequence rather than random guessing or a temporal decay of information. Misbinding feature attributes is an important source of error in working memory. Precision of memory for motion direction decreased when two RDKs were presented simultaneously as transparent surfaces, compared to sequential RDKs. However, precision was enhanced when one motion surface was prioritized, demonstrating that selective attention can improve recall precision. These results are consistent with a resource model that can be used as a general conceptual framework for understanding working memory across a range of visual features.

  3. Intentional preparation of auditory attention-switches: Explicit cueing and sequential switch-predictability.

    PubMed

    Seibold, Julia C; Nolden, Sophie; Oberem, Josefa; Fels, Janina; Koch, Iring

    2018-06-01

    In an auditory attention-switching paradigm, participants heard two simultaneously spoken number-words, each presented to one ear, and decided whether the target number was smaller or larger than 5 by pressing a left or right key. An instructional cue in each trial indicated which feature had to be used to identify the target number (e.g., female voice). Auditory attention-switch costs were found when this feature changed compared to when it repeated in two consecutive trials. Earlier studies employing this paradigm showed mixed results when they examined whether such cued auditory attention-switches can be prepared actively during the cue-stimulus interval. This study systematically assessed which preconditions are necessary for the advance preparation of auditory attention-switches. Three experiments were conducted that controlled for cue-repetition benefits, modality switches between cue and stimuli, as well as for predictability of the switch-sequence. Only in the third experiment, in which predictability for an attention-switch was maximal due to a pre-instructed switch-sequence and predictable stimulus onsets, active switch-specific preparation was found. These results suggest that the cognitive system can prepare auditory attention-switches, and this preparation seems to be triggered primarily by the memorised switching-sequence and valid expectations about the time of target onset.

  4. Sequence features associated with the cleavage efficiency of CRISPR/Cas9 system.

    PubMed

    Liu, Xiaoxi; Homma, Ayaka; Sayadi, Jamasb; Yang, Shu; Ohashi, Jun; Takumi, Toru

    2016-01-27

    The CRISPR-Cas9 system has recently emerged as a versatile tool for biological and medical research. In this system, a single guide RNA (sgRNA) directs the endonuclease Cas9 to a targeted DNA sequence for site-specific manipulation. In addition to this targeting function, the sgRNA has also been shown to play a role in activating the endonuclease activity of Cas9. This dual function of the sgRNA likely underlies observations that different sgRNAs have varying on-target activities. Currently, our understanding of the relationship between sequence features of sgRNAs and their on-target cleavage efficiencies remains limited, largely due to difficulties in assessing the cleavage capacity of a large number of sgRNAs. In this study, we evaluated the cleavage activities of 218 sgRNAs using in vitro Surveyor assays. We found that nucleotides at both PAM-distal and PAM-proximal regions of the sgRNA are significantly correlated with on-target efficiency. Furthermore, we also demonstrated that the genomic context of the targeted DNA, the GC percentage, and the secondary structure of sgRNA are critical factors contributing to cleavage efficiency. In summary, our study reveals important parameters for the design of sgRNAs with high on-target efficiencies, especially in the context of high throughput applications.

  5. MUFOLD-SS: New deep inception-inside-inception networks for protein secondary structure prediction.

    PubMed

    Fang, Chao; Shang, Yi; Xu, Dong

    2018-05-01

    Protein secondary structure prediction can provide important information for protein 3D structure prediction and protein functions. Deep learning offers a new opportunity to significantly improve prediction accuracy. In this article, a new deep neural network architecture, named the Deep inception-inside-inception (Deep3I) network, is proposed for protein secondary structure prediction and implemented as a software tool MUFOLD-SS. The input to MUFOLD-SS is a carefully designed feature matrix corresponding to the primary amino acid sequence of a protein, which consists of a rich set of information derived from individual amino acid, as well as the context of the protein sequence. Specifically, the feature matrix is a composition of physio-chemical properties of amino acids, PSI-BLAST profile, and HHBlits profile. MUFOLD-SS is composed of a sequence of nested inception modules and maps the input matrix to either eight states or three states of secondary structures. The architecture of MUFOLD-SS enables effective processing of local and global interactions between amino acids in making accurate prediction. In extensive experiments on multiple datasets, MUFOLD-SS outperformed the best existing methods and other deep neural networks significantly. MUFold-SS can be downloaded from http://dslsrv8.cs.missouri.edu/~cf797/MUFoldSS/download.html. © 2018 Wiley Periodicals, Inc.

  6. Genomic analysis and temperature-dependent transcriptome profiles of the rhizosphere originating strain Pseudomonas aeruginosa M18

    PubMed Central

    2011-01-01

    Background Our previously published reports have described an effective biocontrol agent named Pseudomonas sp. M18 as its 16S rDNA sequence and several regulator genes share homologous sequences with those of P. aeruginosa, but there are several unusual phenotypic features. This study aims to explore its strain specific genomic features and gene expression patterns at different temperatures. Results The complete M18 genome is composed of a single chromosome of 6,327,754 base pairs containing 5684 open reading frames. Seven genomic islands, including two novel prophages and five specific non-phage islands were identified besides the conserved P. aeruginosa core genome. Each prophage contains a putative chitinase coding gene, and the prophage II contains a capB gene encoding a putative cold stress protein. The non-phage genomic islands contain genes responsible for pyoluteorin biosynthesis, environmental substance degradation and type I and III restriction-modification systems. Compared with other P. aeruginosa strains, the fewest number (3) of insertion sequences and the most number (3) of clustered regularly interspaced short palindromic repeats in M18 genome may contribute to the relative genome stability. Although the M18 genome is most closely related to that of P. aeruginosa strain LESB58, the strain M18 is more susceptible to several antimicrobial agents and easier to be erased in a mouse acute lung infection model than the strain LESB58. The whole M18 transcriptomic analysis indicated that 10.6% of the expressed genes are temperature-dependent, with 22 genes up-regulated at 28°C in three non-phage genomic islands and one prophage but none at 37°C. Conclusions The P. aeruginosa strain M18 has evolved its specific genomic structures and temperature dependent expression patterns to meet the requirement of its fitness and competitiveness under selective pressures imposed on the strain in rhizosphere niche. PMID:21884571

  7. Visualization of diffuse centromeres with centromere-specific histone H3 in the holocentric plant Luzula nivea.

    PubMed

    Nagaki, Kiyotaka; Kashihara, Kazunari; Murata, Minoru

    2005-07-01

    Although holocentric species are scattered throughout the plant and animal kingdoms, only holocentric chromosomes of the nematode worm Caenorhabditis elegans have been analyzed with centromeric protein markers. In an effort to determine the holocentric structure in plants, we investigated the snowy woodrush Luzula nivea. From the young roots, a cDNA encoding a putative centromere-specific histone H3 (LnCENH3) was successfully isolated based on sequence similarity among plant CENH3s. The deduced amino acid sequence was then used to raise an anti-LnCENH3 antibody. Immunostaining clearly revealed the diffuse centromere-like structure that appears in the linear shape at prophase to telophase. Furthermore, it was shown that the amount of LnCENH3 decreased significantly at interphase. The polar side positioning on each chromatid at metaphase to anaphase also confirmed that LnCENH3 represents one of the centromere-specific proteins in L. nivea. These data from L. nivea are compared with those from C. elegans, and common features of holocentric chromosomes are discussed.

  8. PUF Proteins: Cellular Functions and Potential Applications.

    PubMed

    Kiani, Seyed Jalal; Taheri, Tahereh; Rafati, Sima; Samimi-Rad, Katayoun

    2017-01-01

    RNA-binding proteins play critical roles in the regulation of gene expression. Among several families of RNA-binding proteins, PUF (Pumilio and FBF) proteins have been the subject of extensive investigations, as they can bind RNA in a sequence-specific manner and they are evolutionarily conserved among a wide range of organisms. The outstanding feature of these proteins is a highly conserved RNA-binding domain, which is known as the Pumilio-homology domain (PUM-HD) that mostly consists of eight tandem repeats. Each repeat recognizes an RNA base with a simple three-letter code that can be programmed in order to change the sequence-specificity of the protein. Using this tailored architecture, researchers have been able to change the specificity of the PUM-HD and target desired transcripts in the cell, even in subcellular compartments. The potential applications of this versatile tool in molecular cell biology seem unbounded and the use of these factors in pharmaceutics might be an interesting field of study in near future. Copyright© Bentham Science Publishers; For any queries, please email at epub@benthamscience.org.

  9. Encapsulins: microbial nanocompartments with applications in biomedicine, nanobiotechnology and materials science.

    PubMed

    Giessen, Tobias W

    2016-10-01

    Compartmentalization is one of the defining features of life. Cells use protein compartments to exert spatial control over their metabolism, store nutrients and create unique microenvironments needed for essential physiological processes. Encapsulins are a recently discovered class of protein nanocompartments found in bacteria and archaea that naturally encapsulate cargo proteins. A short C-terminal targeting sequence directs the highly specific encapsulation process in vivo. Here, I will initially discuss the properties, diversity and putative function of encapsulins. The unique characteristics and potential uses of the self-sorting cargo-packaging process found in encapsulin systems will then be highlighted. Examples for the application of encapsulins as cell-specific optical nanoprobes and targeted therapeutic delivery systems will be discussed with an emphasis on the ability to integrate multiple functionalities within a single nanodevice. By fusing targeting sequences to non-native proteins, encapsulins can also be used as specific nanocontainers and enzymatic nanoreactors in vivo. I will end by briefly discussing future avenues for encapsulin research related to both basic microbial metabolism and applications in biomedicine, catalysis and materials science. Copyright © 2016 Elsevier Ltd. All rights reserved.

  10. Homeologous plastid DNA transformation in tobacco is mediated by multiple recombination events.

    PubMed Central

    Kavanagh, T A; Thanh, N D; Lao, N T; McGrath, N; Peter, S O; Horváth, E M; Dix, P J; Medgyesy, P

    1999-01-01

    Efficient plastid transformation has been achieved in Nicotiana tabacum using cloned plastid DNA of Solanum nigrum carrying mutations conferring spectinomycin and streptomycin resistance. The use of the incompletely homologous (homeologous) Solanum plastid DNA as donor resulted in a Nicotiana plastid transformation frequency comparable with that of other experiments where completely homologous plastid DNA was introduced. Physical mapping and nucleotide sequence analysis of the targeted plastid DNA region in the transformants demonstrated efficient site-specific integration of the 7.8-kb Solanum plastid DNA and the exclusion of the vector DNA. The integration of the cloned Solanum plastid DNA into the Nicotiana plastid genome involved multiple recombination events as revealed by the presence of discontinuous tracts of Solanum-specific sequences that were interspersed between Nicotiana-specific markers. Marked position effects resulted in very frequent cointegration of the nonselected peripheral donor markers located adjacent to the vector DNA. Data presented here on the efficiency and features of homeologous plastid DNA recombination are consistent with the existence of an active RecA-mediated, but a diminished mismatch, recombination/repair system in higher-plant plastids. PMID:10388829

  11. Molecular Design of Antifouling Polymer Brushes Using Sequence-Specific Peptoids.

    PubMed

    Lau, King Hang Aaron; Sileika, Tadas S; Park, Sung Hyun; Sousa, Ana Maria Leal; Burch, Patrick; Szleifer, Igal; Messersmith, Phillip B

    2015-01-07

    Material systems that can be used to flexibly and precisely define the chemical nature and molecular arrangement of a surface would be invaluable for the control of complex biointerfacial interactions. For example, progress in antifouling polymer biointerfaces that prevent non-specific protein adsorption and cell attachment, which can significantly improve the performance of an array of biomedical and industrial applications, is hampered by a lack of chemical models to identify the molecular features conferring their properties. Poly(N-substituted glycine) "peptoids" are peptidomimetic polymers that can be conveniently synthesized with specific monomer sequences and chain lengths, and are presented as a versatile platform for investigating the molecular design of antifouling polymer brushes. Zwitterionic antifouling polymer brushes have captured significant recent attention, and a targeted library of zwitterionic peptoid brushes with a different charge densities, hydration, separations between charged groups, chain lengths, and grafted chain densities, is quantitatively evaluated for their antifouling properties through a range of protein adsorption and cell attachment assays. Specific zwitterionic brush designs were found to give rise to distinct but subtle differences in properties. The results also point to the dominant roles of the grafted chain density and chain length in determining the performance of antifouling polymer brushes.

  12. Genome sequence of the pink–pigmented marine bacterium Loktanella hongkongensis type strain (UST950701–009PT), a representative of the Roseobacter group

    DOE PAGES

    Lau, Stanley CK; Riedel, Thomas; Fiebig, Anne; ...

    2015-08-11

    Loktanella hongkongensis UST950701-009PT is a Gram-negative, non-motile and rod-shaped bacterium isolated from a marine biofilm in the subtropical seawater of Hong Kong. When growing as a monospecies biofilm on polystyrene surfaces, this bacterium is able to induce larval settlement and metamorphosis of a ubiquitous polychaete tubeworm Hydroides elegans. The inductive cues are low-molecular weight compounds bound to the exopolymeric matrix of the bacterial cells. In the present study we describe the features of L. hongkongensis strain DSM 17492T together with its genome sequence and annotation and novel aspects of its phenotype. The 3,198,444 bp long genome sequence encodes 3104 protein-codingmore » genes and 57 RNA genes. Lastly, the two unambiguously identified extrachromosomal replicons contain replication modules of the RepB and the Rhodobacteraceae-specific DnaA-like type, respectively.« less

  13. Minke whale genome and aquatic adaptation in cetaceans

    PubMed Central

    Yim, Hyung-Soon; Cho, Yun Sung; Guang, Xuanmin; Kang, Sung Gyun; Jeong, Jae-Yeon; Cha, Sun-Shin; Oh, Hyun-Myung; Lee, Jae-Hak; Yang, Eun Chan; Kwon, Kae Kyoung; Kim, Yun Jae; Kim, Tae Wan; Kim, Wonduck; Jeon, Jeong Ho; Kim, Sang-Jin; Choi, Dong Han; Jho, Sungwoong; Kim, Hak-Min; Ko, Junsu; Kim, Hyunmin; Shin, Young-Ah; Jung, Hyun-Ju; Zheng, Yuan; Wang, Zhuo; Chen, Yan

    2014-01-01

    The shift from terrestrial to aquatic life by whales was a substantial evolutionary event. Here we report the whole-genome sequencing and de novo assembly of the minke whale genome, as well as the whole-genome sequences of three minke whales, a fin whale, a bottlenose dolphin and a finless porpoise. Our comparative genomic analysis identified an expansion in the whale lineage of gene families associated with stress-responsive proteins and anaerobic metabolism, whereas gene families related to body hair and sensory receptors were contracted. Our analysis also identified whale-specific mutations in genes encoding antioxidants and enzymes controlling blood pressure and salt concentration. Overall the whale-genome sequences exhibited distinct features that are associated with the physiological and morphological changes needed for life in an aquatic environment, marked by resistance to physiological stresses caused by a lack of oxygen, increased amounts of reactive oxygen species and high salt levels. PMID:24270359

  14. Minke whale genome and aquatic adaptation in cetaceans.

    PubMed

    Yim, Hyung-Soon; Cho, Yun Sung; Guang, Xuanmin; Kang, Sung Gyun; Jeong, Jae-Yeon; Cha, Sun-Shin; Oh, Hyun-Myung; Lee, Jae-Hak; Yang, Eun Chan; Kwon, Kae Kyoung; Kim, Yun Jae; Kim, Tae Wan; Kim, Wonduck; Jeon, Jeong Ho; Kim, Sang-Jin; Choi, Dong Han; Jho, Sungwoong; Kim, Hak-Min; Ko, Junsu; Kim, Hyunmin; Shin, Young-Ah; Jung, Hyun-Ju; Zheng, Yuan; Wang, Zhuo; Chen, Yan; Chen, Ming; Jiang, Awei; Li, Erli; Zhang, Shu; Hou, Haolong; Kim, Tae Hyung; Yu, Lili; Liu, Sha; Ahn, Kung; Cooper, Jesse; Park, Sin-Gi; Hong, Chang Pyo; Jin, Wook; Kim, Heui-Soo; Park, Chankyu; Lee, Kyooyeol; Chun, Sung; Morin, Phillip A; O'Brien, Stephen J; Lee, Hang; Kimura, Jumpei; Moon, Dae Yeon; Manica, Andrea; Edwards, Jeremy; Kim, Byung Chul; Kim, Sangsoo; Wang, Jun; Bhak, Jong; Lee, Hyun Sook; Lee, Jung-Hyun

    2014-01-01

    The shift from terrestrial to aquatic life by whales was a substantial evolutionary event. Here we report the whole-genome sequencing and de novo assembly of the minke whale genome, as well as the whole-genome sequences of three minke whales, a fin whale, a bottlenose dolphin and a finless porpoise. Our comparative genomic analysis identified an expansion in the whale lineage of gene families associated with stress-responsive proteins and anaerobic metabolism, whereas gene families related to body hair and sensory receptors were contracted. Our analysis also identified whale-specific mutations in genes encoding antioxidants and enzymes controlling blood pressure and salt concentration. Overall the whale-genome sequences exhibited distinct features that are associated with the physiological and morphological changes needed for life in an aquatic environment, marked by resistance to physiological stresses caused by a lack of oxygen, increased amounts of reactive oxygen species and high salt levels.

  15. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ma, Xiang; Zhang, Shuai; Jiao, Fang

    Two-step nucleation pathways in which disordered, amorphous, or dense liquid states precede appearance of crystalline phases have been reported for a wide range of materials, but the dynamics of such pathways are poorly understood. Moreover, whether these pathways are general features of crystallizing systems or a consequence of system-specific structural details that select for direct vs two-step processes is unknown. Using atomic force microscopy to directly observe crystallization of sequence-defined polymers, we show that crystallization pathways are indeed sequence dependent. When a short hydrophobic region is added to a sequence that directly forms crystalline particles, crystallization instead follows a two-stepmore » pathway that begins with creation of disordered clusters of 10-20 molecules and is characterized by highly non-linear crystallization kinetics in which clusters transform into ordered structures that then enter the growth phase. The results shed new light on non-classical crystallization mechanisms and have implications for design of self-assembling polymer systems.« less

  16. CRISPRTarget

    PubMed Central

    Biswas, Ambarish; Gagnon, Joshua N.; Brouns, Stan J.J.; Fineran, Peter C.; Brown, Chris M.

    2013-01-01

    The bacterial and archaeal CRISPR/Cas adaptive immune system targets specific protospacer nucleotide sequences in invading organisms. This requires base pairing between processed CRISPR RNA and the target protospacer. For type I and II CRISPR/Cas systems, protospacer adjacent motifs (PAM) are essential for target recognition, and for type III, mismatches in the flanking sequences are important in the antiviral response. In this study, we examine the properties of each class of CRISPR. We use this information to provide a tool (CRISPRTarget) that predicts the most likely targets of CRISPR RNAs (http://bioanalysis.otago.ac.nz/CRISPRTarget). This can be used to discover targets in newly sequenced genomic or metagenomic data. To test its utility, we discover features and targets of well-characterized Streptococcus thermophilus and Sulfolobus solfataricus type II and III CRISPR/Cas systems. Finally, in Pectobacterium species, we identify new CRISPR targets and propose a model of temperate phage exposure and subsequent inhibition by the type I CRISPR/Cas systems. PMID:23492433

  17. Separation of endogenous viral elements from infectious Penaeus stylirostris densovirus using recombinase polymerase amplification.

    PubMed

    Jaroenram, Wansadaj; Owens, Leigh

    2014-01-01

    Non-infectious Penaeus stylirostris densovirus (PstDV)-related sequences in the shrimp genome cause false positive results with current PCR protocols. Here, we examined and mapped PstDV insertion profile in the genome of Australian Penaeus monodon. A DNA sequence which is likely to represent infectious PstDV was also identified and used as a target sequence for recombinase polymerase amplification (RPA)-based approach, developed for specifically detecting PstDV. The RPA protocol at 37 °C for 30 min showed no cross-reaction with other shrimp viruses, and was 10 times more sensitive than the 309F/R PCR protocol currently recommended by the World Organization for Animal Health (OIE) for PstDV diagnosis. These features, together with the simplicity of the protocol, requiring only a heating block for the reaction, offer opportunities for rapid and efficient detection of PstDV. Copyright © 2014 Elsevier Ltd. All rights reserved.

  18. Genome sequence of the pink-pigmented marine bacterium Loktanella hongkongensis type strain (UST950701-009P(T)), a representative of the Roseobacter group.

    PubMed

    Lau, Stanley Ck; Riedel, Thomas; Fiebig, Anne; Han, James; Huntemann, Marcel; Petersen, Jörn; Ivanova, Natalia N; Markowitz, Victor; Woyke, Tanja; Göker, Markus; Kyrpides, Nikos C; Klenk, Hans-Peter; Qian, Pei-Yuan

    2015-01-01

    Loktanella hongkongensis UST950701-009P(T) is a Gram-negative, non-motile and rod-shaped bacterium isolated from a marine biofilm in the subtropical seawater of Hong Kong. When growing as a monospecies biofilm on polystyrene surfaces, this bacterium is able to induce larval settlement and metamorphosis of a ubiquitous polychaete tubeworm Hydroides elegans. The inductive cues are low-molecular weight compounds bound to the exopolymeric matrix of the bacterial cells. In the present study we describe the features of L. hongkongensis strain DSM 17492(T) together with its genome sequence and annotation and novel aspects of its phenotype. The 3,198,444 bp long genome sequence encodes 3104 protein-coding genes and 57 RNA genes. The two unambiguously identified extrachromosomal replicons contain replication modules of the RepB and the Rhodobacteraceae-specific DnaA-like type, respectively.

  19. DNA sequence requirements for the accurate transcription of a protein-coding plastid gene in a plastid in vitro system from mustard (Sinapis alba L.)

    PubMed Central

    Link, Gerhard

    1984-01-01

    A nuclease-treated plastid extract from mustard (Sinapis alba L.) allows efficient transcription of cloned plastid DNA templates. In this in vitro system, the major runoff transcript of the truncated gene for the 32 000 mol. wt. photosystem II protein was accurately initiated from a site close to or identical with the in vivo start site. By using plasmids with deletions in the 5'-flanking region of this gene as templates, a DNA region required for efficient and selective initiation was detected ˜28-35 nucleotides upstream of the transcription start site. This region contains the sequence element TTGACA, which matches the consensus sequence for prokaryotic `−35' promoter elements. In the absence of this region, a region ˜13-27 nucleotides upstream of the start site still enables a basic level of specific transcription. This second region contains the sequence element TATATAA, which matches the consensus sequence for the `TATA' box of genes transcribed by RNA polymerase II (or B). The region between the `TATA'-like element and the transcription start site is not sufficient but may be required for specific transcription of the plastid gene. This latter region contains the sequence element TATACT, which resembles the prokaryotic `−10' (Pribnow) box. Based on the structural and transcriptional features of the 5' upstream region, a `promoter switch' mechanism is proposed, which may account for the developmentally regulated expression of this plastid gene. ImagesFig. 1.Fig. 2.Fig. 3.Fig. 4.Figure 5. PMID:16453540

  20. Evolution of bacterial-like phosphoprotein phosphatases in photosynthetic eukaryotes features ancestral mitochondrial or archaeal origin and possible lateral gene transfer.

    PubMed

    Uhrig, R Glen; Kerk, David; Moorhead, Greg B

    2013-12-01

    Protein phosphorylation is a reversible regulatory process catalyzed by the opposing reactions of protein kinases and phosphatases, which are central to the proper functioning of the cell. Dysfunction of members in either the protein kinase or phosphatase family can have wide-ranging deleterious effects in both metazoans and plants alike. Previously, three bacterial-like phosphoprotein phosphatase classes were uncovered in eukaryotes and named according to the bacterial sequences with which they have the greatest similarity: Shewanella-like (SLP), Rhizobiales-like (RLPH), and ApaH-like (ALPH) phosphatases. Utilizing the wealth of data resulting from recently sequenced complete eukaryotic genomes, we conducted database searching by hidden Markov models, multiple sequence alignment, and phylogenetic tree inference with Bayesian and maximum likelihood methods to elucidate the pattern of evolution of eukaryotic bacterial-like phosphoprotein phosphatase sequences, which are predominantly distributed in photosynthetic eukaryotes. We uncovered a pattern of ancestral mitochondrial (SLP and RLPH) or archaeal (ALPH) gene entry into eukaryotes, supplemented by possible instances of lateral gene transfer between bacteria and eukaryotes. In addition to the previously known green algal and plant SLP1 and SLP2 protein forms, a more ancestral third form (SLP3) was found in green algae. Data from in silico subcellular localization predictions revealed class-specific differences in plants likely to result in distinct functions, and for SLP sequences, distinctive and possibly functionally significant differences between plants and nonphotosynthetic eukaryotes. Conserved carboxyl-terminal sequence motifs with class-specific patterns of residue substitutions, most prominent in photosynthetic organisms, raise the possibility of complex interactions with regulatory proteins.

  1. A comparative study of family-specific protein-ligand complex affinity prediction based on random forest approach

    NASA Astrophysics Data System (ADS)

    Wang, Yu; Guo, Yanzhi; Kuang, Qifan; Pu, Xuemei; Ji, Yue; Zhang, Zhihang; Li, Menglong

    2015-04-01

    The assessment of binding affinity between ligands and the target proteins plays an essential role in drug discovery and design process. As an alternative to widely used scoring approaches, machine learning methods have also been proposed for fast prediction of the binding affinity with promising results, but most of them were developed as all-purpose models despite of the specific functions of different protein families, since proteins from different function families always have different structures and physicochemical features. In this study, we proposed a random forest method to predict the protein-ligand binding affinity based on a comprehensive feature set covering protein sequence, binding pocket, ligand structure and intermolecular interaction. Feature processing and compression was respectively implemented for different protein family datasets, which indicates that different features contribute to different models, so individual representation for each protein family is necessary. Three family-specific models were constructed for three important protein target families of HIV-1 protease, trypsin and carbonic anhydrase respectively. As a comparison, two generic models including diverse protein families were also built. The evaluation results show that models on family-specific datasets have the superior performance to those on the generic datasets and the Pearson and Spearman correlation coefficients ( R p and Rs) on the test sets are 0.740, 0.874, 0.735 and 0.697, 0.853, 0.723 for HIV-1 protease, trypsin and carbonic anhydrase respectively. Comparisons with the other methods further demonstrate that individual representation and model construction for each protein family is a more reasonable way in predicting the affinity of one particular protein family.

  2. Plastid: nucleotide-resolution analysis of next-generation sequencing and genomics data.

    PubMed

    Dunn, Joshua G; Weissman, Jonathan S

    2016-11-22

    Next-generation sequencing (NGS) informs many biological questions with unprecedented depth and nucleotide resolution. These assays have created a need for analytical tools that enable users to manipulate data nucleotide-by-nucleotide robustly and easily. Furthermore, because many NGS assays encode information jointly within multiple properties of read alignments - for example, in ribosome profiling, the locations of ribosomes are jointly encoded in alignment coordinates and length - analytical tools are often required to extract the biological meaning from the alignments before analysis. Many assay-specific pipelines exist for this purpose, but there remains a need for user-friendly, generalized, nucleotide-resolution tools that are not limited to specific experimental regimes or analytical workflows. Plastid is a Python library designed specifically for nucleotide-resolution analysis of genomics and NGS data. As such, Plastid is designed to extract assay-specific information from read alignments while retaining generality and extensibility to novel NGS assays. Plastid represents NGS and other biological data as arrays of values associated with genomic or transcriptomic positions, and contains configurable tools to convert data from a variety of sources to such arrays. Plastid also includes numerous tools to manipulate even discontinuous genomic features, such as spliced transcripts, with nucleotide precision. Plastid automatically handles conversion between genomic and feature-centric coordinates, accounting for splicing and strand, freeing users of burdensome accounting. Finally, Plastid's data models use consistent and familiar biological idioms, enabling even beginners to develop sophisticated analytical workflows with minimal effort. Plastid is a versatile toolkit that has been used to analyze data from multiple NGS assays, including RNA-seq, ribosome profiling, and DMS-seq. It forms the genomic engine of our ORF annotation tool, ORF-RATER, and is readily adapted to novel NGS assays. Examples, tutorials, and extensive documentation can be found at https://plastid.readthedocs.io .

  3. Fire detection system using random forest classification for image sequences of complex background

    NASA Astrophysics Data System (ADS)

    Kim, Onecue; Kang, Dong-Joong

    2013-06-01

    We present a fire alarm system based on image processing that detects fire accidents in various environments. To reduce false alarms that frequently appeared in earlier systems, we combined image features including color, motion, and blinking information. We specifically define the color conditions of fires in hue, saturation and value, and RGB color space. Fire features are represented as intensity variation, color mean and variance, motion, and image differences. Moreover, blinking fire features are modeled by using crossing patches. We propose an algorithm that classifies patches into fire or nonfire areas by using random forest supervised learning. We design an embedded surveillance device made with acrylonitrile butadiene styrene housing for stable fire detection in outdoor environments. The experimental results show that our algorithm works robustly in complex environments and is able to detect fires in real time.

  4. On the structural context and identification of enzyme catalytic residues.

    PubMed

    Chien, Yu-Tung; Huang, Shao-Wei

    2013-01-01

    Enzymes play important roles in most of the biological processes. Although only a small fraction of residues are directly involved in catalytic reactions, these catalytic residues are the most crucial parts in enzymes. The study of the fundamental and unique features of catalytic residues benefits the understanding of enzyme functions and catalytic mechanisms. In this work, we analyze the structural context of catalytic residues based on theoretical and experimental structure flexibility. The results show that catalytic residues have distinct structural features and context. Their neighboring residues, whether sequence or structure neighbors within specific range, are usually structurally more rigid than those of noncatalytic residues. The structural context feature is combined with support vector machine to identify catalytic residues from enzyme structure. The prediction results are better or comparable to those of recent structure-based prediction methods.

  5. Cancelable biometrics realization with multispace random projections.

    PubMed

    Teoh, Andrew Beng Jin; Yuang, Chong Tze

    2007-10-01

    Biometric characteristics cannot be changed; therefore, the loss of privacy is permanent if they are ever compromised. This paper presents a two-factor cancelable formulation, where the biometric data are distorted in a revocable but non-reversible manner by first transforming the raw biometric data into a fixed-length feature vector and then projecting the feature vector onto a sequence of random subspaces that were derived from a user-specific pseudorandom number (PRN). This process is revocable and makes replacing biometrics as easy as replacing PRNs. The formulation has been verified under a number of scenarios (normal, stolen PRN, and compromised biometrics scenarios) using 2400 Facial Recognition Technology face images. The diversity property is also examined.

  6. Quantum field theory and the linguistic Minimalist Program: a remarkable isomorphism

    NASA Astrophysics Data System (ADS)

    Piattelli-Palmarini, M.; Vitiello, G.

    2017-08-01

    By resorting to recent results, we show that an isomorphism exist between linguistic features of the Minimalist Program and the quantum field theory formalism of condensed matter physics. Specific linguistic features which admit a representation in terms of the many-body algebraic formalism are the unconstrained nature of recursive Merge, the operation of the Labeling Algorithm, the difference between pronounced and un-pronounced copies of elements in a sentence and the build-up of the Fibonacci sequence in the syntactic derivation of sentence structures. The collective dynamical nature of the formation process of Logical Forms leading to the individuation of the manifold of concepts and the computational self-consistency of languages are also discussed.

  7. Initial Characterization of the Pf-Int Recombinase from the Malaria Parasite Plasmodium falciparum

    PubMed Central

    Ghorbal, Mehdi; Scheidig-Benatar, Christine; Bouizem, Salma; Thomas, Christophe; Paisley, Genevieve; Faltermeier, Claire; Liu, Melanie; Scherf, Artur; Lopez-Rubio, Jose-Juan; Gopaul, Deshmukh N.

    2012-01-01

    Background Genetic variation is an essential means of evolution and adaptation in many organisms in response to environmental change. Certain DNA alterations can be carried out by site-specific recombinases (SSRs) that fall into two families: the serine and the tyrosine recombinases. SSRs are seldom found in eukaryotes. A gene homologous to a tyrosine site-specific recombinase has been identified in the genome of Plasmodium falciparum. The sequence is highly conserved among five other members of Plasmodia. Methodology/Principal Findings The predicted open reading frame encodes for a ∼57 kDa protein containing a C-terminal domain including the putative tyrosine recombinase conserved active site residues R-H-R-(H/W)-Y. The N-terminus has the typical alpha-helical bundle and potentially a mixed alpha-beta domain resembling that of λ-Int. Pf-Int mRNA is expressed differentially during the P. falciparum erythrocytic life stages, peaking in the schizont stage. Recombinant Pf-Int and affinity chromatography of DNA from genomic or synthetic origin were used to identify potential DNA targets after sequencing or micro-array hybridization. Interestingly, the sequences captured also included highly variable subtelomeric genes such as var, rif, and stevor sequences. Electrophoretic mobility shift assays with DNA were carried out to verify Pf-Int/DNA binding. Finally, Pf-Int knock-out parasites were created in order to investigate the biological role of Pf-Int. Conclusions/Significance Our data identify for the first time a malaria parasite gene with structural and functional features of recombinases. Pf-Int may bind to and alter DNA, either in a sequence specific or in a non-specific fashion, and may contribute to programmed or random DNA rearrangements. Pf-Int is the first molecular player identified with a potential role in genome plasticity in this pathogen. Finally, Pf-Int knock-out parasite is viable showing no detectable impact on blood stage development, which is compatible with such function. PMID:23056326

  8. A Segment of the Apospory-Specific Genomic Region Is Highly Microsyntenic Not Only between the Apomicts Pennisetum squamulatum and Buffelgrass, But Also with a Rice Chromosome 11 Centromeric-Proximal Genomic Region1[W

    PubMed Central

    Gualtieri, Gustavo; Conner, Joann A.; Morishige, Daryl T.; Moore, L. David; Mullet, John E.; Ozias-Akins, Peggy

    2006-01-01

    Bacterial artificial chromosome (BAC) clones from apomicts Pennisetum squamulatum and buffelgrass (Cenchrus ciliaris), isolated with the apospory-specific genomic region (ASGR) marker ugt197, were assembled into contigs that were extended by chromosome walking. Gene-like sequences from contigs were identified by shotgun sequencing and BLAST searches, and used to isolate orthologous rice contigs. Additional gene-like sequences in the apomicts' contigs were identified by bioinformatics using fully sequenced BACs from orthologous rice contigs as templates, as well as by interspecies, whole-contig cross-hybridizations. Hierarchical contig orthology was rapidly assessed by constructing detailed long-range contig molecular maps showing the distribution of gene-like sequences and markers, and searching for microsyntenic patterns of sequence identity and spatial distribution within and across species contigs. We found microsynteny between P. squamulatum and buffelgrass contigs. Importantly, this approach also enabled us to isolate from within the rice (Oryza sativa) genome contig Rice A, which shows the highest microsynteny and is most orthologous to the ugt197-containing C1C buffelgrass contig. Contig Rice A belongs to the rice genome database contig 77 (according to the current September 12, 2003, rice fingerprint contig build) that maps proximal to the chromosome 11 centromere, a feature that interestingly correlates with the mapping of ASGR-linked BACs proximal to the centromere or centromere-like sequences. Thus, relatedness between these two orthologous contigs is supported both by their molecular microstructure and by their centromeric-proximal location. Our discoveries promote the use of a microsynteny-based positional-cloning approach using the rice genome as a template to aid in constructing the ASGR toward the isolation of genes underlying apospory. PMID:16415213

  9. A segment of the apospory-specific genomic region is highly microsyntenic not only between the apomicts Pennisetum squamulatum and buffelgrass, but also with a rice chromosome 11 centromeric-proximal genomic region.

    PubMed

    Gualtieri, Gustavo; Conner, Joann A; Morishige, Daryl T; Moore, L David; Mullet, John E; Ozias-Akins, Peggy

    2006-03-01

    Bacterial artificial chromosome (BAC) clones from apomicts Pennisetum squamulatum and buffelgrass (Cenchrus ciliaris), isolated with the apospory-specific genomic region (ASGR) marker ugt197, were assembled into contigs that were extended by chromosome walking. Gene-like sequences from contigs were identified by shotgun sequencing and BLAST searches, and used to isolate orthologous rice contigs. Additional gene-like sequences in the apomicts' contigs were identified by bioinformatics using fully sequenced BACs from orthologous rice contigs as templates, as well as by interspecies, whole-contig cross-hybridizations. Hierarchical contig orthology was rapidly assessed by constructing detailed long-range contig molecular maps showing the distribution of gene-like sequences and markers, and searching for microsyntenic patterns of sequence identity and spatial distribution within and across species contigs. We found microsynteny between P. squamulatum and buffelgrass contigs. Importantly, this approach also enabled us to isolate from within the rice (Oryza sativa) genome contig Rice A, which shows the highest microsynteny and is most orthologous to the ugt197-containing C1C buffelgrass contig. Contig Rice A belongs to the rice genome database contig 77 (according to the current September 12, 2003, rice fingerprint contig build) that maps proximal to the chromosome 11 centromere, a feature that interestingly correlates with the mapping of ASGR-linked BACs proximal to the centromere or centromere-like sequences. Thus, relatedness between these two orthologous contigs is supported both by their molecular microstructure and by their centromeric-proximal location. Our discoveries promote the use of a microsynteny-based positional-cloning approach using the rice genome as a template to aid in constructing the ASGR toward the isolation of genes underlying apospory.

  10. A Bayesian Sampler for Optimization of Protein Domain Hierarchies

    PubMed Central

    2014-01-01

    Abstract The process of identifying and modeling functionally divergent subgroups for a specific protein domain class and arranging these subgroups hierarchically has, thus far, largely been done via manual curation. How to accomplish this automatically and optimally is an unsolved statistical and algorithmic problem that is addressed here via Markov chain Monte Carlo sampling. Taking as input a (typically very large) multiple-sequence alignment, the sampler creates and optimizes a hierarchy by adding and deleting leaf nodes, by moving nodes and subtrees up and down the hierarchy, by inserting or deleting internal nodes, and by redefining the sequences and conserved patterns associated with each node. All such operations are based on a probability distribution that models the conserved and divergent patterns defining each subgroup. When we view these patterns as sequence determinants of protein function, each node or subtree in such a hierarchy corresponds to a subgroup of sequences with similar biological properties. The sampler can be applied either de novo or to an existing hierarchy. When applied to 60 protein domains from multiple starting points in this way, it converged on similar solutions with nearly identical log-likelihood ratio scores, suggesting that it typically finds the optimal peak in the posterior probability distribution. Similarities and differences between independently generated, nearly optimal hierarchies for a given domain help distinguish robust from statistically uncertain features. Thus, a future application of the sampler is to provide confidence measures for various features of a domain hierarchy. PMID:24494927

  11. zUMIs - A fast and flexible pipeline to process RNA sequencing data with UMIs.

    PubMed

    Parekh, Swati; Ziegenhain, Christoph; Vieth, Beate; Enard, Wolfgang; Hellmann, Ines

    2018-06-01

    Single-cell RNA-sequencing (scRNA-seq) experiments typically analyze hundreds or thousands of cells after amplification of the cDNA. The high throughput is made possible by the early introduction of sample-specific bar codes (BCs), and the amplification bias is alleviated by unique molecular identifiers (UMIs). Thus, the ideal analysis pipeline for scRNA-seq data needs to efficiently tabulate reads according to both BC and UMI. zUMIs is a pipeline that can handle both known and random BCs and also efficiently collapse UMIs, either just for exon mapping reads or for both exon and intron mapping reads. If BC annotation is missing, zUMIs can accurately detect intact cells from the distribution of sequencing reads. Another unique feature of zUMIs is the adaptive downsampling function that facilitates dealing with hugely varying library sizes but also allows the user to evaluate whether the library has been sequenced to saturation. To illustrate the utility of zUMIs, we analyzed a single-nucleus RNA-seq dataset and show that more than 35% of all reads map to introns. Also, we show that these intronic reads are informative about expression levels, significantly increasing the number of detected genes and improving the cluster resolution. zUMIs flexibility makes if possible to accommodate data generated with any of the major scRNA-seq protocols that use BCs and UMIs and is the most feature-rich, fast, and user-friendly pipeline to process such scRNA-seq data.

  12. NetTurnP – Neural Network Prediction of Beta-turns by Use of Evolutionary Information and Predicted Protein Sequence Features

    PubMed Central

    Petersen, Bent; Lundegaard, Claus; Petersen, Thomas Nordahl

    2010-01-01

    β-turns are the most common type of non-repetitive structures, and constitute on average 25% of the amino acids in proteins. The formation of β-turns plays an important role in protein folding, protein stability and molecular recognition processes. In this work we present the neural network method NetTurnP, for prediction of two-class β-turns and prediction of the individual β-turn types, by use of evolutionary information and predicted protein sequence features. It has been evaluated against a commonly used dataset BT426, and achieves a Matthews correlation coefficient of 0.50, which is the highest reported performance on a two-class prediction of β-turn and not-β-turn. Furthermore NetTurnP shows improved performance on some of the specific β-turn types. In the present work, neural network methods have been trained to predict β-turn or not and individual β-turn types from the primary amino acid sequence. The individual β-turn types I, I', II, II', VIII, VIa1, VIa2, VIba and IV have been predicted based on classifications by PROMOTIF, and the two-class prediction of β-turn or not is a superset comprised of all β-turn types. The performance is evaluated using a golden set of non-homologous sequences known as BT426. Our two-class prediction method achieves a performance of: MCC  = 0.50, Qtotal = 82.1%, sensitivity  = 75.6%, PPV  = 68.8% and AUC  = 0.864. We have compared our performance to eleven other prediction methods that obtain Matthews correlation coefficients in the range of 0.17 – 0.47. For the type specific β-turn predictions, only type I and II can be predicted with reasonable Matthews correlation coefficients, where we obtain performance values of 0.36 and 0.31, respectively. Conclusion The NetTurnP method has been implemented as a webserver, which is freely available at http://www.cbs.dtu.dk/services/NetTurnP/. NetTurnP is the only available webserver that allows submission of multiple sequences. PMID:21152409

  13. NetTurnP--neural network prediction of beta-turns by use of evolutionary information and predicted protein sequence features.

    PubMed

    Petersen, Bent; Lundegaard, Claus; Petersen, Thomas Nordahl

    2010-11-30

    β-turns are the most common type of non-repetitive structures, and constitute on average 25% of the amino acids in proteins. The formation of β-turns plays an important role in protein folding, protein stability and molecular recognition processes. In this work we present the neural network method NetTurnP, for prediction of two-class β-turns and prediction of the individual β-turn types, by use of evolutionary information and predicted protein sequence features. It has been evaluated against a commonly used dataset BT426, and achieves a Matthews correlation coefficient of 0.50, which is the highest reported performance on a two-class prediction of β-turn and not-β-turn. Furthermore NetTurnP shows improved performance on some of the specific β-turn types. In the present work, neural network methods have been trained to predict β-turn or not and individual β-turn types from the primary amino acid sequence. The individual β-turn types I, I', II, II', VIII, VIa1, VIa2, VIba and IV have been predicted based on classifications by PROMOTIF, and the two-class prediction of β-turn or not is a superset comprised of all β-turn types. The performance is evaluated using a golden set of non-homologous sequences known as BT426. Our two-class prediction method achieves a performance of: MCC=0.50, Qtotal=82.1%, sensitivity=75.6%, PPV=68.8% and AUC=0.864. We have compared our performance to eleven other prediction methods that obtain Matthews correlation coefficients in the range of 0.17-0.47. For the type specific β-turn predictions, only type I and II can be predicted with reasonable Matthews correlation coefficients, where we obtain performance values of 0.36 and 0.31, respectively. The NetTurnP method has been implemented as a webserver, which is freely available at http://www.cbs.dtu.dk/services/NetTurnP/. NetTurnP is the only available webserver that allows submission of multiple sequences.

  14. SIMAP—the database of all-against-all protein sequence similarities and annotations with new interfaces and increased coverage

    PubMed Central

    Arnold, Roland; Goldenberg, Florian; Mewes, Hans-Werner; Rattei, Thomas

    2014-01-01

    The Similarity Matrix of Proteins (SIMAP, http://mips.gsf.de/simap/) database has been designed to massively accelerate computationally expensive protein sequence analysis tasks in bioinformatics. It provides pre-calculated sequence similarities interconnecting the entire known protein sequence universe, complemented by pre-calculated protein features and domains, similarity clusters and functional annotations. SIMAP covers all major public protein databases as well as many consistently re-annotated metagenomes from different repositories. As of September 2013, SIMAP contains >163 million proteins corresponding to ∼70 million non-redundant sequences. SIMAP uses the sensitive FASTA search heuristics, the Smith–Waterman alignment algorithm, the InterPro database of protein domain models and the BLAST2GO functional annotation algorithm. SIMAP assists biologists by facilitating the interactive exploration of the protein sequence universe. Web-Service and DAS interfaces allow connecting SIMAP with any other bioinformatic tool and resource. All-against-all protein sequence similarity matrices of project-specific protein collections are generated on request. Recent improvements allow SIMAP to cover the rapidly growing sequenced protein sequence universe. New Web-Service interfaces enhance the connectivity of SIMAP. Novel tools for interactive extraction of protein similarity networks have been added. Open access to SIMAP is provided through the web portal; the portal also contains instructions and links for software access and flat file downloads. PMID:24165881

  15. APRICOT: an integrated computational pipeline for the sequence-based identification and characterization of RNA-binding proteins.

    PubMed

    Sharan, Malvika; Förstner, Konrad U; Eulalio, Ana; Vogel, Jörg

    2017-06-20

    RNA-binding proteins (RBPs) have been established as core components of several post-transcriptional gene regulation mechanisms. Experimental techniques such as cross-linking and co-immunoprecipitation have enabled the identification of RBPs, RNA-binding domains (RBDs) and their regulatory roles in the eukaryotic species such as human and yeast in large-scale. In contrast, our knowledge of the number and potential diversity of RBPs in bacteria is poorer due to the technical challenges associated with the existing global screening approaches. We introduce APRICOT, a computational pipeline for the sequence-based identification and characterization of proteins using RBDs known from experimental studies. The pipeline identifies functional motifs in protein sequences using position-specific scoring matrices and Hidden Markov Models of the functional domains and statistically scores them based on a series of sequence-based features. Subsequently, APRICOT identifies putative RBPs and characterizes them by several biological properties. Here we demonstrate the application and adaptability of the pipeline on large-scale protein sets, including the bacterial proteome of Escherichia coli. APRICOT showed better performance on various datasets compared to other existing tools for the sequence-based prediction of RBPs by achieving an average sensitivity and specificity of 0.90 and 0.91 respectively. The command-line tool and its documentation are available at https://pypi.python.org/pypi/bio-apricot. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  16. Metavir 2: new tools for viral metagenome comparison and assembled virome analysis

    PubMed Central

    2014-01-01

    Background Metagenomics, based on culture-independent sequencing, is a well-fitted approach to provide insights into the composition, structure and dynamics of environmental viral communities. Following recent advances in sequencing technologies, new challenges arise for existing bioinformatic tools dedicated to viral metagenome (i.e. virome) analysis as (i) the number of viromes is rapidly growing and (ii) large genomic fragments can now be obtained by assembling the huge amount of sequence data generated for each metagenome. Results To face these challenges, a new version of Metavir was developed. First, all Metavir tools have been adapted to support comparative analysis of viromes in order to improve the analysis of multiple datasets. In addition to the sequence comparison previously provided, viromes can now be compared through their k-mer frequencies, their taxonomic compositions, recruitment plots and phylogenetic trees containing sequences from different datasets. Second, a new section has been specifically designed to handle assembled viromes made of thousands of large genomic fragments (i.e. contigs). This section includes an annotation pipeline for uploaded viral contigs (gene prediction, similarity search against reference viral genomes and protein domains) and an extensive comparison between contigs and reference genomes. Contigs and their annotations can be explored on the website through specifically developed dynamic genomic maps and interactive networks. Conclusions The new features of Metavir 2 allow users to explore and analyze viromes composed of raw reads or assembled fragments through a set of adapted tools and a user-friendly interface. PMID:24646187

  17. Predictive Bcl-2 Family Binding Models Rooted in Experiment or Structure

    PubMed Central

    DeBartolo, Joe; Dutta, Sanjib; Reich, Lothar; Keating, Amy E.

    2013-01-01

    Proteins of the Bcl-2 family either enhance or suppress programmed cell death and are centrally involved in cancer development and resistance to chemotherapy. BH3 (Bcl-2 homology 3)-only Bcl-2 proteins promote cell death by docking an α-helix into a hydrophobic groove on the surface of one or more of five pro-survival Bcl-2 receptor proteins. There is high structural homology within the pro-death and pro-survival families, yet a high degree of interaction specificity is nevertheless encoded, posing an interesting and important molecular recognition problem. Understanding protein features that dictate Bcl-2 interaction specificity is critical for designing peptide-based cancer therapeutics and diagnostics. In this study, we present peptide SPOT arrays and deep sequencing data from yeast display screening experiments that significantly expand the BH3 sequence space that has been experimentally tested for interaction with five human anti-apoptotic receptors. These data provide rich information about the determinants of Bcl-2 family specificity. To interpret and use the information, we constructed two simple data-based models that can predict affinity and specificity when evaluated on independent data sets within a limited sequence space. We also constructed a novel structure-based statistical potential, called STATIUM, which is remarkably good at predicting Bcl-2 affinity and specificity, especially considering it is not trained on experimental data. We compare the performance of our three models to each other and to alternative structure-based methods and discuss how such tools can guide prediction and design of new Bcl-2 family complexes. PMID:22617328

  18. A genomic view of food-related and probiotic Enterococcus strains.

    PubMed

    Bonacina, Julieta; Suárez, Nadia; Hormigo, Ricardo; Fadda, Silvina; Lechner, Marcus; Saavedra, Lucila

    2017-02-01

    The study of enterococcal genomes has grown considerably in recent years. While special attention is paid to comparative genomic analysis among clinical relevant isolates, in this study we performed an exhaustive comparative analysis of enterococcal genomes of food origin and/or with potential to be used as probiotics. Beyond common genetic features, we especially aimed to identify those that are specific to enterococcal strains isolated from a certain food-related source as well as features present in a species-specific manner. Thus, the genome sequences of 25 Enterococcus strains, from 7 different species, were examined and compared. Their phylogenetic relationship was reconstructed based on orthologous proteins and whole genomes. Likewise, markers associated with a successful colonization (bacteriocin genes and genomic islands) and genome plasticity (phages and clustered regularly interspaced short palindromic repeats) were investigated for lifestyle specific genetic features. At the same time, a search for antibiotic resistance genes was carried out, since they are of big concern in the food industry. Finally, it was possible to locate 1617 FIGfam families as a core proteome universally present among the genera and to determine that most of the accessory genes code for hypothetical proteins, providing reasonable hints to support their functional characterization. © The Author 2016. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.

  19. Robot Acting on Moving Bodies (RAMBO): Interaction with tumbling objects

    NASA Technical Reports Server (NTRS)

    Davis, Larry S.; Dementhon, Daniel; Bestul, Thor; Ziavras, Sotirios; Srinivasan, H. V.; Siddalingaiah, Madhu; Harwood, David

    1989-01-01

    Interaction with tumbling objects will become more common as human activities in space expand. Attempting to interact with a large complex object translating and rotating in space, a human operator using only his visual and mental capacities may not be able to estimate the object motion, plan actions or control those actions. A robot system (RAMBO) equipped with a camera, which, given a sequence of simple tasks, can perform these tasks on a tumbling object, is being developed. RAMBO is given a complete geometric model of the object. A low level vision module extracts and groups characteristic features in images of the object. The positions of the object are determined in a sequence of images, and a motion estimate of the object is obtained. This motion estimate is used to plan trajectories of the robot tool to relative locations rearby the object sufficient for achieving the tasks. More specifically, low level vision uses parallel algorithms for image enhancement by symmetric nearest neighbor filtering, edge detection by local gradient operators, and corner extraction by sector filtering. The object pose estimation is a Hough transform method accumulating position hypotheses obtained by matching triples of image features (corners) to triples of model features. To maximize computing speed, the estimate of the position in space of a triple of features is obtained by decomposing its perspective view into a product of rotations and a scaled orthographic projection. This allows use of 2-D lookup tables at each stage of the decomposition. The position hypotheses for each possible match of model feature triples and image feature triples are calculated in parallel. Trajectory planning combines heuristic and dynamic programming techniques. Then trajectories are created using dynamic interpolations between initial and goal trajectories. All the parallel algorithms run on a Connection Machine CM-2 with 16K processors.

  20. A smart phone-based pocket fall accident detection, positioning, and rescue system.

    PubMed

    Kau, Lih-Jen; Chen, Chih-Sheng

    2015-01-01

    We propose in this paper a novel algorithm as well as architecture for the fall accident detection and corresponding wide area rescue system based on a smart phone and the third generation (3G) networks. To realize the fall detection algorithm, the angles acquired by the electronic compass (ecompass) and the waveform sequence of the triaxial accelerometer on the smart phone are used as the system inputs. The acquired signals are then used to generate an ordered feature sequence and then examined in a sequential manner by the proposed cascade classifier for recognition purpose. Once the corresponding feature is verified by the classifier at current state, it can proceed to next state; otherwise, the system will reset to the initial state and wait for the appearance of another feature sequence. Once a fall accident event is detected, the user's position can be acquired by the global positioning system (GPS) or the assisted GPS, and sent to the rescue center via the 3G communication network so that the user can get medical help immediately. With the proposed cascaded classification architecture, the computational burden and power consumption issue on the smart phone system can be alleviated. Moreover, as we will see in the experiment that a distinguished fall accident detection accuracy up to 92% on the sensitivity and 99.75% on the specificity can be obtained when a set of 450 test actions in nine different kinds of activities are estimated by using the proposed cascaded classifier, which justifies the superiority of the proposed algorithm.

  1. Do pattern recognition skills transfer across sports? A preliminary analysis.

    PubMed

    Smeeton, Nicholas J; Ward, Paul; Williams, A Mark

    2004-02-01

    The ability to recognize patterns of play is fundamental to performance in team sports. While typically assumed to be domain-specific, pattern recognition skills may transfer from one sport to another if similarities exist in the perceptual features and their relations and/or the strategies used to encode and retrieve relevant information. A transfer paradigm was employed to compare skilled and less skilled soccer, field hockey and volleyball players' pattern recognition skills. Participants viewed structured and unstructured action sequences from each sport, half of which were randomly represented with clips not previously seen. The task was to identify previously viewed action sequences quickly and accurately. Transfer of pattern recognition skill was dependent on the participant's skill, sport practised, nature of the task and degree of structure. The skilled soccer and hockey players were quicker than the skilled volleyball players at recognizing structured soccer and hockey action sequences. Performance differences were not observed on the structured volleyball trials between the skilled soccer, field hockey and volleyball players. The skilled field hockey and soccer players were able to transfer perceptual information or strategies between their respective sports. The less skilled participants' results were less clear. Implications for domain-specific expertise, transfer and diversity across domains are discussed.

  2. Exome Sequencing Discerns Syndromes in Patients from Consanguineous Families with Congenital Anomalies of the Kidneys and Urinary Tract

    PubMed Central

    Vivante, Asaf; Hwang, Daw-Yang; Kohl, Stefan; Chen, Jing; Shril, Shirlee; Schulz, Julian; van der Ven, Amelie; Daouk, Ghaleb; Soliman, Neveen A.; Kumar, Aravind Selvin; Senguttuvan, Prabha; Kehinde, Elijah O.; Tasic, Velibor

    2017-01-01

    Congenital anomalies of the kidneys and urinary tract (CAKUT) are the leading cause of CKD in children, featuring a broad variety of malformations. A monogenic cause can be detected in around 12% of patients. However, the morphologic clinical phenotype of CAKUT frequently does not indicate specific genes to be examined. To determine the likelihood of detecting causative recessive mutations by whole-exome sequencing (WES), we analyzed individuals with CAKUT from 33 different consanguineous families. Using homozygosity mapping and WES, we identified the causative mutations in nine of the 33 families studied (27%). We detected recessive mutations in nine known disease–causing genes: ZBTB24, WFS1, HPSE2, ATRX, ASPH, AGXT, AQP2, CTNS, and PKHD1. Notably, when mutated, these genes cause multiorgan syndromes that may include CAKUT as a feature (syndromic CAKUT) or cause renal diseases that may manifest as phenocopies of CAKUT. None of the above monogenic disease–causing genes were suspected on clinical grounds before this study. Follow-up clinical characterization of those patients allowed us to revise and detect relevant new clinical features in a more appropriate pathogenetic context. Thus, applying WES to the diagnostic approach in CAKUT provides opportunities for an accurate and early etiology–based diagnosis and improved clinical management. PMID:27151922

  3. Accelerating next generation sequencing data analysis with system level optimizations.

    PubMed

    Kathiresan, Nagarajan; Temanni, Ramzi; Almabrazi, Hakeem; Syed, Najeeb; Jithesh, Puthen V; Al-Ali, Rashid

    2017-08-22

    Next generation sequencing (NGS) data analysis is highly compute intensive. In-memory computing, vectorization, bulk data transfer, CPU frequency scaling are some of the hardware features in the modern computing architectures. To get the best execution time and utilize these hardware features, it is necessary to tune the system level parameters before running the application. We studied the GATK-HaplotypeCaller which is part of common NGS workflows, that consume more than 43% of the total execution time. Multiple GATK 3.x versions were benchmarked and the execution time of HaplotypeCaller was optimized by various system level parameters which included: (i) tuning the parallel garbage collection and kernel shared memory to simulate in-memory computing, (ii) architecture-specific tuning in the PairHMM library for vectorization, (iii) including Java 1.8 features through GATK source code compilation and building a runtime environment for parallel sorting and bulk data transfer (iv) the default 'on-demand' mode of CPU frequency is over-clocked by using 'performance-mode' to accelerate the Java multi-threads. As a result, the HaplotypeCaller execution time was reduced by 82.66% in GATK 3.3 and 42.61% in GATK 3.7. Overall, the execution time of NGS pipeline was reduced to 70.60% and 34.14% for GATK 3.3 and GATK 3.7 respectively.

  4. Impact of sequencing depth and read length on single cell RNA sequencing data of T cells.

    PubMed

    Rizzetto, Simone; Eltahla, Auda A; Lin, Peijie; Bull, Rowena; Lloyd, Andrew R; Ho, Joshua W K; Venturi, Vanessa; Luciani, Fabio

    2017-10-06

    Single cell RNA sequencing (scRNA-seq) provides great potential in measuring the gene expression profiles of heterogeneous cell populations. In immunology, scRNA-seq allowed the characterisation of transcript sequence diversity of functionally relevant T cell subsets, and the identification of the full length T cell receptor (TCRαβ), which defines the specificity against cognate antigens. Several factors, e.g. RNA library capture, cell quality, and sequencing output affect the quality of scRNA-seq data. We studied the effects of read length and sequencing depth on the quality of gene expression profiles, cell type identification, and TCRαβ reconstruction, utilising 1,305 single cells from 8 publically available scRNA-seq datasets, and simulation-based analyses. Gene expression was characterised by an increased number of unique genes identified with short read lengths (<50 bp), but these featured higher technical variability compared to profiles from longer reads. Successful TCRαβ reconstruction was achieved for 6 datasets (81% - 100%) with at least 0.25 millions (PE) reads of length >50 bp, while it failed for datasets with <30 bp reads. Sufficient read length and sequencing depth can control technical noise to enable accurate identification of TCRαβ and gene expression profiles from scRNA-seq data of T cells.

  5. The impact of visual sequencing of graphic symbols on the sentence construction output of children who have acquired language.

    PubMed

    Alant, Erna; du Plooy, Amelia; Dada, Shakila

    2007-01-01

    Although the sequence of graphic or pictorial symbols displayed on a communication board can have an impact on the language output of children, very little research has been conducted to describe this. Research in this area is particularly relevant for prioritising the importance of specific visual and graphic features in providing more effective and user-friendly access to communication boards. This study is concerned with understanding the impact ofspecific sequences of graphic symbol input on the graphic and spoken output of children who have acquired language. Forty participants were divided into two comparable groups. Each group was exposed to graphic symbol input with a certain word order sequence. The structure of input was either in typical English word order sequence Subject- Verb-Object (SVO) or in the word order sequence of Subject-Object-Verb (SOV). Both input groups had to answer six questions by using graphic output as well as speech. The findings indicated that there are significant differences in the PCS graphic output patterns of children who are exposed to graphic input in the SOV and SVO sequences. Furthermore, the output produced in the graphic mode differed considerably to the output produced in the spoken mode. Clinical implications of these findings are discussed

  6. Gene finding in metatranscriptomic sequences.

    PubMed

    Ismail, Wazim Mohammed; Ye, Yuzhen; Tang, Haixu

    2014-01-01

    Metatranscriptomic sequencing is a highly sensitive bioassay of functional activity in a microbial community, providing complementary information to the metagenomic sequencing of the community. The acquisition of the metatranscriptomic sequences will enable us to refine the annotations of the metagenomes, and to study the gene activities and their regulation in complex microbial communities and their dynamics. In this paper, we present TransGeneScan, a software tool for finding genes in assembled transcripts from metatranscriptomic sequences. By incorporating several features of metatranscriptomic sequencing, including strand-specificity, short intergenic regions, and putative antisense transcripts into a Hidden Markov Model, TranGeneScan can predict a sense transcript containing one or multiple genes (in an operon) or an antisense transcript. We tested TransGeneScan on a mock metatranscriptomic data set containing three known bacterial genomes. The results showed that TranGeneScan performs better than metagenomic gene finders (MetaGeneMark and FragGeneScan) on predicting protein coding genes in assembled transcripts, and achieves comparable or even higher accuracy than gene finders for microbial genomes (Glimmer and GeneMark). These results imply, with the assistance of metatranscriptomic sequencing, we can obtain a broad and precise picture about the genes (and their functions) in a microbial community. TransGeneScan is available as open-source software on SourceForge at https://sourceforge.net/projects/transgenescan/.

  7. Sma3s: a three-step modular annotator for large sequence datasets.

    PubMed

    Muñoz-Mérida, Antonio; Viguera, Enrique; Claros, M Gonzalo; Trelles, Oswaldo; Pérez-Pulido, Antonio J

    2014-08-01

    Automatic sequence annotation is an essential component of modern 'omics' studies, which aim to extract information from large collections of sequence data. Most existing tools use sequence homology to establish evolutionary relationships and assign putative functions to sequences. However, it can be difficult to define a similarity threshold that achieves sufficient coverage without sacrificing annotation quality. Defining the correct configuration is critical and can be challenging for non-specialist users. Thus, the development of robust automatic annotation techniques that generate high-quality annotations without needing expert knowledge would be very valuable for the research community. We present Sma3s, a tool for automatically annotating very large collections of biological sequences from any kind of gene library or genome. Sma3s is composed of three modules that progressively annotate query sequences using either: (i) very similar homologues, (ii) orthologous sequences or (iii) terms enriched in groups of homologous sequences. We trained the system using several random sets of known sequences, demonstrating average sensitivity and specificity values of ~85%. In conclusion, Sma3s is a versatile tool for high-throughput annotation of a wide variety of sequence datasets that outperforms the accuracy of other well-established annotation algorithms, and it can enrich existing database annotations and uncover previously hidden features. Importantly, Sma3s has already been used in the functional annotation of two published transcriptomes. © The Author 2014. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.

  8. Prediction of site-specific interactions in antibody-antigen complexes: the proABC method and server.

    PubMed

    Olimpieri, Pier Paolo; Chailyan, Anna; Tramontano, Anna; Marcatili, Paolo

    2013-09-15

    Antibodies or immunoglobulins are proteins of paramount importance in the immune system. They are extremely relevant as diagnostic, biotechnological and therapeutic tools. Their modular structure makes it easy to re-engineer them for specific purposes. Short of undergoing a trial and error process, these experiments, as well as others, need to rely on an understanding of the specific determinants of the antibody binding mode. In this article, we present a method to identify, on the basis of the antibody sequence alone, which residues of an antibody directly interact with its cognate antigen. The method, based on the random forest automatic learning techniques, reaches a recall and specificity as high as 80% and is implemented as a free and easy-to-use server, named prediction of Antibody Contacts. We believe that it can be of great help in re-design experiments as well as a guide for molecular docking experiments. The results that we obtained also allowed us to dissect which features of the antibody sequence contribute most to the involvement of specific residues in binding to the antigen. http://www.biocomputing.it/proABC. anna.tramontano@uniroma1.it or paolo.marcatili@gmail.com Supplementary data are available at Bioinformatics online.

  9. Human lactoferricin derived di-peptides deploying loop structures induce apoptosis specifically in cancer cells through targeting membranous phosphatidylserine.

    PubMed

    Riedl, Sabrina; Leber, Regina; Rinner, Beate; Schaider, Helmut; Lohner, Karl; Zweytick, Dagmar

    2015-11-01

    Host defense-derived peptides have emerged as a novel strategy for the development of alternative anticancer therapies. In this study we report on characteristic features of human lactoferricin (hLFcin) derivatives which facilitate specific killing of cancer cells of melanoma, glioblastoma and rhabdomyosarcoma compared with non-specific derivatives and the synthetic peptide RW-AH. Changes in amino acid sequence of hLFcin providing 9-11 amino acids stretched derivatives LF11-316, -318 and -322 only yielded low antitumor activity. However, the addition of the repeat (di-peptide) and the retro-repeat (di-retro-peptide) sequences highly improved cancer cell toxicity up to 100% at 20 μM peptide concentration. Compared to the complete parent sequence hLFcin the derivatives showed toxicity on the melanoma cell line A375 increased by 10-fold and on the glioblastoma cell line U-87mg by 2-3-fold. Reduced killing velocity, apoptotic blebbing, activation of caspase 3/7 and formation of apoptotic DNA fragments proved that the active and cancer selective peptides, e.g. R-DIM-P-LF11-322, trigger apoptosis, whereas highly active, though non-selective peptides, such as DIM-LF11-318 and RW-AH seem to kill rapidly via necrosis inducing membrane lyses. Structural studies revealed specific toxicity on cancer cells by peptide derivatives with loop structures, whereas non-specific peptides comprised α-helical structures without loop. Model studies with the cancer membrane mimic phosphatidylserine (PS) gave strong evidence that PS only exposed by cancer cells is an important target for specific hLFcin derivatives. Other negatively charged membrane exposed molecules as sialic acid, heparan and chondroitin sulfate were shown to have minor impact on peptide activity. Copyright © 2015. Published by Elsevier B.V.

  10. Predicting protein-protein interactions by combing various sequence- derived features into the general form of Chou's Pseudo amino acid composition.

    PubMed

    Zhao, Xiao-Wei; Ma, Zhi-Qiang; Yin, Ming-Hao

    2012-05-01

    Knowledge of protein-protein interactions (PPIs) plays an important role in constructing protein interaction networks and understanding the general machineries of biological systems. In this study, a new method is proposed to predict PPIs using a comprehensive set of 930 features based only on sequence information, these features measure the interactions between residues a certain distant apart in the protein sequences from different aspects. To achieve better performance, the principal component analysis (PCA) is first employed to obtain an optimized feature subset. Then, the resulting 67-dimensional feature vectors are fed to Support Vector Machine (SVM). Experimental results on Drosophila melanogaster and Helicobater pylori datasets show that our method is very promising to predict PPIs and may at least be a useful supplement tool to existing methods.

  11. Detecting novel genes with sparse arrays

    PubMed Central

    Haiminen, Niina; Smit, Bart; Rautio, Jari; Vitikainen, Marika; Wiebe, Marilyn; Martinez, Diego; Chee, Christine; Kunkel, Joe; Sanchez, Charles; Nelson, Mary Anne; Pakula, Tiina; Saloheimo, Markku; Penttilä, Merja; Kivioja, Teemu

    2014-01-01

    Species-specific genes play an important role in defining the phenotype of an organism. However, current gene prediction methods can only efficiently find genes that share features such as sequence similarity or general sequence characteristics with previously known genes. Novel sequencing methods and tiling arrays can be used to find genes without prior information and they have demonstrated that novel genes can still be found from extensively studied model organisms. Unfortunately, these methods are expensive and thus are not easily applicable, e.g., to finding genes that are expressed only in very specific conditions. We demonstrate a method for finding novel genes with sparse arrays, applying it on the 33.9 Mb genome of the filamentous fungus Trichoderma reesei. Our computational method does not require normalisations between arrays and it takes into account the multiple-testing problem typical for analysis of microarray data. In contrast to tiling arrays, that use overlapping probes, only one 25mer microarray oligonucleotide probe was used for every 100 b. Thus, only relatively little space on a microarray slide was required to cover the intergenic regions of a genome. The analysis was done as a by-product of a conventional microarray experiment with no additional costs. We found at least 23 good candidates for novel transcripts that could code for proteins and all of which were expressed at high levels. Candidate genes were found to neighbour ire1 and cre1 and many other regulatory genes. Our simple, low-cost method can easily be applied to finding novel species-specific genes without prior knowledge of their sequence properties. PMID:20691772

  12. Robust sensorimotor representation to physical interaction changes in humanoid motion learning.

    PubMed

    Shimizu, Toshihiko; Saegusa, Ryo; Ikemoto, Shuhei; Ishiguro, Hiroshi; Metta, Giorgio

    2015-05-01

    This paper proposes a learning from demonstration system based on a motion feature, called phase transfer sequence. The system aims to synthesize the knowledge on humanoid whole body motions learned during teacher-supported interactions, and apply this knowledge during different physical interactions between a robot and its surroundings. The phase transfer sequence represents the temporal order of the changing points in multiple time sequences. It encodes the dynamical aspects of the sequences so as to absorb the gaps in timing and amplitude derived from interaction changes. The phase transfer sequence was evaluated in reinforcement learning of sitting-up and walking motions conducted by a real humanoid robot and compatible simulator. In both tasks, the robotic motions were less dependent on physical interactions when learned by the proposed feature than by conventional similarity measurements. Phase transfer sequence also enhanced the convergence speed of motion learning. Our proposed feature is original primarily because it absorbs the gaps caused by changes of the originally acquired physical interactions, thereby enhancing the learning speed in subsequent interactions.

  13. Sequence-specific 1H-NMR assignments for the aromatic region of several biologically active, monomeric insulins including native human insulin.

    PubMed

    Roy, M; Lee, R W; Kaarsholm, N C; Thøgersen, H; Brange, J; Dunn, M F

    1990-06-12

    The aromatic region of the 1H-FT-NMR spectrum of the biologically fully-potent, monomeric human insulin mutant, B9 Ser----Asp, B27 Thr----Glu has been investigated in D2O. At 1 to 5 mM concentrations, this mutant insulin is monomeric above pH 7.5. Coupling and amino acid classification of all aromatic signals is established via a combination of homonuclear one- and two-dimensional methods, including COSY, multiple quantum filters, selective spin decoupling and pH titrations. By comparisons with other insulin mutants and with chemically modified native insulins, all resonances in the aromatic region are given sequence-specific assignments without any reliance on the various crystal structures reported for insulin. These comparisons also give the sequence-specific assignments of most of the aromatic resonances of the mutant insulins B16 Tyr----Glu, B27 Thr----Glu and B25 Phe----Asp and the chemically modified species des-(B23-B30) insulin and monoiodo-Tyr A14 insulin. Chemical dispersion of the assigned resonances, ring current perturbations and comparisons at high pH have made possible the assignment of the aromatic resonances of human insulin, and these studies indicate that the major structural features of the human insulin monomer (including those critical to biological function) are also present in the monomeric mutant.

  14. A frontal but not parietal neural correlate of auditory consciousness.

    PubMed

    Brancucci, Alfredo; Lugli, Victor; Perrucci, Mauro Gianni; Del Gratta, Cosimo; Tommasi, Luca

    2016-01-01

    Hemodynamic correlates of consciousness were investigated in humans during the presentation of a dichotic sequence inducing illusory auditory percepts with features analogous to visual multistability. The sequence consisted of a variation of the original stimulation eliciting the Deutsch's octave illusion, created to maintain a stable illusory percept long enough to allow the detection of the underlying hemodynamic activity using functional magnetic resonance imaging (fMRI). Two specular 500 ms dichotic stimuli (400 and 800 Hz) presented in alternation by means of earphones cause an illusory segregation of pitch and ear of origin which can yield up to four different auditory percepts per dichotic stimulus. Such percepts are maintained stable when one of the two dichotic stimuli is presented repeatedly for 6 s, immediately after the alternation. We observed hemodynamic activity specifically accompanying conscious experience of pitch in a bilateral network including the superior frontal gyrus (SFG, BA9 and BA10), medial frontal gyrus (BA6 and BA9), insula (BA13), and posterior lateral nucleus of the thalamus. Conscious experience of side (ear of origin) was instead specifically accompanied by bilateral activity in the MFG (BA6), STG (BA41), parahippocampal gyrus (BA28), and insula (BA13). These results suggest that the neural substrate of auditory consciousness, differently from that of visual consciousness, may rest upon a fronto-temporal rather than upon a fronto-parietal network. Moreover, they indicate that the neural correlates of consciousness depend on the specific features of the stimulus and suggest the SFG-MFG and the insula as important cortical nodes for auditory conscious experience.

  15. WebLogo: A Sequence Logo Generator

    PubMed Central

    Crooks, Gavin E.; Hon, Gary; Chandonia, John-Marc; Brenner, Steven E.

    2004-01-01

    WebLogo generates sequence logos, graphical representations of the patterns within a multiple sequence alignment. Sequence logos provide a richer and more precise description of sequence similarity than consensus sequences and can rapidly reveal significant features of the alignment otherwise difficult to perceive. Each logo consists of stacks of letters, one stack for each position in the sequence. The overall height of each stack indicates the sequence conservation at that position (measured in bits), whereas the height of symbols within the stack reflects the relative frequency of the corresponding amino or nucleic acid at that position. WebLogo has been enhanced recently with additional features and options, to provide a convenient and highly configurable sequence logo generator. A command line interface and the complete, open WebLogo source code are available for local installation and customization. PMID:15173120

  16. Features of the bronchial bacterial microbiome associated with atopy, asthma, and responsiveness to inhaled corticosteroid treatment.

    PubMed

    Durack, Juliana; Lynch, Susan V; Nariya, Snehal; Bhakta, Nirav R; Beigelman, Avraham; Castro, Mario; Dyer, Anne-Marie; Israel, Elliot; Kraft, Monica; Martin, Richard J; Mauger, David T; Rosenberg, Sharon R; Sharp-King, Tonya; White, Steven R; Woodruff, Prescott G; Avila, Pedro C; Denlinger, Loren C; Holguin, Fernando; Lazarus, Stephen C; Lugogo, Njira; Moore, Wendy C; Peters, Stephen P; Que, Loretta; Smith, Lewis J; Sorkness, Christine A; Wechsler, Michael E; Wenzel, Sally E; Boushey, Homer A; Huang, Yvonne J

    2017-07-01

    Compositional differences in the bronchial bacterial microbiota have been associated with asthma, but it remains unclear whether the findings are attributable to asthma, to aeroallergen sensitization, or to inhaled corticosteroid treatment. We sought to compare the bronchial bacterial microbiota in adults with steroid-naive atopic asthma, subjects with atopy but no asthma, and nonatopic healthy control subjects and to determine relationships of the bronchial microbiota to phenotypic features of asthma. Bacterial communities in protected bronchial brushings from 42 atopic asthmatic subjects, 21 subjects with atopy but no asthma, and 21 healthy control subjects were profiled by using 16S rRNA gene sequencing. Bacterial composition and community-level functions inferred from sequence profiles were analyzed for between-group differences. Associations with clinical and inflammatory variables were examined, including markers of type 2-related inflammation and change in airway hyperresponsiveness after 6 weeks of fluticasone treatment. The bronchial microbiome differed significantly among the 3 groups. Asthmatic subjects were uniquely enriched in members of the Haemophilus, Neisseria, Fusobacterium, and Porphyromonas species and the Sphingomonodaceae family and depleted in members of the Mogibacteriaceae family and Lactobacillales order. Asthma-associated differences in predicted bacterial functions included involvement of amino acid and short-chain fatty acid metabolism pathways. Subjects with type 2-high asthma harbored significantly lower bronchial bacterial burden. Distinct changes in specific microbiota members were seen after fluticasone treatment. Steroid responsiveness was linked to differences in baseline compositional and functional features of the bacterial microbiome. Even in subjects with mild steroid-naive asthma, differences in the bronchial microbiome are associated with immunologic and clinical features of the disease. The specific differences identified suggest possible microbiome targets for future approaches to asthma treatment or prevention. Copyright © 2016 American Academy of Allergy, Asthma & Immunology. Published by Elsevier Inc. All rights reserved.

  17. Properties of genes essential for mouse development

    PubMed Central

    Kabir, Mitra; Barradas, Ana; Tzotzos, George T.; Hentges, Kathryn E.

    2017-01-01

    Essential genes are those that are critical for life. In the specific case of the mouse, they are the set of genes whose deletion means that a mouse is unable to survive after birth. As such, they are the key minimal set of genes needed for all the steps of development to produce an organism capable of life ex utero. We explored a wide range of sequence and functional features to characterise essential (lethal) and non-essential (viable) genes in mice. Experimental data curated manually identified 1301 essential genes and 3451 viable genes. Very many sequence features show highly significant differences between essential and viable mouse genes. Essential genes generally encode complex proteins, with multiple domains and many introns. These genes tend to be: long, highly expressed, old and evolutionarily conserved. These genes tend to encode ligases, transferases, phosphorylated proteins, intracellular proteins, nuclear proteins, and hubs in protein-protein interaction networks. They are involved with regulating protein-protein interactions, gene expression and metabolic processes, cell morphogenesis, cell division, cell proliferation, DNA replication, cell differentiation, DNA repair and transcription, cell differentiation and embryonic development. Viable genes tend to encode: membrane proteins or secreted proteins, and are associated with functions such as cellular communication, apoptosis, behaviour and immune response, as well as housekeeping and tissue specific functions. Viable genes are linked to transport, ion channels, signal transduction, calcium binding and lipid binding, consistent with their location in membranes and involvement with cell-cell communication. From the analysis of the composite features of essential and viable genes, we conclude that essential genes tend to be required for intracellular functions, and viable genes tend to be involved with extracellular functions and cell-cell communication. Knowledge of the features that are over-represented in essential genes allows for a deeper understanding of the functions and processes implemented during mammalian development. PMID:28562614

  18. Monoterpene and sesquiterpene synthases and the origin of terpene skeletal diversity in plants.

    PubMed

    Degenhardt, Jörg; Köllner, Tobias G; Gershenzon, Jonathan

    2009-01-01

    The multitude of terpene carbon skeletons in plants is formed by enzymes known as terpene synthases. This review covers the monoterpene and sesquiterpene synthases presenting an up-to-date list of enzymes reported and evidence for their ability to form multiple products. The reaction mechanisms of these enzyme classes are described, and information on how terpene synthase proteins mediate catalysis is summarized. Correlations between specific amino acid motifs and terpene synthase function are described, including an analysis of the relationships between active site sequence and cyclization type and a discussion of whether specific protein features might facilitate multiple product formation.

  19. Recent Progress in Aptamer-Based Functional Probes for Bioanalysis and Biomedicine.

    PubMed

    Zhang, Huimin; Zhou, Leiji; Zhu, Zhi; Yang, Chaoyong

    2016-07-11

    Nucleic acid aptamers are short synthetic DNA or RNA sequences that can bind to a wide range of targets with high affinity and specificity. In recent years, aptamers have attracted increasing research interest due to their unique features of high binding affinity and specificity, small size, excellent chemical stability, easy chemical synthesis, facile modification, and minimal immunogenicity. These properties make aptamers ideal recognition ligands for bioanalysis, disease diagnosis, and cancer therapy. This review highlights the recent progress in aptamer selection and the latest applications of aptamer-based functional probes in the fields of bioanalysis and biomedicine. © 2016 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  20. IL-4 function can be transferred to the IL-2 receptor by tyrosine containing sequences found in the IL-4 receptor alpha chain.

    PubMed

    Wang, H Y; Paul, W E; Keegan, A D

    1996-02-01

    IL-4 binds to a cell surface receptor complex that consists of the IL-4 binding protein (IL-4R alpha) and the gamma chain of the IL-2 receptor complex (gamma c). The receptors for IL-4 and IL-2 have several features in common; both use the gamma c as a receptor component, and both activate the Janus kinases JAK-1 and JAK-3. In spite of these similarities, IL-4 evokes specific responses, including the tyrosine phosphorylation of 4PS/IRS-2 and the induction of CD23. To determine whether sequences within the cytoplasmic domain of the IL-4R alpha specify these IL-4-specific responses, we transplanted the insulin IL-4 receptor motif (I4R motif) of the huIL-4R alpha to the cytoplasmic domain of a truncated IL-2R beta. In addition, we transplanted a region that contains peptide sequences shown to block Stat6 binding to DNA. We analyzed the ability of cells expressing these IL-2R-IL-4R chimeric constructs to respond to IL-2. We found that IL-4 function could be transplanted to the IL-2 receptor by these regions and that proliferative and differentiative functions can be induced by different receptor sequences.

  1. Identifying N6-methyladenosine sites using multi-interval nucleotide pair position specificity and support vector machine

    NASA Astrophysics Data System (ADS)

    Xing, Pengwei; Su, Ran; Guo, Fei; Wei, Leyi

    2017-04-01

    N6-methyladenosine (m6A) refers to methylation of the adenosine nucleotide acid at the nitrogen-6 position. It plays an important role in a series of biological processes, such as splicing events, mRNA exporting, nascent mRNA synthesis, nuclear translocation and translation process. Numerous experiments have been done to successfully characterize m6A sites within sequences since high-resolution mapping of m6A sites was established. However, as the explosive growth of genomic sequences, using experimental methods to identify m6A sites are time-consuming and expensive. Thus, it is highly desirable to develop fast and accurate computational identification methods. In this study, we propose a sequence-based predictor called RAM-NPPS for identifying m6A sites within RNA sequences, in which we present a novel feature representation algorithm based on multi-interval nucleotide pair position specificity, and use support vector machine classifier to construct the prediction model. Comparison results show that our proposed method outperforms the state-of-the-art predictors on three benchmark datasets across the three species, indicating the effectiveness and robustness of our method. Moreover, an online webserver implementing the proposed predictor has been established at http://server.malab.cn/RAM-NPPS/. It is anticipated to be a useful prediction tool to assist biologists to reveal the mechanisms of m6A site functions.

  2. Evolution of glutamine amidotransferase genes. Nucleotide sequences of the pabA genes from Salmonella typhimurium, Klebsiella aerogenes and Serratia marcescens.

    PubMed

    Kaplan, J B; Merkel, W K; Nichols, B P

    1985-06-05

    The amide group of glutamine is a source of nitrogen in the biosynthesis of a variety of compounds. These reactions are catalyzed by a group of enzymes known as glutamine amidotransferases; two of these, the glutamine amidotransferase subunits of p-aminobenzoate synthase and anthranilate synthase have been studied in detail and have been shown to be structurally and functionally related. In some micro-organisms, p-aminobenzoate synthase and anthranilate synthase share a common glutamine amidotransferase subunit. We report here the primary DNA and deduced amino acid sequences of the p-aminobenzoate synthase glutamine amidotransferase subunits from Salmonella typhimurium, Klebsiella aerogenes and Serratia marcescens. A comparison of these glutamine amidotransferase sequences to the sequences of ten others, including some that function specifically in either the p-aminobenzoate synthase or anthranilate synthase complexes and some that are shared by both synthase complexes, has revealed several interesting features of the structure and organization of these genes, and has allowed us to speculate as to the evolutionary history of this family of enzymes. We propose a model for the evolution of the p-aminobenzoate synthase and anthranilate synthase glutamine amidotransferase subunits in which the duplication and subsequent divergence of the genetic information encoding a shared glutamine amidotransferase subunit led to the evolution of two new pathway-specific enzymes.

  3. The Function of Embryonic Stem Cell-expressed RAS (E-RAS), a Unique RAS Family Member, Correlates with Its Additional Motifs and Its Structural Properties.

    PubMed

    Nakhaei-Rad, Saeideh; Nakhaeizadeh, Hossein; Kordes, Claus; Cirstea, Ion C; Schmick, Malte; Dvorsky, Radovan; Bastiaens, Philippe I H; Häussinger, Dieter; Ahmadian, Mohammad Reza

    2015-06-19

    E-RAS is a member of the RAS family specifically expressed in embryonic stem cells, gastric tumors, and hepatic stellate cells. Unlike classical RAS isoforms (H-, N-, and K-RAS4B), E-RAS has, in addition to striking and remarkable sequence deviations, an extended 38-amino acid-long unique N-terminal region with still unknown functions. We investigated the molecular mechanism of E-RAS regulation and function with respect to its sequence and structural features. We found that N-terminal extension of E-RAS is important for E-RAS signaling activity. E-RAS protein most remarkably revealed a different mode of effector interaction as compared with H-RAS, which correlates with deviations in the effector-binding site of E-RAS. Of all these residues, tryptophan 79 (arginine 41 in H-RAS), in the interswitch region, modulates the effector selectivity of RAS proteins from H-RAS to E-RAS features. © 2015 by The American Society for Biochemistry and Molecular Biology, Inc.

  4. Utility of fat-suppressed sequences in differentiation of aggressive vs typical asymptomatic haemangioma of the spine.

    PubMed

    Nabavizadeh, Seyed Ali; Mamourian, Alexander; Schmitt, James E; Cloran, Francis; Vossough, Arastoo; Pukenas, Bryan; Loevner, Laurie A; Mohan, Suyash

    2016-01-01

    While haemangiomas are common benign vascular lesions involving the spine, some behave in an aggressive fashion. We investigated the utility of fat-suppressed sequences to differentiate between benign and aggressive vertebral haemangiomas. Patients with the diagnosis of aggressive vertebral haemangioma and available short tau inversion-recovery or T2 fat saturation sequence were included in the study. 11 patients with typical asymptomatic vertebral body haemangiomas were selected as the control group. Region of interest signal intensity (SI) analysis of the entire haemangioma as well as the portion of each haemangioma with highest signal on fat-saturation sequences was performed and normalized to a reference normal vertebral body. A total of 8 patients with aggressive vertebral haemangioma and 11 patients with asymptomatic typical vertebral haemangioma were included. There was a significant difference between total normalized mean SI ratio (3.14 vs 1.48, p = 0.0002), total normalized maximum SI ratio (5.72 vs 2.55, p = 0.0003), brightest normalized mean SI ratio (4.28 vs 1.72, p < 0.0001) and brightest normalized maximum SI ratio (5.25 vs 2.45, p = 0.0003). Multiple measures were able to discriminate between groups with high sensitivity (>88%) and specificity (>82%). In addition to the conventional imaging features such as vertebral expansion and presence of extravertebral component, quantitative evaluation of fat-suppression sequences is also another imaging feature that can differentiate aggressive haemangioma and typical asymptomatic haemangioma. The use of quantitative fat-suppressed MRI in vertebral haemangiomas is demonstrated. Quantitative fat-suppressed MRI can have a role in confirming the diagnosis of aggressive haemangiomas. In addition, this application can be further investigated in future studies to predict aggressiveness of vertebral haemangiomas in early stages.

  5. Feature selection using a one dimensional naïve Bayes' classifier increases the accuracy of support vector machine classification of CDR3 repertoires.

    PubMed

    Cinelli, Mattia; Sun, Yuxin; Best, Katharine; Heather, James M; Reich-Zeliger, Shlomit; Shifrut, Eric; Friedman, Nir; Shawe-Taylor, John; Chain, Benny

    2017-04-01

    Somatic DNA recombination, the hallmark of vertebrate adaptive immunity, has the potential to generate a vast diversity of antigen receptor sequences. How this diversity captures antigen specificity remains incompletely understood. In this study we use high throughput sequencing to compare the global changes in T cell receptor β chain complementarity determining region 3 (CDR3β) sequences following immunization with ovalbumin administered with complete Freund's adjuvant (CFA) or CFA alone. The CDR3β sequences were deconstructed into short stretches of overlapping contiguous amino acids. The motifs were ranked according to a one-dimensional Bayesian classifier score comparing their frequency in the repertoires of the two immunization classes. The top ranking motifs were selected and used to create feature vectors which were used to train a support vector machine. The support vector machine achieved high classification scores in a leave-one-out validation test reaching >90% in some cases. The study describes a novel two-stage classification strategy combining a one-dimensional Bayesian classifier with a support vector machine. Using this approach we demonstrate that the frequency of a small number of linear motifs three amino acids in length can accurately identify a CD4 T cell response to ovalbumin against a background response to the complex mixture of antigens which characterize Complete Freund's Adjuvant. The sequence data is available at www.ncbi.nlm.nih.gov/sra/?term¼SRP075893 . The Decombinator package is available at github.com/innate2adaptive/Decombinator . The R package e1071 is available at the CRAN repository https://cran.r-project.org/web/packages/e1071/index.html . b.chain@ucl.ac.uk. Supplementary data are available at Bioinformatics online. © The Author 2017. Published by Oxford University Press.

  6. PSOFuzzySVM-TMH: identification of transmembrane helix segments using ensemble feature space by incorporated fuzzy support vector machine.

    PubMed

    Hayat, Maqsood; Tahir, Muhammad

    2015-08-01

    Membrane protein is a central component of the cell that manages intra and extracellular processes. Membrane proteins execute a diversity of functions that are vital for the survival of organisms. The topology of transmembrane proteins describes the number of transmembrane (TM) helix segments and its orientation. However, owing to the lack of its recognized structures, the identification of TM helix and its topology through experimental methods is laborious with low throughput. In order to identify TM helix segments reliably, accurately, and effectively from topogenic sequences, we propose the PSOFuzzySVM-TMH model. In this model, evolutionary based information position specific scoring matrix and discrete based information 6-letter exchange group are used to formulate transmembrane protein sequences. The noisy and extraneous attributes are eradicated using an optimization selection technique, particle swarm optimization, from both feature spaces. Finally, the selected feature spaces are combined in order to form ensemble feature space. Fuzzy-support vector Machine is utilized as a classification algorithm. Two benchmark datasets, including low and high resolution datasets, are used. At various levels, the performance of the PSOFuzzySVM-TMH model is assessed through 10-fold cross validation test. The empirical results reveal that the proposed framework PSOFuzzySVM-TMH outperforms in terms of classification performance in the examined datasets. It is ascertained that the proposed model might be a useful and high throughput tool for academia and research community for further structure and functional studies on transmembrane proteins.

  7. Prediction of Protein-Protein Interaction Sites by Random Forest Algorithm with mRMR and IFS

    PubMed Central

    Li, Bi-Qing; Feng, Kai-Yan; Chen, Lei; Huang, Tao; Cai, Yu-Dong

    2012-01-01

    Prediction of protein-protein interaction (PPI) sites is one of the most challenging problems in computational biology. Although great progress has been made by employing various machine learning approaches with numerous characteristic features, the problem is still far from being solved. In this study, we developed a novel predictor based on Random Forest (RF) algorithm with the Minimum Redundancy Maximal Relevance (mRMR) method followed by incremental feature selection (IFS). We incorporated features of physicochemical/biochemical properties, sequence conservation, residual disorder, secondary structure and solvent accessibility. We also included five 3D structural features to predict protein-protein interaction sites and achieved an overall accuracy of 0.672997 and MCC of 0.347977. Feature analysis showed that 3D structural features such as Depth Index (DPX) and surface curvature (SC) contributed most to the prediction of protein-protein interaction sites. It was also shown via site-specific feature analysis that the features of individual residues from PPI sites contribute most to the determination of protein-protein interaction sites. It is anticipated that our prediction method will become a useful tool for identifying PPI sites, and that the feature analysis described in this paper will provide useful insights into the mechanisms of interaction. PMID:22937126

  8. Prediction of lysine ubiquitylation with ensemble classifier and feature selection.

    PubMed

    Zhao, Xiaowei; Li, Xiangtao; Ma, Zhiqiang; Yin, Minghao

    2011-01-01

    Ubiquitylation is an important process of post-translational modification. Correct identification of protein lysine ubiquitylation sites is of fundamental importance to understand the molecular mechanism of lysine ubiquitylation in biological systems. This paper develops a novel computational method to effectively identify the lysine ubiquitylation sites based on the ensemble approach. In the proposed method, 468 ubiquitylation sites from 323 proteins retrieved from the Swiss-Prot database were encoded into feature vectors by using four kinds of protein sequences information. An effective feature selection method was then applied to extract informative feature subsets. After different feature subsets were obtained by setting different starting points in the search procedure, they were used to train multiple random forests classifiers and then aggregated into a consensus classifier by majority voting. Evaluated by jackknife tests and independent tests respectively, the accuracy of the proposed predictor reached 76.82% for the training dataset and 79.16% for the test dataset, indicating that this predictor is a useful tool to predict lysine ubiquitylation sites. Furthermore, site-specific feature analysis was performed and it was shown that ubiquitylation is intimately correlated with the features of its surrounding sites in addition to features derived from the lysine site itself. The feature selection method is available upon request.

  9. Mechanism for Coordinated RNA Packaging and Genome Replication by Rotavirus Polymerase VP1

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lu, Xiaohui; McDonald, Sarah M.; Tortorici, M. Alejandra

    2009-04-08

    Rotavirus RNA-dependent RNA polymerase VP1 catalyzes RNA synthesis within a subviral particle. This activity depends on core shell protein VP2. A conserved sequence at the 3' end of plus-strand RNA templates is important for polymerase association and genome replication. We have determined the structure of VP1 at 2.9 {angstrom} resolution, as apoenzyme and in complex with RNA. The cage-like enzyme is similar to reovirus {lambda}3, with four tunnels leading to or from a central, catalytic cavity. A distinguishing characteristic of VP1 is specific recognition, by conserved features of the template-entry channel, of four bases, UGUG, in the conserved 3' sequence.more » Well-defined interactions with these bases position the RNA so that its 3' end overshoots the initiating register, producing a stable but catalytically inactive complex. We propose that specific 3' end recognition selects rotavirus RNA for packaging and that VP2 activates the autoinhibited VP1/RNA complex to coordinate packaging and genome replication.« less

  10. DNA Breaks and End Resection Measured Genome-wide by End Sequencing.

    PubMed

    Canela, Andres; Sridharan, Sriram; Sciascia, Nicholas; Tubbs, Anthony; Meltzer, Paul; Sleckman, Barry P; Nussenzweig, André

    2016-09-01

    DNA double-strand breaks (DSBs) arise during physiological transcription, DNA replication, and antigen receptor diversification. Mistargeting or misprocessing of DSBs can result in pathological structural variation and mutation. Here we describe a sensitive method (END-seq) to monitor DNA end resection and DSBs genome-wide at base-pair resolution in vivo. We utilized END-seq to determine the frequency and spectrum of restriction-enzyme-, zinc-finger-nuclease-, and RAG-induced DSBs. Beyond sequence preference, chromatin features dictate the repertoire of these genome-modifying enzymes. END-seq can detect at least one DSB per cell among 10,000 cells not harboring DSBs, and we estimate that up to one out of 60 cells contains off-target RAG cleavage. In addition to site-specific cleavage, we detect DSBs distributed over extended regions during immunoglobulin class-switch recombination. Thus, END-seq provides a snapshot of DNA ends genome-wide, which can be utilized for understanding genome-editing specificities and the influence of chromatin on DSB pathway choice. Published by Elsevier Inc.

  11. Predicting protein-binding RNA nucleotides with consideration of binding partners.

    PubMed

    Tuvshinjargal, Narankhuu; Lee, Wook; Park, Byungkyu; Han, Kyungsook

    2015-06-01

    In recent years several computational methods have been developed to predict RNA-binding sites in protein. Most of these methods do not consider interacting partners of a protein, so they predict the same RNA-binding sites for a given protein sequence even if the protein binds to different RNAs. Unlike the problem of predicting RNA-binding sites in protein, the problem of predicting protein-binding sites in RNA has received little attention mainly because it is much more difficult and shows a lower accuracy on average. In our previous study, we developed a method that predicts protein-binding nucleotides from an RNA sequence. In an effort to improve the prediction accuracy and usefulness of the previous method, we developed a new method that uses both RNA and protein sequence data. In this study, we identified effective features of RNA and protein molecules and developed a new support vector machine (SVM) model to predict protein-binding nucleotides from RNA and protein sequence data. The new model that used both protein and RNA sequence data achieved a sensitivity of 86.5%, a specificity of 86.2%, a positive predictive value (PPV) of 72.6%, a negative predictive value (NPV) of 93.8% and Matthews correlation coefficient (MCC) of 0.69 in a 10-fold cross validation; it achieved a sensitivity of 58.8%, a specificity of 87.4%, a PPV of 65.1%, a NPV of 84.2% and MCC of 0.48 in independent testing. For comparative purpose, we built another prediction model that used RNA sequence data alone and ran it on the same dataset. In a 10 fold-cross validation it achieved a sensitivity of 85.7%, a specificity of 80.5%, a PPV of 67.7%, a NPV of 92.2% and MCC of 0.63; in independent testing it achieved a sensitivity of 67.7%, a specificity of 78.8%, a PPV of 57.6%, a NPV of 85.2% and MCC of 0.45. In both cross-validations and independent testing, the new model that used both RNA and protein sequences showed a better performance than the model that used RNA sequence data alone in most performance measures. To the best of our knowledge, this is the first sequence-based prediction of protein-binding nucleotides in RNA which considers the binding partner of RNA. The new model will provide valuable information for designing biochemical experiments to find putative protein-binding sites in RNA with unknown structure. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.

  12. The Bologna Annotation Resource (BAR 3.0): improving protein functional annotation

    PubMed Central

    Casadio, Rita

    2017-01-01

    Abstract BAR 3.0 updates our server BAR (Bologna Annotation Resource) for predicting protein structural and functional features from sequence. We increase data volume, query capabilities and information conveyed to the user. The core of BAR 3.0 is a graph-based clustering procedure of UniProtKB sequences, following strict pairwise similarity criteria (sequence identity ≥40% with alignment coverage ≥90%). Each cluster contains the available annotation downloaded from UniProtKB, GO, PFAM and PDB. After statistical validation, GO terms and PFAM domains are cluster-specific and annotate new sequences entering the cluster after satisfying similarity constraints. BAR 3.0 includes 28 869 663 sequences in 1 361 773 clusters, of which 22.2% (22 241 661 sequences) and 47.4% (24 555 055 sequences) have at least one validated GO term and one PFAM domain, respectively. 1.4% of the clusters (36% of all sequences) include PDB structures and the cluster is associated to a hidden Markov model that allows building template-target alignment suitable for structural modeling. Some other 3 399 026 sequences are singletons. BAR 3.0 offers an improved search interface, allowing queries by UniProtKB-accession, Fasta sequence, GO-term, PFAM-domain, organism, PDB and ligand/s. When evaluated on the CAFA2 targets, BAR 3.0 largely outperforms our previous version and scores among state-of-the-art methods. BAR 3.0 is publicly available and accessible at http://bar.biocomp.unibo.it/bar3. PMID:28453653

  13. Deep learning for pharmacovigilance: recurrent neural network architectures for labeling adverse drug reactions in Twitter posts.

    PubMed

    Cocos, Anne; Fiks, Alexander G; Masino, Aaron J

    2017-07-01

    Social media is an important pharmacovigilance data source for adverse drug reaction (ADR) identification. Human review of social media data is infeasible due to data quantity, thus natural language processing techniques are necessary. Social media includes informal vocabulary and irregular grammar, which challenge natural language processing methods. Our objective is to develop a scalable, deep-learning approach that exceeds state-of-the-art ADR detection performance in social media. We developed a recurrent neural network (RNN) model that labels words in an input sequence with ADR membership tags. The only input features are word-embedding vectors, which can be formed through task-independent pretraining or during ADR detection training. Our best-performing RNN model used pretrained word embeddings created from a large, non-domain-specific Twitter dataset. It achieved an approximate match F-measure of 0.755 for ADR identification on the dataset, compared to 0.631 for a baseline lexicon system and 0.65 for the state-of-the-art conditional random field model. Feature analysis indicated that semantic information in pretrained word embeddings boosted sensitivity and, combined with contextual awareness captured in the RNN, precision. Our model required no task-specific feature engineering, suggesting generalizability to additional sequence-labeling tasks. Learning curve analysis showed that our model reached optimal performance with fewer training examples than the other models. ADR detection performance in social media is significantly improved by using a contextually aware model and word embeddings formed from large, unlabeled datasets. The approach reduces manual data-labeling requirements and is scalable to large social media datasets. © The Author 2017. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com

  14. "iSS-Hyb-mRMR": Identification of splicing sites using hybrid space of pseudo trinucleotide and pseudo tetranucleotide composition.

    PubMed

    Iqbal, Muhammad; Hayat, Maqsood

    2016-05-01

    Gene splicing is a vital source of protein diversity. Perfectly eradication of introns and joining exons is the prominent task in eukaryotic gene expression, as exons are usually interrupted by introns. Identification of splicing sites through experimental techniques is complicated and time-consuming task. With the avalanche of genome sequences generated in the post genomic age, it remains a complicated and challenging task to develop an automatic, robust and reliable computational method for fast and effective identification of splicing sites. In this study, a hybrid model "iSS-Hyb-mRMR" is proposed for quickly and accurately identification of splicing sites. Two sample representation methods namely; pseudo trinucleotide composition (PseTNC) and pseudo tetranucleotide composition (PseTetraNC) were used to extract numerical descriptors from DNA sequences. Hybrid model was developed by concatenating PseTNC and PseTetraNC. In order to select high discriminative features, minimum redundancy maximum relevance algorithm was applied on the hybrid feature space. The performance of these feature representation methods was tested using various classification algorithms including K-nearest neighbor, probabilistic neural network, general regression neural network, and fitting network. Jackknife test was used for evaluation of its performance on two benchmark datasets S1 and S2, respectively. The predictor, proposed in the current study achieved an accuracy of 93.26%, sensitivity of 88.77%, and specificity of 97.78% for S1, and the accuracy of 94.12%, sensitivity of 87.14%, and specificity of 98.64% for S2, respectively. It is observed, that the performance of proposed model is higher than the existing methods in the literature so for; and will be fruitful in the mechanism of RNA splicing, and other research academia. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.

  15. Multi-channel MRI segmentation of eye structures and tumors using patient-specific features

    PubMed Central

    Ciller, Carlos; De Zanet, Sandro; Kamnitsas, Konstantinos; Maeder, Philippe; Glocker, Ben; Munier, Francis L.; Rueckert, Daniel; Thiran, Jean-Philippe

    2017-01-01

    Retinoblastoma and uveal melanoma are fast spreading eye tumors usually diagnosed by using 2D Fundus Image Photography (Fundus) and 2D Ultrasound (US). Diagnosis and treatment planning of such diseases often require additional complementary imaging to confirm the tumor extend via 3D Magnetic Resonance Imaging (MRI). In this context, having automatic segmentations to estimate the size and the distribution of the pathological tissue would be advantageous towards tumor characterization. Until now, the alternative has been the manual delineation of eye structures, a rather time consuming and error-prone task, to be conducted in multiple MRI sequences simultaneously. This situation, and the lack of tools for accurate eye MRI analysis, reduces the interest in MRI beyond the qualitative evaluation of the optic nerve invasion and the confirmation of recurrent malignancies below calcified tumors. In this manuscript, we propose a new framework for the automatic segmentation of eye structures and ocular tumors in multi-sequence MRI. Our key contribution is the introduction of a pathological eye model from which Eye Patient-Specific Features (EPSF) can be computed. These features combine intensity and shape information of pathological tissue while embedded in healthy structures of the eye. We assess our work on a dataset of pathological patient eyes by computing the Dice Similarity Coefficient (DSC) of the sclera, the cornea, the vitreous humor, the lens and the tumor. In addition, we quantitatively show the superior performance of our pathological eye model as compared to the segmentation obtained by using a healthy model (over 4% DSC) and demonstrate the relevance of our EPSF, which improve the final segmentation regardless of the classifier employed. PMID:28350816

  16. Species composition of the genus Saprolegnia in fin fish aquaculture environments, as determined by nucleotide sequence analysis of the nuclear rDNA ITS regions.

    PubMed

    de la Bastide, Paul Y; Leung, Wai Lam; Hintz, William E

    2015-01-01

    The ITS region of the rDNA gene was compared for Saprolegnia spp. in order to improve our understanding of nucleotide sequence variability within and between species of this genus, determine species composition in Canadian fin fish aquaculture facilities, and to assess the utility of ITS sequence variability in genetic marker development. From a collection of more than 400 field isolates, ITS region nucleotide sequences were studied and it was determined that there was sufficient consistent inter-specific variation to support the designation of species identity based on ITS sequence data. This non-subjective approach to species identification does not rely upon transient morphological features. Phylogenetic analyses comparing our ITS sequences and species designations with data from previous studies generally supported the clade scheme of Diéguez-Uribeondo et al. (2007) and found agreement with the molecular taxonomic cluster system of Sandoval-Sierra et al. (2014). Our Canadian ITS sequence collection will thus contribute to the public database and assist the clarification of Saprolegnia spp. taxonomy. The analysis of ITS region sequence variability facilitated genus- and species-level identification of unknown samples from aquaculture facilities and provided useful information on species composition. A unique ITS-RFLP for the identification of S. parasitica was also described. Copyright © 2014 The British Mycological Society. Published by Elsevier Ltd. All rights reserved.

  17. Trypanosomatidae: Phytomonas detection in plants and phytophagous insects by PCR amplification of a genus-specific sequence of the spliced leader gene.

    PubMed

    Serrano, M G; Nunes, L R; Campaner, M; Buck, G A; Camargo, E P; Teixeira, M M

    1999-03-01

    In this paper we describe a method for the detection of Phytomonas spp. from plants and phytophagous insects using the PCR technique by targeting a genus-specific sequence of the spliced leader (SL) gene. PCR amplification of DNA from 48 plant and insect isolates previously classified as Phytomonas by morphological, biochemical, and molecular criteria resulted in all cases in a 100-bp fragment that hybridized with the Phytomonas-specific spliced leader-derived probe SL3'. Moreover, this Phytomonas-specific PCR could also detect Phytomonas spp. in crude preparations of naturally infected plants and insects. This method shows no reaction with any other trypanosomatid genera or with plant and insect host DNA, revealing it to be able to detect Phytomonas spp. from fruit, latex, or phloem of various host plants as well as from salivary glands and digestive tubes of several species of insect hosts. Results demonstrated that SLPCR is a simple, fast, specific, and sensitive method that can be applied to the diagnosis of Phytomonas among cultured trypanosomatids and directly in plants and putative vector insects. Therefore, the method was shown to be a very specific and sensitive tool for diagnosis of Phytomonas without the need for isolation, culture, and DNA extraction of flagellates, a feature that is very convenient for practical and epidemiological purposes. Copyright 1999 Academic Press.

  18. Characterizing informative sequence descriptors and predicting binding affinities of heterodimeric protein complexes.

    PubMed

    Srinivasulu, Yerukala Sathipati; Wang, Jyun-Rong; Hsu, Kai-Ti; Tsai, Ming-Ju; Charoenkwan, Phasit; Huang, Wen-Lin; Huang, Hui-Ling; Ho, Shinn-Ying

    2015-01-01

    Protein-protein interactions (PPIs) are involved in various biological processes, and underlying mechanism of the interactions plays a crucial role in therapeutics and protein engineering. Most machine learning approaches have been developed for predicting the binding affinity of protein-protein complexes based on structure and functional information. This work aims to predict the binding affinity of heterodimeric protein complexes from sequences only. This work proposes a support vector machine (SVM) based binding affinity classifier, called SVM-BAC, to classify heterodimeric protein complexes based on the prediction of their binding affinity. SVM-BAC identified 14 of 580 sequence descriptors (physicochemical, energetic and conformational properties of the 20 amino acids) to classify 216 heterodimeric protein complexes into low and high binding affinity. SVM-BAC yielded the training accuracy, sensitivity, specificity, AUC and test accuracy of 85.80%, 0.89, 0.83, 0.86 and 83.33%, respectively, better than existing machine learning algorithms. The 14 features and support vector regression were further used to estimate the binding affinities (Pkd) of 200 heterodimeric protein complexes. Prediction performance of a Jackknife test was the correlation coefficient of 0.34 and mean absolute error of 1.4. We further analyze three informative physicochemical properties according to their contribution to prediction performance. Results reveal that the following properties are effective in predicting the binding affinity of heterodimeric protein complexes: apparent partition energy based on buried molar fractions, relations between chemical structure and biological activity in principal component analysis IV, and normalized frequency of beta turn. The proposed sequence-based prediction method SVM-BAC uses an optimal feature selection method to identify 14 informative features to classify and predict binding affinity of heterodimeric protein complexes. The characterization analysis revealed that the average numbers of beta turns and hydrogen bonds at protein-protein interfaces in high binding affinity complexes are more than those in low binding affinity complexes.

  19. Characterizing informative sequence descriptors and predicting binding affinities of heterodimeric protein complexes

    PubMed Central

    2015-01-01

    Background Protein-protein interactions (PPIs) are involved in various biological processes, and underlying mechanism of the interactions plays a crucial role in therapeutics and protein engineering. Most machine learning approaches have been developed for predicting the binding affinity of protein-protein complexes based on structure and functional information. This work aims to predict the binding affinity of heterodimeric protein complexes from sequences only. Results This work proposes a support vector machine (SVM) based binding affinity classifier, called SVM-BAC, to classify heterodimeric protein complexes based on the prediction of their binding affinity. SVM-BAC identified 14 of 580 sequence descriptors (physicochemical, energetic and conformational properties of the 20 amino acids) to classify 216 heterodimeric protein complexes into low and high binding affinity. SVM-BAC yielded the training accuracy, sensitivity, specificity, AUC and test accuracy of 85.80%, 0.89, 0.83, 0.86 and 83.33%, respectively, better than existing machine learning algorithms. The 14 features and support vector regression were further used to estimate the binding affinities (Pkd) of 200 heterodimeric protein complexes. Prediction performance of a Jackknife test was the correlation coefficient of 0.34 and mean absolute error of 1.4. We further analyze three informative physicochemical properties according to their contribution to prediction performance. Results reveal that the following properties are effective in predicting the binding affinity of heterodimeric protein complexes: apparent partition energy based on buried molar fractions, relations between chemical structure and biological activity in principal component analysis IV, and normalized frequency of beta turn. Conclusions The proposed sequence-based prediction method SVM-BAC uses an optimal feature selection method to identify 14 informative features to classify and predict binding affinity of heterodimeric protein complexes. The characterization analysis revealed that the average numbers of beta turns and hydrogen bonds at protein-protein interfaces in high binding affinity complexes are more than those in low binding affinity complexes. PMID:26681483

  20. Targeted exome sequencing and chromosomal microarray for the molecular diagnosis of nevoid basal cell carcinoma syndrome.

    PubMed

    Matsudate, Yoshihiro; Naruto, Takuya; Hayashi, Yumiko; Minami, Mitsuyoshi; Tohyama, Mikiko; Yokota, Kenji; Yamada, Daisuke; Imoto, Issei; Kubo, Yoshiaki

    2017-06-01

    Nevoid basal cell carcinoma syndrome (NBCCS) is an autosomal dominant disorder mainly caused by heterozygous mutations of PTCH1. In addition to characteristic clinical features, detection of a mutation in causative genes is reliable for the diagnosis of NBCCS; however, no mutations have been identified in some patients using conventional methods. To improve the method for the molecular diagnosis of NBCCS. We performed targeted exome sequencing (TES) analysis using a multi-gene panel, including PTCH1, PTCH2, SUFU, and other sonic hedgehog signaling pathway-related genes, based on next-generation sequencing (NGS) technology in 8 cases in whom possible causative mutations were not detected by previously performed conventional analysis and 2 recent cases of NBCCS. Subsequent analysis of gross deletion within or around PTCH1 detected by TES was performed using chromosomal microarray (CMA). Through TES analysis, specific single nucleotide variants or small indels of PTCH1 causing inferred amino acid changes were identified in 2 novel cases and 2 undiagnosed cases, whereas gross deletions within or around PTCH1, which are validated by CMA, were found in 3 undiagnosed cases. However, no mutations were detected even by TES in 3 cases. Among 3 cases with gross deletions of PTCH1, deletions containing the entire PTCH1 and additional neighboring genes were detected in 2 cases, one of which exhibited atypical clinical features, such as severe mental retardation, likely associated with genes located within the 4.3Mb deleted region, especially. TES-based simultaneous evaluation of sequences and copy number status in all targeted coding exons by NGS is likely to be more useful for the molecular diagnosis of NBCCS than conventional methods. CMA is recommended as a subsequent analysis for validation and detailed mapping of deleted regions, which may explain the atypical clinical features of NBCCS cases. Copyright © 2017 Japanese Society for Investigative Dermatology. Published by Elsevier B.V. All rights reserved.

  1. Sequence-Based Prediction of RNA-Binding Proteins Using Random Forest with Minimum Redundancy Maximum Relevance Feature Selection.

    PubMed

    Ma, Xin; Guo, Jing; Sun, Xiao

    2015-01-01

    The prediction of RNA-binding proteins is one of the most challenging problems in computation biology. Although some studies have investigated this problem, the accuracy of prediction is still not sufficient. In this study, a highly accurate method was developed to predict RNA-binding proteins from amino acid sequences using random forests with the minimum redundancy maximum relevance (mRMR) method, followed by incremental feature selection (IFS). We incorporated features of conjoint triad features and three novel features: binding propensity (BP), nonbinding propensity (NBP), and evolutionary information combined with physicochemical properties (EIPP). The results showed that these novel features have important roles in improving the performance of the predictor. Using the mRMR-IFS method, our predictor achieved the best performance (86.62% accuracy and 0.737 Matthews correlation coefficient). High prediction accuracy and successful prediction performance suggested that our method can be a useful approach to identify RNA-binding proteins from sequence information.

  2. ZifBASE: a database of zinc finger proteins and associated resources.

    PubMed

    Jayakanthan, Mannu; Muthukumaran, Jayaraman; Chandrasekar, Sanniyasi; Chawla, Konika; Punetha, Ankita; Sundar, Durai

    2009-09-09

    Information on the occurrence of zinc finger protein motifs in genomes is crucial to the developing field of molecular genome engineering. The knowledge of their target DNA-binding sequences is vital to develop chimeric proteins for targeted genome engineering and site-specific gene correction. There is a need to develop a computational resource of zinc finger proteins (ZFP) to identify the potential binding sites and its location, which reduce the time of in vivo task, and overcome the difficulties in selecting the specific type of zinc finger protein and the target site in the DNA sequence. ZifBASE provides an extensive collection of various natural and engineered ZFP. It uses standard names and a genetic and structural classification scheme to present data retrieved from UniProtKB, GenBank, Protein Data Bank, ModBase, Protein Model Portal and the literature. It also incorporates specialized features of ZFP including finger sequences and positions, number of fingers, physiochemical properties, classes, framework, PubMed citations with links to experimental structures (PDB, if available) and modeled structures of natural zinc finger proteins. ZifBASE provides information on zinc finger proteins (both natural and engineered ones), the number of finger units in each of the zinc finger proteins (with multiple fingers), the synergy between the adjacent fingers and their positions. Additionally, it gives the individual finger sequence and their target DNA site to which it binds for better and clear understanding on the interactions of adjacent fingers. The current version of ZifBASE contains 139 entries of which 89 are engineered ZFPs, containing 3-7F totaling to 296 fingers. There are 50 natural zinc finger protein entries ranging from 2-13F, totaling to 307 fingers. It has sequences and structures from literature, Protein Data Bank, ModBase and Protein Model Portal. The interface is cross linked to other public databases like UniprotKB, PDB, ModBase and Protein Model Portal and PubMed for making it more informative. A database is established to maintain the information of the sequence features, including the class, framework, number of fingers, residues, position, recognition site and physio-chemical properties (molecular weight, isoelectric point) of both natural and engineered zinc finger proteins and dissociation constant of few. ZifBASE can provide more effective and efficient way of accessing the zinc finger protein sequences and their target binding sites with the links to their three-dimensional structures. All the data and functions are available at the advanced web-based search interface http://web.iitd.ac.in/~sundar/zifbase.

  3. Microfabricated Fountain Pens for High-Density DNA Arrays

    PubMed Central

    Reese, Matthew O.; van Dam, R. Michae; Scherer, Axel; Quake, Stephen R.

    2003-01-01

    We used photolithographic microfabrication techniques to create very small stainless steel fountain pens that were installed in place of conventional pens on a microarray spotter. Because of the small feature size produced by the microfabricated pens, we were able to print arrays with up to 25,000 spots/cm2, significantly higher than can be achieved by other deposition methods. This feature density is sufficiently large that a standard microscope slide can contain multiple replicates of every gene in a complex organism such as a mouse or human. We tested carryover during array printing with dye solution, labeled DNA, and hybridized DNA, and we found it to be indistinguishable from background. Hybridization also showed good sequence specificity to printed oligonucleotides. In addition to improved slide capacity, the microfabrication process offers the possibility of low-cost mass-produced pens and the flexibility to include novel pen features that cannot be machined with conventional techniques. PMID:12975313

  4. Electrophysiological correlates of the retention of tones differing in timbre in auditory short-term memory.

    PubMed

    Nolden, Sophie; Bermudez, Patrick; Alunni-Menichini, Kristelle; Lefebvre, Christine; Grimault, Stephan; Jolicoeur, Pierre

    2013-11-01

    We examined the electrophysiological correlates of retention in auditory short-term memory (ASTM) for sequences of one, two, or three tones differing in timbre but having the same pitch. We focused on event-related potentials (ERPs) during the retention interval and revealed a sustained fronto-central ERP component (most likely a sustained anterior negativity; SAN) that became more negative as memory load increased. Our results are consistent with recent ERP studies on the retention of pitch and suggest that the SAN reflects brain activity mediating the low-level retention of basic acoustic features in ASTM. The present work shows that the retention of timbre shares common features with the retention of pitch, hence supporting the notion that the retention of basic sensory features is an active process that recruits modality-specific brain areas. © 2013 Elsevier Ltd. All rights reserved.

  5. Molecular Structure and Sequence in Complex Coacervates

    NASA Astrophysics Data System (ADS)

    Sing, Charles; Lytle, Tyler; Madinya, Jason; Radhakrishna, Mithun

    Oppositely-charged polyelectrolytes in aqueous solution can undergo associative phase separation, in a process known as complex coacervation. This results in a polyelectrolyte-dense phase (coacervate) and polyelectrolyte-dilute phase (supernatant). There remain challenges in understanding this process, despite a long history in polymer physics. We use Monte Carlo simulation to demonstrate that molecular features (charge spacing, size) play a crucial role in governing the equilibrium in coacervates. We show how these molecular features give rise to strong monomer sequence effects, due to a combination of counterion condensation and correlation effects. We distinguish between structural and sequence-based correlations, which can be designed to tune the phase diagram of coacervation. Sequence effects further inform the physical understanding of coacervation, and provide the basis for new coacervation models that take monomer-level features into account.

  6. Maximizing lipocalin prediction through balanced and diversified training set and decision fusion.

    PubMed

    Nath, Abhigyan; Subbiah, Karthikeyan

    2015-12-01

    Lipocalins are short in sequence length and perform several important biological functions. These proteins are having less than 20% sequence similarity among paralogs. Experimentally identifying them is an expensive and time consuming process. The computational methods based on the sequence similarity for allocating putative members to this family are also far elusive due to the low sequence similarity existing among the members of this family. Consequently, the machine learning methods become a viable alternative for their prediction by using the underlying sequence/structurally derived features as the input. Ideally, any machine learning based prediction method must be trained with all possible variations in the input feature vector (all the sub-class input patterns) to achieve perfect learning. A near perfect learning can be achieved by training the model with diverse types of input instances belonging to the different regions of the entire input space. Furthermore, the prediction performance can be improved through balancing the training set as the imbalanced data sets will tend to produce the prediction bias towards majority class and its sub-classes. This paper is aimed to achieve (i) the high generalization ability without any classification bias through the diversified and balanced training sets as well as (ii) enhanced the prediction accuracy by combining the results of individual classifiers with an appropriate fusion scheme. Instead of creating the training set randomly, we have first used the unsupervised Kmeans clustering algorithm to create diversified clusters of input patterns and created the diversified and balanced training set by selecting an equal number of patterns from each of these clusters. Finally, probability based classifier fusion scheme was applied on boosted random forest algorithm (which produced greater sensitivity) and K nearest neighbour algorithm (which produced greater specificity) to achieve the enhanced predictive performance than that of individual base classifiers. The performance of the learned models trained on Kmeans preprocessed training set is far better than the randomly generated training sets. The proposed method achieved a sensitivity of 90.6%, specificity of 91.4% and accuracy of 91.0% on the first test set and sensitivity of 92.9%, specificity of 96.2% and accuracy of 94.7% on the second blind test set. These results have established that diversifying training set improves the performance of predictive models through superior generalization ability and balancing the training set improves prediction accuracy. For smaller data sets, unsupervised Kmeans based sampling can be an effective technique to increase generalization than that of the usual random splitting method. Copyright © 2015 Elsevier Ltd. All rights reserved.

  7. Molecular Strain Typing of Mycobacterium tuberculosis: a Review of Frequently Used Methods

    PubMed Central

    2016-01-01

    Tuberculosis, caused by the bacterium Mycobacterium tuberculosis, remains one of the most serious global health problems. Molecular typing of M. tuberculosis has been used for various epidemiologic purposes as well as for clinical management. Currently, many techniques are available to type M. tuberculosis. Choosing the most appropriate technique in accordance with the existing laboratory conditions and the specific features of the geographic region is important. Insertion sequence IS6110-based restriction fragment length polymorphism (RFLP) analysis is considered the gold standard for the molecular epidemiologic investigations of tuberculosis. However, other polymerase chain reaction-based methods such as spacer oligonucleotide typing (spoligotyping), which detects 43 spacer sequence-interspersing direct repeats (DRs) in the genomic DR region; mycobacterial interspersed repetitive units–variable number tandem repeats, (MIRU-VNTR), which determines the number and size of tandem repetitive DNA sequences; repetitive-sequence-based PCR (rep-PCR), which provides high-throughput genotypic fingerprinting of multiple Mycobacterium species; and the recently developed genome-based whole genome sequencing methods demonstrate similar discriminatory power and greater convenience. This review focuses on techniques frequently used for the molecular typing of M. tuberculosis and discusses their general aspects and applications. PMID:27709842

  8. HmtDB 2016: data update, a better performing query system and human mitochondrial DNA haplogroup predictor

    PubMed Central

    Clima, Rosanna; Preste, Roberto; Calabrese, Claudia; Diroma, Maria Angela; Santorsola, Mariangela; Scioscia, Gaetano; Simone, Domenico; Shen, Lishuang; Gasparre, Giuseppe; Attimonelli, Marcella

    2017-01-01

    The HmtDB resource hosts a database of human mitochondrial genome sequences from individuals with healthy and disease phenotypes. The database is intended to support both population geneticists as well as clinicians undertaking the task to assess the pathogenicity of specific mtDNA mutations. The wide application of next-generation sequencing (NGS) has provided an enormous volume of high-resolution data at a low price, increasing the availability of human mitochondrial sequencing data, which called for a cogent and significant expansion of HmtDB data content that has more than tripled in the current release. We here describe additional novel features, including: (i) a complete, user-friendly restyling of the web interface, (ii) links to the command-line stand-alone and web versions of the MToolBox package, an up-to-date tool to reconstruct and analyze human mitochondrial DNA from NGS data and (iii) the implementation of the Reconstructed Sapiens Reference Sequence (RSRS) as mitochondrial reference sequence. The overall update renders HmtDB an even more handy and useful resource as it enables a more rapid data access, processing and analysis. HmtDB is accessible at http://www.hmtdb.uniba.it/. PMID:27899581

  9. Molecular Strain Typing of Mycobacterium tuberculosis: a Review of Frequently Used Methods.

    PubMed

    Ei, Phyu Win; Aung, Wah Wah; Lee, Jong Seok; Choi, Go Eun; Chang, Chulhun L

    2016-11-01

    Tuberculosis, caused by the bacterium Mycobacterium tuberculosis, remains one of the most serious global health problems. Molecular typing of M. tuberculosis has been used for various epidemiologic purposes as well as for clinical management. Currently, many techniques are available to type M. tuberculosis. Choosing the most appropriate technique in accordance with the existing laboratory conditions and the specific features of the geographic region is important. Insertion sequence IS6110-based restriction fragment length polymorphism (RFLP) analysis is considered the gold standard for the molecular epidemiologic investigations of tuberculosis. However, other polymerase chain reaction-based methods such as spacer oligonucleotide typing (spoligotyping), which detects 43 spacer sequence-interspersing direct repeats (DRs) in the genomic DR region; mycobacterial interspersed repetitive units-variable number tandem repeats, (MIRU-VNTR), which determines the number and size of tandem repetitive DNA sequences; repetitive-sequence-based PCR (rep-PCR), which provides high-throughput genotypic fingerprinting of multiple Mycobacterium species; and the recently developed genome-based whole genome sequencing methods demonstrate similar discriminatory power and greater convenience. This review focuses on techniques frequently used for the molecular typing of M. tuberculosis and discusses their general aspects and applications.

  10. Combined sequence and structure analysis of the fungal laccase family.

    PubMed

    Kumar, S V Suresh; Phale, Prashant S; Durani, S; Wangikar, Pramod P

    2003-08-20

    Plant and fungal laccases belong to the family of multi-copper oxidases and show much broader substrate specificity than other members of the family. Laccases have consequently been of interest for potential industrial applications. We have analyzed the essential sequence features of fungal laccases based on multiple sequence alignments of more than 100 laccases. This has resulted in identification of a set of four ungapped sequence regions, L1-L4, as the overall signature sequences that can be used to identify the laccases, distinguishing them within the broader class of multi-copper oxidases. The 12 amino acid residues in the enzymes serving as the copper ligands are housed within these four identified conserved regions, of which L2 and L4 conform to the earlier reported copper signature sequences of multi-copper oxidases while L1 and L3 are distinctive to the laccases. The mapping of regions L1-L4 on to the three-dimensional structure of the Coprinus cinerius laccase indicates that many of the non-copper-ligating residues of the conserved regions could be critical in maintaining a specific, more or less C-2 symmetric, protein conformational motif characterizing the active site apparatus of the enzymes. The observed intraprotein homologies between L1 and L3 and between L2 and L4 at both the structure and the sequence levels suggest that the quasi C-2 symmetric active site conformational motif may have arisen from a structural duplication event that neither the sequence homology analysis nor the structure homology analysis alone would have unraveled. Although the sequence and structure homology is not detectable in the rest of the protein, the relative orientation of region L1 with L2 is similar to that of L3 with L4. The structure duplication of first-shell and second-shell residues has become cryptic because the intraprotein sequence homology noticeable for a given laccase becomes significant only after comparing the conservation pattern in several fungal laccases. The identified motifs, L1-L4, can be useful in searching the newly sequenced genomes for putative laccase enzymes. Copyright 2003 Wiley Periodicals, Inc. Biotechnol Bioeng 83: 386-394, 2003.

  11. DNA-Encoded Solid-Phase Synthesis: Encoding Language Design and Complex Oligomer Library Synthesis.

    PubMed

    MacConnell, Andrew B; McEnaney, Patrick J; Cavett, Valerie J; Paegel, Brian M

    2015-09-14

    The promise of exploiting combinatorial synthesis for small molecule discovery remains unfulfilled due primarily to the "structure elucidation problem": the back-end mass spectrometric analysis that significantly restricts one-bead-one-compound (OBOC) library complexity. The very molecular features that confer binding potency and specificity, such as stereochemistry, regiochemistry, and scaffold rigidity, are conspicuously absent from most libraries because isomerism introduces mass redundancy and diverse scaffolds yield uninterpretable MS fragmentation. Here we present DNA-encoded solid-phase synthesis (DESPS), comprising parallel compound synthesis in organic solvent and aqueous enzymatic ligation of unprotected encoding dsDNA oligonucleotides. Computational encoding language design yielded 148 thermodynamically optimized sequences with Hamming string distance ≥ 3 and total read length <100 bases for facile sequencing. Ligation is efficient (70% yield), specific, and directional over 6 encoding positions. A series of isomers served as a testbed for DESPS's utility in split-and-pool diversification. Single-bead quantitative PCR detected 9 × 10(4) molecules/bead and sequencing allowed for elucidation of each compound's synthetic history. We applied DESPS to the combinatorial synthesis of a 75,645-member OBOC library containing scaffold, stereochemical and regiochemical diversity using mixed-scale resin (160-μm quality control beads and 10-μm screening beads). Tandem DNA sequencing/MALDI-TOF MS analysis of 19 quality control beads showed excellent agreement (<1 ppt) between DNA sequence-predicted mass and the observed mass. DESPS synergistically unites the advantages of solid-phase synthesis and DNA encoding, enabling single-bead structural elucidation of complex compounds and synthesis using reactions normally considered incompatible with unprotected DNA. The widespread availability of inexpensive oligonucleotide synthesis, enzymes, DNA sequencing, and PCR make implementation of DESPS straightforward, and may prompt the chemistry community to revisit the synthesis of more complex and diverse libraries.

  12. Mixed-complexity artificial grammar learning in humans and macaque monkeys: evaluating learning strategies.

    PubMed

    Wilson, Benjamin; Smith, Kenny; Petkov, Christopher I

    2015-03-01

    Artificial grammars (AG) can be used to generate rule-based sequences of stimuli. Some of these can be used to investigate sequence-processing computations in non-human animals that might be related to, but not unique to, human language. Previous AG learning studies in non-human animals have used different AGs to separately test for specific sequence-processing abilities. However, given that natural language and certain animal communication systems (in particular, song) have multiple levels of complexity, mixed-complexity AGs are needed to simultaneously evaluate sensitivity to the different features of the AG. Here, we tested humans and Rhesus macaques using a mixed-complexity auditory AG, containing both adjacent (local) and non-adjacent (longer-distance) relationships. Following exposure to exemplary sequences generated by the AG, humans and macaques were individually tested with sequences that were either consistent with the AG or violated specific adjacent or non-adjacent relationships. We observed a considerable level of cross-species correspondence in the sensitivity of both humans and macaques to the adjacent AG relationships and to the statistical properties of the sequences. We found no significant sensitivity to the non-adjacent AG relationships in the macaques. A subset of humans was sensitive to this non-adjacent relationship, revealing interesting between- and within-species differences in AG learning strategies. The results suggest that humans and macaques are largely comparably sensitive to the adjacent AG relationships and their statistical properties. However, in the presence of multiple cues to grammaticality, the non-adjacent relationships are less salient to the macaques and many of the humans. © 2015 The Authors. European Journal of Neuroscience published by Federation of European Neuroscience Societies and John Wiley & Sons Ltd.

  13. Genome-Wide Characterization of RNA Editing in Chicken Embryos Reveals Common Features among Vertebrates.

    PubMed

    Frésard, Laure; Leroux, Sophie; Roux, Pierre-François; Klopp, Christophe; Fabre, Stéphane; Esquerré, Diane; Dehais, Patrice; Djari, Anis; Gourichon, David; Lagarrigue, Sandrine; Pitel, Frédérique

    2015-01-01

    RNA editing results in a post-transcriptional nucleotide change in the RNA sequence that creates an alternative nucleotide not present in the DNA sequence. This leads to a diversification of transcription products with potential functional consequences. Two nucleotide substitutions are mainly described in animals, from adenosine to inosine (A-to-I) and from cytidine to uridine (C-to-U). This phenomenon is described in more details in mammals, notably since the availability of next generation sequencing technologies allowing whole genome screening of RNA-DNA differences. The number of studies recording RNA editing in other vertebrates like chicken is still limited. We chose to use high throughput sequencing technologies to search for RNA editing in chicken, and to extend the knowledge of its conservation among vertebrates. We performed sequencing of RNA and DNA from 8 embryos. Being aware of common pitfalls inherent to sequence analyses that lead to false positive discovery, we stringently filtered our datasets and found fewer than 40 reliable candidates. Conservation of particular sites of RNA editing was attested by the presence of 3 edited sites previously detected in mammals. We then characterized editing levels for selected candidates in several tissues and at different time points, from 4.5 days of embryonic development to adults, and observed a clear tissue-specificity and a gradual increase of editing level with time. By characterizing the RNA editing landscape in chicken, our results highlight the extent of evolutionary conservation of this phenomenon within vertebrates, attest to its tissue and stage specificity and provide support of the absence of non A-to-I events from the chicken transcriptome.

  14. Genome-Wide Characterization of RNA Editing in Chicken Embryos Reveals Common Features among Vertebrates

    PubMed Central

    Frésard, Laure; Leroux, Sophie; Roux, Pierre-François; Klopp, Christophe; Fabre, Stéphane; Esquerré, Diane; Dehais, Patrice; Djari, Anis; Gourichon, David

    2015-01-01

    RNA editing results in a post-transcriptional nucleotide change in the RNA sequence that creates an alternative nucleotide not present in the DNA sequence. This leads to a diversification of transcription products with potential functional consequences. Two nucleotide substitutions are mainly described in animals, from adenosine to inosine (A-to-I) and from cytidine to uridine (C-to-U). This phenomenon is described in more details in mammals, notably since the availability of next generation sequencing technologies allowing whole genome screening of RNA-DNA differences. The number of studies recording RNA editing in other vertebrates like chicken is still limited. We chose to use high throughput sequencing technologies to search for RNA editing in chicken, and to extend the knowledge of its conservation among vertebrates. We performed sequencing of RNA and DNA from 8 embryos. Being aware of common pitfalls inherent to sequence analyses that lead to false positive discovery, we stringently filtered our datasets and found fewer than 40 reliable candidates. Conservation of particular sites of RNA editing was attested by the presence of 3 edited sites previously detected in mammals. We then characterized editing levels for selected candidates in several tissues and at different time points, from 4.5 days of embryonic development to adults, and observed a clear tissue-specificity and a gradual increase of editing level with time. By characterizing the RNA editing landscape in chicken, our results highlight the extent of evolutionary conservation of this phenomenon within vertebrates, attest to its tissue and stage specificity and provide support of the absence of non A-to-I events from the chicken transcriptome. PMID:26024316

  15. Recurrence time statistics: versatile tools for genomic DNA sequence analysis.

    PubMed

    Cao, Yinhe; Tung, Wen-Wen; Gao, J B

    2004-01-01

    With the completion of the human and a few model organisms' genomes, and the genomes of many other organisms waiting to be sequenced, it has become increasingly important to develop faster computational tools which are capable of easily identifying the structures and extracting features from DNA sequences. One of the more important structures in a DNA sequence is repeat-related. Often they have to be masked before protein coding regions along a DNA sequence are to be identified or redundant expressed sequence tags (ESTs) are to be sequenced. Here we report a novel recurrence time based method for sequence analysis. The method can conveniently study all kinds of periodicity and exhaustively find all repeat-related features from a genomic DNA sequence. An efficient codon index is also derived from the recurrence time statistics, which has the salient features of being largely species-independent and working well on very short sequences. Efficient codon indices are key elements of successful gene finding algorithms, and are particularly useful for determining whether a suspected EST belongs to a coding or non-coding region. We illustrate the power of the method by studying the genomes of E. coli, the yeast S. cervisivae, the nematode worm C. elegans, and the human, Homo sapiens. Computationally, our method is very efficient. It allows us to carry out analysis of genomes on the whole genomic scale by a PC.

  16. Exploring 3D Human Action Recognition: from Offline to Online.

    PubMed

    Liu, Zhenyu; Li, Rui; Tan, Jianrong

    2018-02-20

    With the introduction of cost-effective depth sensors, a tremendous amount of research has been devoted to studying human action recognition using 3D motion data. However, most existing methods work in an offline fashion, i.e., they operate on a segmented sequence. There are a few methods specifically designed for online action recognition, which continually predicts action labels as a stream sequence proceeds. In view of this fact, we propose a question: can we draw inspirations and borrow techniques or descriptors from existing offline methods, and then apply these to online action recognition? Note that extending offline techniques or descriptors to online applications is not straightforward, since at least two problems-including real-time performance and sequence segmentation-are usually not considered in offline action recognition. In this paper, we give a positive answer to the question. To develop applicable online action recognition methods, we carefully explore feature extraction, sequence segmentation, computational costs, and classifier selection. The effectiveness of the developed methods is validated on the MSR 3D Online Action dataset and the MSR Daily Activity 3D dataset.

  17. Exploring 3D Human Action Recognition: from Offline to Online

    PubMed Central

    Li, Rui; Liu, Zhenyu; Tan, Jianrong

    2018-01-01

    With the introduction of cost-effective depth sensors, a tremendous amount of research has been devoted to studying human action recognition using 3D motion data. However, most existing methods work in an offline fashion, i.e., they operate on a segmented sequence. There are a few methods specifically designed for online action recognition, which continually predicts action labels as a stream sequence proceeds. In view of this fact, we propose a question: can we draw inspirations and borrow techniques or descriptors from existing offline methods, and then apply these to online action recognition? Note that extending offline techniques or descriptors to online applications is not straightforward, since at least two problems—including real-time performance and sequence segmentation—are usually not considered in offline action recognition. In this paper, we give a positive answer to the question. To develop applicable online action recognition methods, we carefully explore feature extraction, sequence segmentation, computational costs, and classifier selection. The effectiveness of the developed methods is validated on the MSR 3D Online Action dataset and the MSR Daily Activity 3D dataset. PMID:29461502

  18. Sequence, structure and function relationships in flaviviruses as assessed by evolutive aspects of its conserved non-structural protein domains.

    PubMed

    da Fonseca, Néli José; Lima Afonso, Marcelo Querino; Pedersolli, Natan Gonçalves; de Oliveira, Lucas Carrijo; Andrade, Dhiego Souto; Bleicher, Lucas

    2017-10-28

    Flaviviruses are responsible for serious diseases such as dengue, yellow fever, and zika fever. Their genomes encode a polyprotein which, after cleavage, results in three structural and seven non-structural proteins. Homologous proteins can be studied by conservation and coevolution analysis as detected in multiple sequence alignments, usually reporting positions which are strictly necessary for the structure and/or function of all members in a protein family or which are involved in a specific sub-class feature requiring the coevolution of residue sets. This study provides a complete conservation and coevolution analysis on all flaviviruses non-structural proteins, with results mapped on all well-annotated available sequences. A literature review on the residues found in the analysis enabled us to compile available information on their roles and distribution among different flaviviruses. Also, we provide the mapping of conserved and coevolved residues for all sequences currently in SwissProt as a supplementary material, so that particularities in different viruses can be easily analyzed. Copyright © 2017 Elsevier Inc. All rights reserved.

  19. Foldamer hypothesis for the growth and sequence differentiation of prebiotic polymers

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Guseva, Elizaveta; Zuckermann, Ronald N.; Dill, Ken A.

    It is not known how life originated. It is thought that prebiotic processes were able to synthesize short random polymers. However, then, how do short-chain molecules spontaneously grow longer? Also, how would random chains grow more informational and become autocatalytic (i.e., increasing their own concentrations)? We study the folding and binding of random sequences of hydrophobic ( H) and polar ( P) monomers in a computational model. We find that even short hydrophobic polar ( HP) chains can collapse into relatively compact structures, exposing hydrophobic surfaces. In this way, they act as primitive versions of today’s protein catalysts, elongating othermore » such HP polymers as ribosomes would now do. Such foldamer catalysts are shown to form an autocatalytic set, through which short chains grow into longer chains that have particular sequences. An attractive feature of this model is that it does not overconverge to a single solution; it gives ensembles that could further evolve under selection. This mechanism describes how specific sequences and conformations could contribute to the chemistry-to-biology (CTB) transition.« less

  20. Bioinformatic prediction and in vivo validation of residue-residue interactions in human proteins

    NASA Astrophysics Data System (ADS)

    Jordan, Daniel; Davis, Erica; Katsanis, Nicholas; Sunyaev, Shamil

    2014-03-01

    Identifying residue-residue interactions in protein molecules is important for understanding both protein structure and function in the context of evolutionary dynamics and medical genetics. Such interactions can be difficult to predict using existing empirical or physical potentials, especially when residues are far from each other in sequence space. Using a multiple sequence alignment of 46 diverse vertebrate species we explore the space of allowed sequences for orthologous protein families. Amino acid changes that are known to damage protein function allow us to identify specific changes that are likely to have interacting partners. We fit the parameters of the continuous-time Markov process used in the alignment to conclude that these interactions are primarily pairwise, rather than higher order. Candidates for sites under pairwise epistasis are predicted, which can then be tested by experiment. We report the results of an initial round of in vivo experiments in a zebrafish model that verify the presence of multiple pairwise interactions predicted by our model. These experimentally validated interactions are novel, distant in sequence, and are not readily explained by known biochemical or biophysical features.

  1. Evolutionary and biophysical relationships among the papillomavirus E2 proteins.

    PubMed

    Blakaj, Dukagjin M; Fernandez-Fuentes, Narcis; Chen, Zigui; Hegde, Rashmi; Fiser, Andras; Burk, Robert D; Brenowitz, Michael

    2009-01-01

    Infection by human papillomavirus (HPV) may result in clinical conditions ranging from benign warts to invasive cancer. The HPV E2 protein represses oncoprotein transcription and is required for viral replication. HPV E2 binds to palindromic DNA sequences of highly conserved four base pair sequences flanking an identical length variable 'spacer'. E2 proteins directly contact the conserved but not the spacer DNA. Variation in naturally occurring spacer sequences results in differential protein affinity that is dependent on their sensitivity to the spacer DNA's unique conformational and/or dynamic properties. This article explores the biophysical character of this core viral protein with the goal of identifying characteristics that associated with risk of virally caused malignancy. The amino acid sequence, 3d structure and electrostatic features of the E2 protein DNA binding domain are highly conserved; specific interactions with DNA binding sites have also been conserved. In contrast, the E2 protein's transactivation domain does not have extensive surfaces of highly conserved residues. Rather, regions of high conservation are localized to small surface patches. Implications to cancer biology are discussed.

  2. Organization of nif gene cluster in Frankia sp. EuIK1 strain, a symbiont of Elaeagnus umbellata.

    PubMed

    Oh, Chang Jae; Kim, Ho Bang; Kim, Jitae; Kim, Won Jin; Lee, Hyoungseok; An, Chung Sun

    2012-01-01

    The nucleotide sequence of a 20.5-kb genomic region harboring nif genes was determined and analyzed. The fragment was obtained from Frankia sp. EuIK1 strain, an indigenous symbiont of Elaeagnus umbellata. A total of 20 ORFs including 12 nif genes were identified and subjected to comparative analysis with the genome sequences of 3 Frankia strains representing diverse host plant specificities. The nucleotide and deduced amino acid sequences showed highest levels of identity with orthologous genes from an Elaeagnus-infecting strain. The gene organization patterns around the nif gene clusters were well conserved among all 4 Frankia strains. However, characteristic features appeared in the location of the nifV gene for each Frankia strain, depending on the type of host plant. Sequence analysis was performed to determine the transcription units and suggested that there could be an independent operon starting from the nifW gene in the EuIK strain. Considering the organization patterns and their total extensions on the genome, we propose that the nif gene clusters remained stable despite genetic variations occurring in the Frankia genomes.

  3. NEBNext Direct: A Novel, Rapid, Hybridization-Based Approach for the Capture and Library Conversion of Genomic Regions of Interest.

    PubMed

    Emerman, Amy B; Bowman, Sarah K; Barry, Andrew; Henig, Noa; Patel, Kruti M; Gardner, Andrew F; Hendrickson, Cynthia L

    2017-07-05

    Next-generation sequencing (NGS) is a powerful tool for genomic studies, translational research, and clinical diagnostics that enables the detection of single nucleotide polymorphisms, insertions and deletions, copy number variations, and other genetic variations. Target enrichment technologies improve the efficiency of NGS by only sequencing regions of interest, which reduces sequencing costs while increasing coverage of the selected targets. Here we present NEBNext Direct ® , a hybridization-based, target-enrichment approach that addresses many of the shortcomings of traditional target-enrichment methods. This approach features a simple, 7-hr workflow that uses enzymatic removal of off-target sequences to achieve a high specificity for regions of interest. Additionally, unique molecular identifiers are incorporated for the identification and filtering of PCR duplicates. The same protocol can be used across a wide range of input amounts, input types, and panel sizes, enabling NEBNext Direct to be broadly applicable across a wide variety of research and diagnostic needs. © 2017 by John Wiley & Sons, Inc. Copyright © 2017 John Wiley & Sons, Inc.

  4. Analysis of SINE and LINE repeat content of Y chromosomes in the platypus, Ornithorhynchus anatinus.

    PubMed

    Kortschak, R Daniel; Tsend-Ayush, Enkhjargal; Grützner, Frank

    2009-01-01

    Monotremes feature an extraordinary sex-chromosome system that consists of five X and five Y chromosomes in males. These sex chromosomes share homology with bird sex chromosomes but no homology with the therian X. The genome of a female platypus was recently completed, providing unique insights into sequence and gene content of autosomes and X chromosomes, but no Y-specific sequence has so far been analysed. Here we report the isolation, sequencing and analysis of approximately 700 kb of sequence of the non-recombining regions of Y2, Y3 and Y5, which revealed differences in base composition and repeat content between autosomes and sex chromosomes, and within the sex chromosomes themselves. This provides the first insights into repeat content of Y chromosomes in platypus, which overall show similar patterns of repeat composition to Y chromosomes in other species. Interestingly, we also observed differences between the various Y chromosomes, and in combination with timing and activity patterns we provide an approach that can be used to examine the evolutionary history of the platypus sex-chromosome chain.

  5. Foldamer hypothesis for the growth and sequence differentiation of prebiotic polymers

    DOE PAGES

    Guseva, Elizaveta; Zuckermann, Ronald N.; Dill, Ken A.

    2017-08-22

    It is not known how life originated. It is thought that prebiotic processes were able to synthesize short random polymers. However, then, how do short-chain molecules spontaneously grow longer? Also, how would random chains grow more informational and become autocatalytic (i.e., increasing their own concentrations)? We study the folding and binding of random sequences of hydrophobic ( H) and polar ( P) monomers in a computational model. We find that even short hydrophobic polar ( HP) chains can collapse into relatively compact structures, exposing hydrophobic surfaces. In this way, they act as primitive versions of today’s protein catalysts, elongating othermore » such HP polymers as ribosomes would now do. Such foldamer catalysts are shown to form an autocatalytic set, through which short chains grow into longer chains that have particular sequences. An attractive feature of this model is that it does not overconverge to a single solution; it gives ensembles that could further evolve under selection. This mechanism describes how specific sequences and conformations could contribute to the chemistry-to-biology (CTB) transition.« less

  6. Foldamer hypothesis for the growth and sequence differentiation of prebiotic polymers

    PubMed Central

    Guseva, Elizaveta; Zuckermann, Ronald N.; Dill, Ken A.

    2017-01-01

    It is not known how life originated. It is thought that prebiotic processes were able to synthesize short random polymers. However, then, how do short-chain molecules spontaneously grow longer? Also, how would random chains grow more informational and become autocatalytic (i.e., increasing their own concentrations)? We study the folding and binding of random sequences of hydrophobic (H) and polar (P) monomers in a computational model. We find that even short hydrophobic polar (HP) chains can collapse into relatively compact structures, exposing hydrophobic surfaces. In this way, they act as primitive versions of today’s protein catalysts, elongating other such HP polymers as ribosomes would now do. Such foldamer catalysts are shown to form an autocatalytic set, through which short chains grow into longer chains that have particular sequences. An attractive feature of this model is that it does not overconverge to a single solution; it gives ensembles that could further evolve under selection. This mechanism describes how specific sequences and conformations could contribute to the chemistry-to-biology (CTB) transition. PMID:28831002

  7. Time-Elastic Generative Model for Acceleration Time Series in Human Activity Recognition

    PubMed Central

    Munoz-Organero, Mario; Ruiz-Blazquez, Ramona

    2017-01-01

    Body-worn sensors in general and accelerometers in particular have been widely used in order to detect human movements and activities. The execution of each type of movement by each particular individual generates sequences of time series of sensed data from which specific movement related patterns can be assessed. Several machine learning algorithms have been used over windowed segments of sensed data in order to detect such patterns in activity recognition based on intermediate features (either hand-crafted or automatically learned from data). The underlying assumption is that the computed features will capture statistical differences that can properly classify different movements and activities after a training phase based on sensed data. In order to achieve high accuracy and recall rates (and guarantee the generalization of the system to new users), the training data have to contain enough information to characterize all possible ways of executing the activity or movement to be detected. This could imply large amounts of data and a complex and time-consuming training phase, which has been shown to be even more relevant when automatically learning the optimal features to be used. In this paper, we present a novel generative model that is able to generate sequences of time series for characterizing a particular movement based on the time elasticity properties of the sensed data. The model is used to train a stack of auto-encoders in order to learn the particular features able to detect human movements. The results of movement detection using a newly generated database with information on five users performing six different movements are presented. The generalization of results using an existing database is also presented in the paper. The results show that the proposed mechanism is able to obtain acceptable recognition rates (F = 0.77) even in the case of using different people executing a different sequence of movements and using different hardware. PMID:28208736

  8. Time-Elastic Generative Model for Acceleration Time Series in Human Activity Recognition.

    PubMed

    Munoz-Organero, Mario; Ruiz-Blazquez, Ramona

    2017-02-08

    Body-worn sensors in general and accelerometers in particular have been widely used in order to detect human movements and activities. The execution of each type of movement by each particular individual generates sequences of time series of sensed data from which specific movement related patterns can be assessed. Several machine learning algorithms have been used over windowed segments of sensed data in order to detect such patterns in activity recognition based on intermediate features (either hand-crafted or automatically learned from data). The underlying assumption is that the computed features will capture statistical differences that can properly classify different movements and activities after a training phase based on sensed data. In order to achieve high accuracy and recall rates (and guarantee the generalization of the system to new users), the training data have to contain enough information to characterize all possible ways of executing the activity or movement to be detected. This could imply large amounts of data and a complex and time-consuming training phase, which has been shown to be even more relevant when automatically learning the optimal features to be used. In this paper, we present a novel generative model that is able to generate sequences of time series for characterizing a particular movement based on the time elasticity properties of the sensed data. The model is used to train a stack of auto-encoders in order to learn the particular features able to detect human movements. The results of movement detection using a newly generated database with information on five users performing six different movements are presented. The generalization of results using an existing database is also presented in the paper. The results show that the proposed mechanism is able to obtain acceptable recognition rates ( F = 0.77) even in the case of using different people executing a different sequence of movements and using different hardware.

  9. SigWin-detector: a Grid-enabled workflow for discovering enriched windows of genomic features related to DNA sequences.

    PubMed

    Inda, Márcia A; van Batenburg, Marinus F; Roos, Marco; Belloum, Adam S Z; Vasunin, Dmitry; Wibisono, Adianto; van Kampen, Antoine H C; Breit, Timo M

    2008-08-08

    Chromosome location is often used as a scaffold to organize genomic information in both the living cell and molecular biological research. Thus, ever-increasing amounts of data about genomic features are stored in public databases and can be readily visualized by genome browsers. To perform in silico experimentation conveniently with this genomics data, biologists need tools to process and compare datasets routinely and explore the obtained results interactively. The complexity of such experimentation requires these tools to be based on an e-Science approach, hence generic, modular, and reusable. A virtual laboratory environment with workflows, workflow management systems, and Grid computation are therefore essential. Here we apply an e-Science approach to develop SigWin-detector, a workflow-based tool that can detect significantly enriched windows of (genomic) features in a (DNA) sequence in a fast and reproducible way. For proof-of-principle, we utilize a biological use case to detect regions of increased and decreased gene expression (RIDGEs and anti-RIDGEs) in human transcriptome maps. We improved the original method for RIDGE detection by replacing the costly step of estimation by random sampling with a faster analytical formula for computing the distribution of the null hypothesis being tested and by developing a new algorithm for computing moving medians. SigWin-detector was developed using the WS-VLAM workflow management system and consists of several reusable modules that are linked together in a basic workflow. The configuration of this basic workflow can be adapted to satisfy the requirements of the specific in silico experiment. As we show with the results from analyses in the biological use case on RIDGEs, SigWin-detector is an efficient and reusable Grid-based tool for discovering windows enriched for features of a particular type in any sequence of values. Thus, SigWin-detector provides the proof-of-principle for the modular e-Science based concept of integrative bioinformatics experimentation.

  10. Predicting Protein-Protein Interactions by Combing Various Sequence-Derived.

    PubMed

    Zhao, Xiao-Wei; Ma, Zhi-Qiang; Yin, Ming-Hao

    2011-09-20

    Knowledge of protein-protein interactions (PPIs) plays an important role in constructing protein interaction networks and understanding the general machineries of biological systems. In this study, a new method is proposed to predict PPIs using a comprehensive set of 930 features based only on sequence information, these features measure the interactions between residues a certain distant apart in the protein sequences from different aspects. To achieve better performance, the principal component analysis (PCA) is first employed to obtain an optimized feature subset. Then, the resulting 67-dimensional feature vectors are fed to Support Vector Machine (SVM). Experimental results on Drosophila melanogaster and Helicobater pylori datasets show that our method is very promising to predict PPIs and may at least be a useful supplement tool to existing methods.

  11. Synthetic Spike-in Standards Improve Run-Specific Systematic Error Analysis for DNA and RNA Sequencing

    PubMed Central

    Zook, Justin M.; Samarov, Daniel; McDaniel, Jennifer; Sen, Shurjo K.; Salit, Marc

    2012-01-01

    While the importance of random sequencing errors decreases at higher DNA or RNA sequencing depths, systematic sequencing errors (SSEs) dominate at high sequencing depths and can be difficult to distinguish from biological variants. These SSEs can cause base quality scores to underestimate the probability of error at certain genomic positions, resulting in false positive variant calls, particularly in mixtures such as samples with RNA editing, tumors, circulating tumor cells, bacteria, mitochondrial heteroplasmy, or pooled DNA. Most algorithms proposed for correction of SSEs require a data set used to calculate association of SSEs with various features in the reads and sequence context. This data set is typically either from a part of the data set being “recalibrated” (Genome Analysis ToolKit, or GATK) or from a separate data set with special characteristics (SysCall). Here, we combine the advantages of these approaches by adding synthetic RNA spike-in standards to human RNA, and use GATK to recalibrate base quality scores with reads mapped to the spike-in standards. Compared to conventional GATK recalibration that uses reads mapped to the genome, spike-ins improve the accuracy of Illumina base quality scores by a mean of 5 Phred-scaled quality score units, and by as much as 13 units at CpG sites. In addition, since the spike-in data used for recalibration are independent of the genome being sequenced, our method allows run-specific recalibration even for the many species without a comprehensive and accurate SNP database. We also use GATK with the spike-in standards to demonstrate that the Illumina RNA sequencing runs overestimate quality scores for AC, CC, GC, GG, and TC dinucleotides, while SOLiD has less dinucleotide SSEs but more SSEs for certain cycles. We conclude that using these DNA and RNA spike-in standards with GATK improves base quality score recalibration. PMID:22859977

  12. PCR Primers to Study the Diversity of Expressed Fungal Genes Encoding Lignocellulolytic Enzymes in Soils Using High-Throughput Sequencing

    PubMed Central

    Barbi, Florian; Bragalini, Claudia; Vallon, Laurent; Prudent, Elsa; Dubost, Audrey; Fraissinet-Tachet, Laurence; Marmeisse, Roland; Luis, Patricia

    2014-01-01

    Plant biomass degradation in soil is one of the key steps of carbon cycling in terrestrial ecosystems. Fungal saprotrophic communities play an essential role in this process by producing hydrolytic enzymes active on the main components of plant organic matter. Open questions in this field regard the diversity of the species involved, the major biochemical pathways implicated and how these are affected by external factors such as litter quality or climate changes. This can be tackled by environmental genomic approaches involving the systematic sequencing of key enzyme-coding gene families using soil-extracted RNA as material. Such an approach necessitates the design and evaluation of gene family-specific PCR primers producing sequence fragments compatible with high-throughput sequencing approaches. In the present study, we developed and evaluated PCR primers for the specific amplification of fungal CAZy Glycoside Hydrolase gene families GH5 (subfamily 5) and GH11 encoding endo-β-1,4-glucanases and endo-β-1,4-xylanases respectively as well as Basidiomycota class II peroxidases, corresponding to the CAZy Auxiliary Activity family 2 (AA2), active on lignin. These primers were experimentally validated using DNA extracted from a wide range of Ascomycota and Basidiomycota species including 27 with sequenced genomes. Along with the published primers for Glycoside Hydrolase GH7 encoding enzymes active on cellulose, the newly design primers were shown to be compatible with the Illumina MiSeq sequencing technology. Sequences obtained from RNA extracted from beech or spruce forest soils showed a high diversity and were uniformly distributed in gene trees featuring the global diversity of these gene families. This high-throughput sequencing approach using several degenerate primers constitutes a robust method, which allows the simultaneous characterization of the diversity of different fungal transcripts involved in plant organic matter degradation and may lead to the discovery of complex patterns in gene expression of soil fungal communities. PMID:25545363

  13. Transferring the Characteristics of Naturally Occurring and Biased Antibody Repertoires to Human Antibody Libraries by Trapping CDRH3 Sequences

    PubMed Central

    Venet, Sophie; Ravn, Ulla; Buatois, Vanessa; Gueneau, Franck; Calloud, Sébastien; Kosco-Vilbois, Marie; Fischer, Nicolas

    2012-01-01

    Antibody repertoires are characterized by diversity as they vary not only amongst individuals and post antigen exposure but also differ significantly between vertebrate species. Such plasticity can be exploited to generate human antibody libraries featuring hallmarks of these diverse repertoires. In this study, the focus was to capture CDRH3 sequences, as this region generally accounts for most of the interaction energy with antigen. Sequences from human as well as non-human sources were successfully integrated into human antibody libraries. Next generation sequencing of these libraries proved that the CDRH3 lengths and amino acid composition corresponded to the species of origin. Specific CDRH3 sequences, biased towards the recognition of a model antigen either by immunizing mice or by selecting with phage display, were then integrated into another set of libraries. From these antigen biased libraries, highly potent antibodies were more frequently isolated, indicating that the characteristics of an immune repertoire is transferrable via CDRH3 sequences into a human antibody library. Taken together, these data demonstrate that the properties of naturally or experimentally biased repertoires can be effectively harnessed for the generation of targeted human antibody libraries, substantially increasing the probability of isolating antibodies suitable for therapeutic and diagnostic applications. PMID:22937053

  14. Repetitive sequences and epigenetic modification: inseparable partners play important roles in the evolution of plant sex chromosomes.

    PubMed

    Li, Shu-Fen; Zhang, Guo-Jun; Yuan, Jin-Hong; Deng, Chuan-Liang; Gao, Wu-Jun

    2016-05-01

    The present review discusses the roles of repetitive sequences played in plant sex chromosome evolution, and highlights epigenetic modification as potential mechanism of repetitive sequences involved in sex chromosome evolution. Sex determination in plants is mostly based on sex chromosomes. Classic theory proposes that sex chromosomes evolve from a specific pair of autosomes with emergence of a sex-determining gene(s). Subsequently, the newly formed sex chromosomes stop recombination in a small region around the sex-determining locus, and over time, the non-recombining region expands to almost all parts of the sex chromosomes. Accumulation of repetitive sequences, mostly transposable elements and tandem repeats, is a conspicuous feature of the non-recombining region of the Y chromosome, even in primitive one. Repetitive sequences may play multiple roles in sex chromosome evolution, such as triggering heterochromatization and causing recombination suppression, leading to structural and morphological differentiation of sex chromosomes, and promoting Y chromosome degeneration and X chromosome dosage compensation. In this article, we review the current status of this field, and based on preliminary evidence, we posit that repetitive sequences are involved in sex chromosome evolution probably via epigenetic modification, such as DNA and histone methylation, with small interfering RNAs as the mediator.

  15. Sequence Based Prediction of Antioxidant Proteins Using a Classifier Selection Strategy

    PubMed Central

    Zhang, Lina; Zhang, Chengjin; Gao, Rui; Yang, Runtao; Song, Qing

    2016-01-01

    Antioxidant proteins perform significant functions in maintaining oxidation/antioxidation balance and have potential therapies for some diseases. Accurate identification of antioxidant proteins could contribute to revealing physiological processes of oxidation/antioxidation balance and developing novel antioxidation-based drugs. In this study, an ensemble method is presented to predict antioxidant proteins with hybrid features, incorporating SSI (Secondary Structure Information), PSSM (Position Specific Scoring Matrix), RSA (Relative Solvent Accessibility), and CTD (Composition, Transition, Distribution). The prediction results of the ensemble predictor are determined by an average of prediction results of multiple base classifiers. Based on a classifier selection strategy, we obtain an optimal ensemble classifier composed of RF (Random Forest), SMO (Sequential Minimal Optimization), NNA (Nearest Neighbor Algorithm), and J48 with an accuracy of 0.925. A Relief combined with IFS (Incremental Feature Selection) method is adopted to obtain optimal features from hybrid features. With the optimal features, the ensemble method achieves improved performance with a sensitivity of 0.95, a specificity of 0.93, an accuracy of 0.94, and an MCC (Matthew’s Correlation Coefficient) of 0.880, far better than the existing method. To evaluate the prediction performance objectively, the proposed method is compared with existing methods on the same independent testing dataset. Encouragingly, our method performs better than previous studies. In addition, our method achieves more balanced performance with a sensitivity of 0.878 and a specificity of 0.860. These results suggest that the proposed ensemble method can be a potential candidate for antioxidant protein prediction. For public access, we develop a user-friendly web server for antioxidant protein identification that is freely accessible at http://antioxidant.weka.cc. PMID:27662651

  16. Distinct phenotype clusters in childhood inflammatory brain diseases: implications for diagnostic evaluation.

    PubMed

    Cellucci, Tania; Tyrrell, Pascal N; Twilt, Marinka; Sheikh, Shehla; Benseler, Susanne M

    2014-03-01

    To identify distinct clusters of children with inflammatory brain diseases based on clinical, laboratory, and imaging features at presentation, to assess which features contribute strongly to the development of clusters, and to compare additional features between the identified clusters. A single-center cohort study was performed with children who had been diagnosed as having an inflammatory brain disease between June 1, 1989 and December 31, 2010. Demographic, clinical, laboratory, neuroimaging, and histologic data at diagnosis were collected. K-means cluster analysis was performed to identify clusters of patients based on their presenting features. Associations between the clusters and patient variables, such as diagnoses, were determined. A total of 147 children (50% female; median age 8.8 years) were identified: 105 with primary central nervous system (CNS) vasculitis, 11 with secondary CNS vasculitis, 8 with neuronal antibody syndromes, 6 with postinfectious syndromes, and 17 with other inflammatory brain diseases. Three distinct clusters were identified. Paresis and speech deficits were the most common presenting features in cluster 1. Children in cluster 2 were likely to present with behavior changes, cognitive dysfunction, and seizures, while those in cluster 3 experienced ataxia, vision abnormalities, and seizures. Lesions seen on T2/fluid-attenuated inversion recovery sequences of magnetic resonance imaging were common in all clusters, but unilateral ischemic lesions were more prominent in cluster 1. The clusters were associated with specific diagnoses and diagnostic test results. Children with inflammatory brain diseases presented with distinct phenotypical patterns that are associated with specific diagnoses. This information may inform the development of a diagnostic classification of childhood inflammatory brain diseases and suggest that specific pathways of diagnostic evaluation are warranted. Copyright © 2014 by the American College of Rheumatology.

  17. A no-reference bitstream-based perceptual model for video quality estimation of videos affected by coding artifacts and packet losses

    NASA Astrophysics Data System (ADS)

    Pandremmenou, K.; Shahid, M.; Kondi, L. P.; Lövström, B.

    2015-03-01

    In this work, we propose a No-Reference (NR) bitstream-based model for predicting the quality of H.264/AVC video sequences, affected by both compression artifacts and transmission impairments. The proposed model is based on a feature extraction procedure, where a large number of features are calculated from the packet-loss impaired bitstream. Many of the features are firstly proposed in this work, and the specific set of the features as a whole is applied for the first time for making NR video quality predictions. All feature observations are taken as input to the Least Absolute Shrinkage and Selection Operator (LASSO) regression method. LASSO indicates the most important features, and using only them, it is possible to estimate the Mean Opinion Score (MOS) with high accuracy. Indicatively, we point out that only 13 features are able to produce a Pearson Correlation Coefficient of 0.92 with the MOS. Interestingly, the performance statistics we computed in order to assess our method for predicting the Structural Similarity Index and the Video Quality Metric are equally good. Thus, the obtained experimental results verified the suitability of the features selected by LASSO as well as the ability of LASSO in making accurate predictions through sparse modeling.

  18. Oligo kernels for datamining on biological sequences: a case study on prokaryotic translation initiation sites

    PubMed Central

    Meinicke, Peter; Tech, Maike; Morgenstern, Burkhard; Merkl, Rainer

    2004-01-01

    Background Kernel-based learning algorithms are among the most advanced machine learning methods and have been successfully applied to a variety of sequence classification tasks within the field of bioinformatics. Conventional kernels utilized so far do not provide an easy interpretation of the learnt representations in terms of positional and compositional variability of the underlying biological signals. Results We propose a kernel-based approach to datamining on biological sequences. With our method it is possible to model and analyze positional variability of oligomers of any length in a natural way. On one hand this is achieved by mapping the sequences to an intuitive but high-dimensional feature space, well-suited for interpretation of the learnt models. On the other hand, by means of the kernel trick we can provide a general learning algorithm for that high-dimensional representation because all required statistics can be computed without performing an explicit feature space mapping of the sequences. By introducing a kernel parameter that controls the degree of position-dependency, our feature space representation can be tailored to the characteristics of the biological problem at hand. A regularized learning scheme enables application even to biological problems for which only small sets of example sequences are available. Our approach includes a visualization method for transparent representation of characteristic sequence features. Thereby importance of features can be measured in terms of discriminative strength with respect to classification of the underlying sequences. To demonstrate and validate our concept on a biochemically well-defined case, we analyze E. coli translation initiation sites in order to show that we can find biologically relevant signals. For that case, our results clearly show that the Shine-Dalgarno sequence is the most important signal upstream a start codon. The variability in position and composition we found for that signal is in accordance with previous biological knowledge. We also find evidence for signals downstream of the start codon, previously introduced as transcriptional enhancers. These signals are mainly characterized by occurrences of adenine in a region of about 4 nucleotides next to the start codon. Conclusions We showed that the oligo kernel can provide a valuable tool for the analysis of relevant signals in biological sequences. In the case of translation initiation sites we could clearly deduce the most discriminative motifs and their positional variation from example sequences. Attractive features of our approach are its flexibility with respect to oligomer length and position conservation. By means of these two parameters oligo kernels can easily be adapted to different biological problems. PMID:15511290

  19. BioNano genome mapping of individual chromosomes supports physical mapping and sequence assembly in complex plant genomes.

    PubMed

    Staňková, Helena; Hastie, Alex R; Chan, Saki; Vrána, Jan; Tulpová, Zuzana; Kubaláková, Marie; Visendi, Paul; Hayashi, Satomi; Luo, Mingcheng; Batley, Jacqueline; Edwards, David; Doležel, Jaroslav; Šimková, Hana

    2016-07-01

    The assembly of a reference genome sequence of bread wheat is challenging due to its specific features such as the genome size of 17 Gbp, polyploid nature and prevalence of repetitive sequences. BAC-by-BAC sequencing based on chromosomal physical maps, adopted by the International Wheat Genome Sequencing Consortium as the key strategy, reduces problems caused by the genome complexity and polyploidy, but the repeat content still hampers the sequence assembly. Availability of a high-resolution genomic map to guide sequence scaffolding and validate physical map and sequence assemblies would be highly beneficial to obtaining an accurate and complete genome sequence. Here, we chose the short arm of chromosome 7D (7DS) as a model to demonstrate for the first time that it is possible to couple chromosome flow sorting with genome mapping in nanochannel arrays and create a de novo genome map of a wheat chromosome. We constructed a high-resolution chromosome map composed of 371 contigs with an N50 of 1.3 Mb. Long DNA molecules achieved by our approach facilitated chromosome-scale analysis of repetitive sequences and revealed a ~800-kb array of tandem repeats intractable to current DNA sequencing technologies. Anchoring 7DS sequence assemblies obtained by clone-by-clone sequencing to the 7DS genome map provided a valuable tool to improve the BAC-contig physical map and validate sequence assembly on a chromosome-arm scale. Our results indicate that creating genome maps for the whole wheat genome in a chromosome-by-chromosome manner is feasible and that they will be an affordable tool to support the production of improved pseudomolecules. © 2016 The Authors. Plant Biotechnology Journal published by Society for Experimental Biology and The Association of Applied Biologists and John Wiley & Sons Ltd.

  20. Processing Translational Motion Sequences.

    DTIC Science & Technology

    1982-10-01

    the initial ROADSIGN image using a (del)**2g mask with a width of 5 pixels The distinctiveness values were computed using features which were 5x5 pixel...the initial step size of the local search quite large. 34 4. EX P R g NTg The following experiments were performed using the roadsign and industrial...the initial image of the sequence. The third experiment involves processing the roadsign image sequence using the features extracted at the positions

  1. De novo centromere formation on a chromosome fragment in maize.

    PubMed

    Fu, Shulan; Lv, Zhenling; Gao, Zhi; Wu, Huajun; Pang, Junling; Zhang, Bing; Dong, Qianhua; Guo, Xiang; Wang, Xiu-Jie; Birchler, James A; Han, Fangpu

    2013-04-09

    The centromere is the part of the chromosome that organizes the kinetochore, which mediates chromosome movement during mitosis and meiosis. A small fragment from chromosome 3, named Duplication 3a (Dp3a), was described from UV-irradiated materials by Stadler and Roman in the 1940s [Stadler LJ, Roman H (1948) Genetics 33(3):273-303]. The genetic behavior of Dp3a is reminiscent of a ring chromosome, but fluoresecent in situ hybridization detected telomeres at both ends, suggesting a linear structure. This small chromosome has no detectable canonical centromeric sequences, but contains a site with protein features of functional centromeres such as CENH3, the centromere specific H3 histone variant, and CENP-C, a foundational kinetochore protein, suggesting the de novo formation of a centromere on the chromatin fragment. To examine the sequences associated with CENH3, chromatin immunoprecipitation was carried out with anti-CENH3 antibodies using material from young seedlings with and without the Dp3a chromosome. A novel peak was detected from the ChIP-Sequencing reads of the Dp3a sample. The peak spanned 350 kb within the long arm of chromosome 3 covering 22 genes. Collectively, these results define the behavior and molecular features of de novo centromere formation in the Dp3a chromosome, which may shed light on the initiation of new centromere sites during evolution.

  2. Sequencing of the Litchi Downy Blight Pathogen Reveals It Is a Phytophthora Species With Downy Mildew-Like Characteristics.

    PubMed

    Ye, Wenwu; Wang, Yang; Shen, Danyu; Li, Delong; Pu, Tianhuizi; Jiang, Zide; Zhang, Zhengguang; Zheng, Xiaobo; Tyler, Brett M; Wang, Yuanchao

    2016-07-01

    On the basis of its downy mildew-like morphology, the litchi downy blight pathogen was previously named Peronophythora litchii. Recently, however, it was proposed to transfer this pathogen to Phytophthora clade 4. To better characterize this unusual oomycete species and important fruit pathogen, we obtained the genome sequence of Phytophthora litchii and compared it to those from other oomycete species. P. litchii has a small genome with tightly spaced genes. On the basis of a multilocus phylogenetic analysis, the placement of P. litchii in the genus Phytophthora is strongly supported. Effector proteins predicted included 245 RxLR, 30 necrosis-and-ethylene-inducing protein-like, and 14 crinkler proteins. The typical motifs, phylogenies, and activities of these effectors were typical for a Phytophthora species. However, like the genome features of the analyzed downy mildews, P. litchii exhibited a streamlined genome with a relatively small number of genes in both core and species-specific protein families. The low GC content and slight codon preferences of P. litchii sequences were similar to those of the analyzed downy mildews and a subset of Phytophthora species. Taken together, these observations suggest that P. litchii is a Phytophthora pathogen that is in the process of acquiring downy mildew-like genomic and morphological features. Thus P. litchii may provide a novel model for investigating morphological development and genomic adaptation in oomycete pathogens.

  3. Toward a model for lexical access based on acoustic landmarks and distinctive features

    NASA Astrophysics Data System (ADS)

    Stevens, Kenneth N.

    2002-04-01

    This article describes a model in which the acoustic speech signal is processed to yield a discrete representation of the speech stream in terms of a sequence of segments, each of which is described by a set (or bundle) of binary distinctive features. These distinctive features specify the phonemic contrasts that are used in the language, such that a change in the value of a feature can potentially generate a new word. This model is a part of a more general model that derives a word sequence from this feature representation, the words being represented in a lexicon by sequences of feature bundles. The processing of the signal proceeds in three steps: (1) Detection of peaks, valleys, and discontinuities in particular frequency ranges of the signal leads to identification of acoustic landmarks. The type of landmark provides evidence for a subset of distinctive features called articulator-free features (e.g., [vowel], [consonant], [continuant]). (2) Acoustic parameters are derived from the signal near the landmarks to provide evidence for the actions of particular articulators, and acoustic cues are extracted by sampling selected attributes of these parameters in these regions. The selection of cues that are extracted depends on the type of landmark and on the environment in which it occurs. (3) The cues obtained in step (2) are combined, taking context into account, to provide estimates of ``articulator-bound'' features associated with each landmark (e.g., [lips], [high], [nasal]). These articulator-bound features, combined with the articulator-free features in (1), constitute the sequence of feature bundles that forms the output of the model. Examples of cues that are used, and justification for this selection, are given, as well as examples of the process of inferring the underlying features for a segment when there is variability in the signal due to enhancement gestures (recruited by a speaker to make a contrast more salient) or due to overlap of gestures from neighboring segments.

  4. The DNA Methylome of Human Peripheral Blood Mononuclear Cells

    PubMed Central

    Ye, Mingzhi; Zheng, Hancheng; Yu, Jian; Wu, Honglong; Sun, Jihua; Zhang, Hongyu; Chen, Quan; Luo, Ruibang; Chen, Minfeng; He, Yinghua; Jin, Xin; Zhang, Qinghui; Yu, Chang; Zhou, Guangyu; Sun, Jinfeng; Huang, Yebo; Zheng, Huisong; Cao, Hongzhi; Zhou, Xiaoyu; Guo, Shicheng; Hu, Xueda; Li, Xin; Kristiansen, Karsten; Bolund, Lars; Xu, Jiujin; Wang, Wen; Yang, Huanming; Wang, Jian; Li, Ruiqiang; Beck, Stephan; Wang, Jun; Zhang, Xiuqing

    2010-01-01

    DNA methylation plays an important role in biological processes in human health and disease. Recent technological advances allow unbiased whole-genome DNA methylation (methylome) analysis to be carried out on human cells. Using whole-genome bisulfite sequencing at 24.7-fold coverage (12.3-fold per strand), we report a comprehensive (92.62%) methylome and analysis of the unique sequences in human peripheral blood mononuclear cells (PBMC) from the same Asian individual whose genome was deciphered in the YH project. PBMC constitute an important source for clinical blood tests world-wide. We found that 68.4% of CpG sites and <0.2% of non-CpG sites were methylated, demonstrating that non-CpG cytosine methylation is minor in human PBMC. Analysis of the PBMC methylome revealed a rich epigenomic landscape for 20 distinct genomic features, including regulatory, protein-coding, non-coding, RNA-coding, and repeat sequences. Integration of our methylome data with the YH genome sequence enabled a first comprehensive assessment of allele-specific methylation (ASM) between the two haploid methylomes of any individual and allowed the identification of 599 haploid differentially methylated regions (hDMRs) covering 287 genes. Of these, 76 genes had hDMRs within 2 kb of their transcriptional start sites of which >80% displayed allele-specific expression (ASE). These data demonstrate that ASM is a recurrent phenomenon and is highly correlated with ASE in human PBMCs. Together with recently reported similar studies, our study provides a comprehensive resource for future epigenomic research and confirms new sequencing technology as a paradigm for large-scale epigenomics studies. PMID:21085693

  5. Enhanced Methods for Local Ancestry Assignment in Sequenced Admixed Individuals

    PubMed Central

    Brown, Robert; Pasaniuc, Bogdan

    2014-01-01

    Inferring the ancestry at each locus in the genome of recently admixed individuals (e.g., Latino Americans) plays a major role in medical and population genetic inferences, ranging from finding disease-risk loci, to inferring recombination rates, to mapping missing contigs in the human genome. Although many methods for local ancestry inference have been proposed, most are designed for use with genotyping arrays and fail to make use of the full spectrum of data available from sequencing. In addition, current haplotype-based approaches are very computationally demanding, requiring large computational time for moderately large sample sizes. Here we present new methods for local ancestry inference that leverage continent-specific variants (CSVs) to attain increased performance over existing approaches in sequenced admixed genomes. A key feature of our approach is that it incorporates the admixed genomes themselves jointly with public datasets, such as 1000 Genomes, to improve the accuracy of CSV calling. We use simulations to show that our approach attains accuracy similar to widely used computationally intensive haplotype-based approaches with large decreases in runtime. Most importantly, we show that our method recovers comparable local ancestries, as the 1000 Genomes consensus local ancestry calls in the real admixed individuals from the 1000 Genomes Project. We extend our approach to account for low-coverage sequencing and show that accurate local ancestry inference can be attained at low sequencing coverage. Finally, we generalize CSVs to sub-continental population-specific variants (sCSVs) and show that in some cases it is possible to determine the sub-continental ancestry for short chromosomal segments on the basis of sCSVs. PMID:24743331

  6. Prediction of residue-residue contact matrix for protein-protein interaction with Fisher score features and deep learning.

    PubMed

    Du, Tianchuan; Liao, Li; Wu, Cathy H; Sun, Bilin

    2016-11-01

    Protein-protein interactions play essential roles in many biological processes. Acquiring knowledge of the residue-residue contact information of two interacting proteins is not only helpful in annotating functions for proteins, but also critical for structure-based drug design. The prediction of the protein residue-residue contact matrix of the interfacial regions is challenging. In this work, we introduced deep learning techniques (specifically, stacked autoencoders) to build deep neural network models to tackled the residue-residue contact prediction problem. In tandem with interaction profile Hidden Markov Models, which was used first to extract Fisher score features from protein sequences, stacked autoencoders were deployed to extract and learn hidden abstract features. The deep learning model showed significant improvement over the traditional machine learning model, Support Vector Machines (SVM), with the overall accuracy increased by 15% from 65.40% to 80.82%. We showed that the stacked autoencoders could extract novel features, which can be utilized by deep neural networks and other classifiers to enhance learning, out of the Fisher score features. It is further shown that deep neural networks have significant advantages over SVM in making use of the newly extracted features. Copyright © 2016. Published by Elsevier Inc.

  7. The 3of5 web application for complex and comprehensive pattern matching in protein sequences.

    PubMed

    Seiler, Markus; Mehrle, Alexander; Poustka, Annemarie; Wiemann, Stefan

    2006-03-16

    The identification of patterns in biological sequences is a key challenge in genome analysis and in proteomics. Frequently such patterns are complex and highly variable, especially in protein sequences. They are frequently described using terms of regular expressions (RegEx) because of the user-friendly terminology. Limitations arise for queries with the increasing complexity of patterns and are accompanied by requirements for enhanced capabilities. This is especially true for patterns containing ambiguous characters and positions and/or length ambiguities. We have implemented the 3of5 web application in order to enable complex pattern matching in protein sequences. 3of5 is named after a special use of its main feature, the novel n-of-m pattern type. This feature allows for an extensive specification of variable patterns where the individual elements may vary in their position, order, and content within a defined stretch of sequence. The number of distinct elements can be constrained by operators, and individual characters may be excluded. The n-of-m pattern type can be combined with common regular expression terms and thus also allows for a comprehensive description of complex patterns. 3of5 increases the fidelity of pattern matching and finds ALL possible solutions in protein sequences in cases of length-ambiguous patterns instead of simply reporting the longest or shortest hits. Grouping and combined search for patterns provides a hierarchical arrangement of larger patterns sets. The algorithm is implemented as internet application and freely accessible. The application is available at http://dkfz.de/mga2/3of5/3of5.html. The 3of5 application offers an extended vocabulary for the definition of search patterns and thus allows the user to comprehensively specify and identify peptide patterns with variable elements. The n-of-m pattern type offers an improved accuracy for pattern matching in combination with the ability to find all solutions, without compromising the user friendliness of regular expression terms.

  8. Complex multi-enhancer contacts captured by genome architecture mapping.

    PubMed

    Beagrie, Robert A; Scialdone, Antonio; Schueler, Markus; Kraemer, Dorothee C A; Chotalia, Mita; Xie, Sheila Q; Barbieri, Mariano; de Santiago, Inês; Lavitas, Liron-Mark; Branco, Miguel R; Fraser, James; Dostie, Josée; Game, Laurence; Dillon, Niall; Edwards, Paul A W; Nicodemi, Mario; Pombo, Ana

    2017-03-23

    The organization of the genome in the nucleus and the interactions of genes with their regulatory elements are key features of transcriptional control and their disruption can cause disease. Here we report a genome-wide method, genome architecture mapping (GAM), for measuring chromatin contacts and other features of three-dimensional chromatin topology on the basis of sequencing DNA from a large collection of thin nuclear sections. We apply GAM to mouse embryonic stem cells and identify enrichment for specific interactions between active genes and enhancers across very large genomic distances using a mathematical model termed SLICE (statistical inference of co-segregation). GAM also reveals an abundance of three-way contacts across the genome, especially between regions that are highly transcribed or contain super-enhancers, providing a level of insight into genome architecture that, owing to the technical limitations of current technologies, has previously remained unattainable. Furthermore, GAM highlights a role for gene-expression-specific contacts in organizing the genome in mammalian nuclei.

  9. Song copying by humpback whales: themes and variations.

    PubMed

    Mercado, Eduardo; Herman, Louis M; Pack, Adam A

    2005-04-01

    Male humpback whales (Megaptera novaeangliae) produce long, structured sequences of sound underwater, commonly called "songs." Humpbacks progressively modify their songs over time in ways that suggest that individuals are copying song elements that they hear being used by other singers. Little is known about the factors that determine how whales learn from their auditory experiences. Song learning in birds is better understood and appears to be constrained by stable core attributes such as species-specific sound repertoires and song syntax. To clarify whether similar constraints exist for song learning by humpbacks, we analyzed changes over 14 years in the sounds used by humpback whales singing in Hawaiian waters. We found that although the properties of individual sounds within songs are quite variable over time, the overall distribution of certain acoustic features within the repertoire appears to be stable. In particular, our findings suggest that species-specific constraints on temporal features of song sounds determine song form, whereas spectral variability allows whales to flexibly adapt song elements.

  10. oPOSSUM: integrated tools for analysis of regulatory motif over-representation

    PubMed Central

    Ho Sui, Shannan J.; Fulton, Debra L.; Arenillas, David J.; Kwon, Andrew T.; Wasserman, Wyeth W.

    2007-01-01

    The identification of over-represented transcription factor binding sites from sets of co-expressed genes provides insights into the mechanisms of regulation for diverse biological contexts. oPOSSUM, an internet-based system for such studies of regulation, has been improved and expanded in this new release. New features include a worm-specific version for investigating binding sites conserved between Caenorhabditis elegans and C. briggsae, as well as a yeast-specific version for the analysis of co-expressed sets of Saccharomyces cerevisiae genes. The human and mouse applications feature improvements in ortholog mapping, sequence alignments and the delineation of multiple alternative promoters. oPOSSUM2, introduced for the analysis of over-represented combinations of motifs in human and mouse genes, has been integrated with the original oPOSSUM system. Analysis using user-defined background gene sets is now supported. The transcription factor binding site models have been updated to include new profiles from the JASPAR database. oPOSSUM is available at http://www.cisreg.ca/oPOSSUM/ PMID:17576675

  11. Computational Characterization of Exogenous MicroRNAs that Can Be Transferred into Human Circulation

    PubMed Central

    Shu, Jiang; Chiang, Kevin; Zempleni, Janos; Cui, Juan

    2015-01-01

    MicroRNAs have been long considered synthesized endogenously until very recent discoveries showing that human can absorb dietary microRNAs from animal and plant origins while the mechanism remains unknown. Compelling evidences of microRNAs from rice, milk, and honeysuckle transported to human blood and tissues have created a high volume of interests in the fundamental questions that which and how exogenous microRNAs can be transferred into human circulation and possibly exert functions in humans. Here we present an integrated genomics and computational analysis to study the potential deciding features of transportable microRNAs. Specifically, we analyzed all publicly available microRNAs, a total of 34,612 from 194 species, with 1,102 features derived from the microRNA sequence and structure. Through in-depth bioinformatics analysis, 8 groups of discriminative features have been used to characterize human circulating microRNAs and infer the likelihood that a microRNA will get transferred into human circulation. For example, 345 dietary microRNAs have been predicted as highly transportable candidates where 117 of them have identical sequences with their homologs in human and 73 are known to be associated with exosomes. Through a milk feeding experiment, we have validated 9 cow-milk microRNAs in human plasma using microRNA-sequencing analysis, including the top ranked microRNAs such as bta-miR-487b, miR-181b, and miR-421. The implications in health-related processes have been illustrated in the functional analysis. This work demonstrates the data-driven computational analysis is highly promising to study novel molecular characteristics of transportable microRNAs while bypassing the complex mechanistic details. PMID:26528912

  12. Computational Characterization of Exogenous MicroRNAs that Can Be Transferred into Human Circulation.

    PubMed

    Shu, Jiang; Chiang, Kevin; Zempleni, Janos; Cui, Juan

    2015-01-01

    MicroRNAs have been long considered synthesized endogenously until very recent discoveries showing that human can absorb dietary microRNAs from animal and plant origins while the mechanism remains unknown. Compelling evidences of microRNAs from rice, milk, and honeysuckle transported to human blood and tissues have created a high volume of interests in the fundamental questions that which and how exogenous microRNAs can be transferred into human circulation and possibly exert functions in humans. Here we present an integrated genomics and computational analysis to study the potential deciding features of transportable microRNAs. Specifically, we analyzed all publicly available microRNAs, a total of 34,612 from 194 species, with 1,102 features derived from the microRNA sequence and structure. Through in-depth bioinformatics analysis, 8 groups of discriminative features have been used to characterize human circulating microRNAs and infer the likelihood that a microRNA will get transferred into human circulation. For example, 345 dietary microRNAs have been predicted as highly transportable candidates where 117 of them have identical sequences with their homologs in human and 73 are known to be associated with exosomes. Through a milk feeding experiment, we have validated 9 cow-milk microRNAs in human plasma using microRNA-sequencing analysis, including the top ranked microRNAs such as bta-miR-487b, miR-181b, and miR-421. The implications in health-related processes have been illustrated in the functional analysis. This work demonstrates the data-driven computational analysis is highly promising to study novel molecular characteristics of transportable microRNAs while bypassing the complex mechanistic details.

  13. Clusters of ancestrally related genes that show paralogy in whole or in part are a major feature of the genomes of humans and other species.

    PubMed

    Walker, Michael B; King, Benjamin L; Paigen, Kenneth

    2012-01-01

    Arrangements of genes along chromosomes are a product of evolutionary processes, and we can expect that preferable arrangements will prevail over the span of evolutionary time, often being reflected in the non-random clustering of structurally and/or functionally related genes. Such non-random arrangements can arise by two distinct evolutionary processes: duplications of DNA sequences that give rise to clusters of genes sharing both sequence similarity and common sequence features and the migration together of genes related by function, but not by common descent. To provide a background for distinguishing between the two, which is important for future efforts to unravel the evolutionary processes involved, we here provide a description of the extent to which ancestrally related genes are found in proximity.Towards this purpose, we combined information from five genomic datasets, InterPro, SCOP, PANTHER, Ensembl protein families, and Ensembl gene paralogs. The results are provided in publicly available datasets (http://cgd.jax.org/datasets/clustering/paraclustering.shtml) describing the extent to which ancestrally related genes are in proximity beyond what is expected by chance (i.e. form paraclusters) in the human and nine other vertebrate genomes, as well as the D. melanogaster, C. elegans, A. thaliana, and S. cerevisiae genomes. With the exception of Saccharomyces, paraclusters are a common feature of the genomes we examined. In the human genome they are estimated to include at least 22% of all protein coding genes. Paraclusters are far more prevalent among some gene families than others, are highly species or clade specific and can evolve rapidly, sometimes in response to environmental cues. Altogether, they account for a large portion of the functional clustering previously reported in several genomes.

  14. Support vector machine prediction of enzyme function with conjoint triad feature and hierarchical context.

    PubMed

    Wang, Yong-Cui; Wang, Yong; Yang, Zhi-Xia; Deng, Nai-Yang

    2011-06-20

    Enzymes are known as the largest class of proteins and their functions are usually annotated by the Enzyme Commission (EC), which uses a hierarchy structure, i.e., four numbers separated by periods, to classify the function of enzymes. Automatically categorizing enzyme into the EC hierarchy is crucial to understand its specific molecular mechanism. In this paper, we introduce two key improvements in predicting enzyme function within the machine learning framework. One is to introduce the efficient sequence encoding methods for representing given proteins. The second one is to develop a structure-based prediction method with low computational complexity. In particular, we propose to use the conjoint triad feature (CTF) to represent the given protein sequences by considering not only the composition of amino acids but also the neighbor relationships in the sequence. Then we develop a support vector machine (SVM)-based method, named as SVMHL (SVM for hierarchy labels), to output enzyme function by fully considering the hierarchical structure of EC. The experimental results show that our SVMHL with the CTF outperforms SVMHL with the amino acid composition (AAC) feature both in predictive accuracy and Matthew's correlation coefficient (MCC). In addition, SVMHL with the CTF obtains the accuracy and MCC ranging from 81% to 98% and 0.82 to 0.98 when predicting the first three EC digits on a low-homologous enzyme dataset. We further demonstrate that our method outperforms the methods which do not take account of hierarchical relationship among enzyme categories and alternative methods which incorporate prior knowledge about inter-class relationships. Our structure-based prediction model, SVMHL with the CTF, reduces the computational complexity and outperforms the alternative approaches in enzyme function prediction. Therefore our new method will be a useful tool for enzyme function prediction community.

  15. Systematic analysis and evolution of 5S ribosomal DNA in metazoans.

    PubMed

    Vierna, J; Wehner, S; Höner zu Siederdissen, C; Martínez-Lage, A; Marz, M

    2013-11-01

    Several studies on 5S ribosomal DNA (5S rDNA) have been focused on a subset of the following features in mostly one organism: number of copies, pseudogenes, secondary structure, promoter and terminator characteristics, genomic arrangements, types of non-transcribed spacers and evolution. In this work, we systematically analyzed 5S rDNA sequence diversity in available metazoan genomes, and showed organism-specific and evolutionary-conserved features. Putatively functional sequences (12,766) from 97 organisms allowed us to identify general features of this multigene family in animals. Interestingly, we show that each mammal species has a highly conserved (housekeeping) 5S rRNA type and many variable ones. The genomic organization of 5S rDNA is still under debate. Here, we report the occurrence of several paralog 5S rRNA sequences in 58 of the examined species, and a flexible genome organization of 5S rDNA in animals. We found heterogeneous 5S rDNA clusters in several species, supporting the hypothesis of an exchange of 5S rDNA from one locus to another. A rather high degree of variation of upstream, internal and downstream putative regulatory regions appears to characterize metazoan 5S rDNA. We systematically studied the internal promoters and described three different types of termination signals, as well as variable distances between the coding region and the typical termination signal. Finally, we present a statistical method for detection of linkage among noncoding RNA (ncRNA) gene families. This method showed no evolutionary-conserved linkage among 5S rDNAs and any other ncRNA genes within Metazoa, even though we found 5S rDNA to be linked to various ncRNAs in several clades.

  16. Systematic analysis and evolution of 5S ribosomal DNA in metazoans

    PubMed Central

    Vierna, J; Wehner, S; Höner zu Siederdissen, C; Martínez-Lage, A; Marz, M

    2013-01-01

    Several studies on 5S ribosomal DNA (5S rDNA) have been focused on a subset of the following features in mostly one organism: number of copies, pseudogenes, secondary structure, promoter and terminator characteristics, genomic arrangements, types of non-transcribed spacers and evolution. In this work, we systematically analyzed 5S rDNA sequence diversity in available metazoan genomes, and showed organism-specific and evolutionary-conserved features. Putatively functional sequences (12 766) from 97 organisms allowed us to identify general features of this multigene family in animals. Interestingly, we show that each mammal species has a highly conserved (housekeeping) 5S rRNA type and many variable ones. The genomic organization of 5S rDNA is still under debate. Here, we report the occurrence of several paralog 5S rRNA sequences in 58 of the examined species, and a flexible genome organization of 5S rDNA in animals. We found heterogeneous 5S rDNA clusters in several species, supporting the hypothesis of an exchange of 5S rDNA from one locus to another. A rather high degree of variation of upstream, internal and downstream putative regulatory regions appears to characterize metazoan 5S rDNA. We systematically studied the internal promoters and described three different types of termination signals, as well as variable distances between the coding region and the typical termination signal. Finally, we present a statistical method for detection of linkage among noncoding RNA (ncRNA) gene families. This method showed no evolutionary-conserved linkage among 5S rDNAs and any other ncRNA genes within Metazoa, even though we found 5S rDNA to be linked to various ncRNAs in several clades. PMID:23838690

  17. A characteristic of polymorphic membrane protein F of Chlamydia trachomatis isolated from male urogenital tracts in Japan.

    PubMed

    Yamazaki, Tomohiro; Matsuo, Junji; Takahashi, Satoshi; Kumagai, Shouta; Shimoda, Tomoko; Abe, Kiyotaka; Minami, Kunihiro; Yamaguchi, Hiroyuki

    2015-12-01

    Although sexually transmitted disease due to Chlamydia trachomatis occurs similarly in both men and women, the female urogenital tract differs from that of males anatomically and physiologically, possibly leading to specific polymorphisms of the bacterial surface molecules. In the present study, we therefore characterized polymorphic features in a high-definition phylogenetic marker, polymorphic outer membrane protein (Pmp) F of C. trachomatis strains isolated from male urogenital tracts in Japan (Category: Japan-males, n = 12), when compared with those isolated from female cervical ducts in Japan (Category: Japan-females, n = 11), female cervical ducts in the other country (Category: Ref-females, n = 12) or homosexual male rectums in the other country (Category: Ref-males, n = 7), by general bioinformatics analysis tool with MAFFT software. As a result, phylogenetic reconstruction of the PmpF amino acid sequences showing three distinct clusters revealed that the Japan-males were limited into cluster 1 and 2, although there were only four clusters even though including an outgroup. Meanwhile, the phylogenetic distance values of PmpF passenger domain without hinge region, but not its full-length sequence, showed that the Japan-males were more stable and displayed less diversity when compared with the other categories, supported by the sequence conservation features. Thus, PmpF passenger domain is a useful phylogenetic maker, and the phylogenic features indicate that C. trachomatis strains isolated from male urogenital tracts in Japan may be unique, suggesting an adaptation depending on selective pressure, such as the presence or absence of microbial flora, furthermore possibly connecting to sexual differentiation. Copyright © 2015 Japanese Society of Chemotherapy and The Japanese Association for Infectious Diseases. Published by Elsevier Ltd. All rights reserved.

  18. Protein fold recognition using geometric kernel data fusion.

    PubMed

    Zakeri, Pooya; Jeuris, Ben; Vandebril, Raf; Moreau, Yves

    2014-07-01

    Various approaches based on features extracted from protein sequences and often machine learning methods have been used in the prediction of protein folds. Finding an efficient technique for integrating these different protein features has received increasing attention. In particular, kernel methods are an interesting class of techniques for integrating heterogeneous data. Various methods have been proposed to fuse multiple kernels. Most techniques for multiple kernel learning focus on learning a convex linear combination of base kernels. In addition to the limitation of linear combinations, working with such approaches could cause a loss of potentially useful information. We design several techniques to combine kernel matrices by taking more involved, geometry inspired means of these matrices instead of convex linear combinations. We consider various sequence-based protein features including information extracted directly from position-specific scoring matrices and local sequence alignment. We evaluate our methods for classification on the SCOP PDB-40D benchmark dataset for protein fold recognition. The best overall accuracy on the protein fold recognition test set obtained by our methods is ∼ 86.7%. This is an improvement over the results of the best existing approach. Moreover, our computational model has been developed by incorporating the functional domain composition of proteins through a hybridization model. It is observed that by using our proposed hybridization model, the protein fold recognition accuracy is further improved to 89.30%. Furthermore, we investigate the performance of our approach on the protein remote homology detection problem by fusing multiple string kernels. The MATLAB code used for our proposed geometric kernel fusion frameworks are publicly available at http://people.cs.kuleuven.be/∼raf.vandebril/homepage/software/geomean.php?menu=5/. © The Author 2014. Published by Oxford University Press.

  19. Molecular beacon probes-base multiplex NASBA Real-time for detection of HIV-1 and HCV.

    PubMed

    Mohammadi-Yeganeh, S; Paryan, M; Mirab Samiee, S; Kia, V; Rezvan, H

    2012-06-01

    Developed in 1991, nucleic acid sequence-based amplification (NASBA) has been introduced as a rapid molecular diagnostic technique, where it has been shown to give quicker results than PCR, and it can also be more sensitive. This paper describes the development of a molecular beacon-based multiplex NASBA assay for simultaneous detection of HIV-1 and HCV in plasma samples. A well-conserved region in the HIV-1 pol gene and 5'-NCR of HCV genome were used for primers and molecular beacon design. The performance features of HCV/HIV-1 multiplex NASBA assay including analytical sensitivity and specificity, clinical sensitivity and clinical specificity were evaluated. The analysis of scalar concentrations of the samples indicated that the limit of quantification of the assay was <1000 copies/ml for HIV-1 and <500 copies/ml for HCV with 95% confidence interval. Multiplex NASBA assay showed a 98% sensitivity and 100% specificity. The analytical specificity study with BLAST software demonstrated that the primers do not attach to any other sequences except for that of HIV-1 or HCV. The primers and molecular beacon probes detected all HCV genotypes and all major variants of HIV-1. This method may represent a relatively inexpensive isothermal method for detection of HIV-1/HCV co-infection in monitoring of patients.

  20. Evaluation of two main RNA-seq approaches for gene quantification in clinical RNA sequencing: polyA+ selection versus rRNA depletion.

    PubMed

    Zhao, Shanrong; Zhang, Ying; Gamini, Ramya; Zhang, Baohong; von Schack, David

    2018-03-19

    To allow efficient transcript/gene detection, highly abundant ribosomal RNAs (rRNA) are generally removed from total RNA either by positive polyA+ selection or by rRNA depletion (negative selection) before sequencing. Comparisons between the two methods have been carried out by various groups, but the assessments have relied largely on non-clinical samples. In this study, we evaluated these two RNA sequencing approaches using human blood and colon tissue samples. Our analyses showed that rRNA depletion captured more unique transcriptome features, whereas polyA+ selection outperformed rRNA depletion with higher exonic coverage and better accuracy of gene quantification. For blood- and colon-derived RNAs, we found that 220% and 50% more reads, respectively, would have to be sequenced to achieve the same level of exonic coverage in the rRNA depletion method compared with the polyA+ selection method. Therefore, in most cases we strongly recommend polyA+ selection over rRNA depletion for gene quantification in clinical RNA sequencing. Our evaluation revealed that a small number of lncRNAs and small RNAs made up a large fraction of the reads in the rRNA depletion RNA sequencing data. Thus, we recommend that these RNAs are specifically depleted to improve the sequencing depth of the remaining RNAs.

Top