Science.gov

Sample records for acid sequence predicted

  1. Prediction of protein antigenic determinants from amino acid sequences

    SciTech Connect

    Hopp, T.P.; Woods, K.R.

    1981-06-01

    A method is presented for locating protein antigenic determinants by analyzing amino acid sequences in order to find the point of greatest local hydrophilicity. This is accomplished by assigning each amino acid a numerical value (hydrophilicity value) and then repetitively averaging these values along the peptide chain. The point of highest local average hydrophilicity is invariably located in, or immediately adjacent to, an antigenic determinant. It was found that the prediction success rate depended on averaging group length, with hexapeptide averages yielding optimal results. The method was developed using 12 proteins for which extensive immunochemical analysis has been carried out and subsequently was used to predict antigenic determinants for the following proteins: hepatitis B surface antigen, influenza hemagglutinins, fowl plague virus hemagglutinin, human histocompatibility antigen HLA-B7, human interferons, Escherichia coli and cholera enterotoxins, ragweed allergens Ra3 and Ra5, and streptococcal M protein. The hepatitis B surface antigen sequence was synthesized by chemical means and was shown to have antigenic activity by radioimmunoassay.
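
    The averaging step described in this abstract can be illustrated with a short Python sketch. It is not the authors' code: the hydrophilicity table is the Hopp-Woods scale as it is commonly tabulated (an assumption that should be checked against the original paper), and the function names are hypothetical.

```python
# Hopp-Woods hydrophilicity values as commonly tabulated (assumption:
# verify against the 1981 paper before relying on them).
HOPP_WOODS = {
    "R": 3.0, "D": 3.0, "E": 3.0, "K": 3.0, "S": 0.3, "N": 0.2, "Q": 0.2,
    "G": 0.0, "P": 0.0, "T": -0.4, "A": -0.5, "H": -0.5, "C": -1.0,
    "M": -1.3, "V": -1.5, "I": -1.8, "L": -1.8, "Y": -2.3, "F": -2.5,
    "W": -3.4,
}


def window_averages(sequence, window=6):
    """Return the mean hydrophilicity of every run of `window` residues."""
    values = [HOPP_WOODS[residue] for residue in sequence.upper()]
    if len(values) < window:
        return []
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def most_hydrophilic_window(sequence, window=6):
    """Return (start_index, peptide, average) for the highest-scoring window."""
    averages = window_averages(sequence, window)
    if not averages:
        raise ValueError("sequence is shorter than the averaging window")
    start = max(range(len(averages)), key=averages.__getitem__)
    return start, sequence[start:start + window], averages[start]
```

    The default window of 6 mirrors the hexapeptide averaging that the abstract reports as optimal; ties are resolved in favor of the first window.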

  2. Predicting protein amidation sites by orchestrating amino acid sequence features

    NASA Astrophysics Data System (ADS)

    Zhao, Shuqiu; Yu, Hua; Gong, Xiujun

    2017-08-01

    Amidation is the fourth major category of post-translational modifications and plays an important role in physiological and pathological processes. Identifying amidation sites can help us understand amidation and recognize the underlying causes of many kinds of diseases. However, the traditional experimental methods for identifying amidation sites are often time-consuming and expensive. In this study, we propose a computational method for predicting amidation sites by orchestrating amino acid sequence features. Three kinds of feature extraction methods are used to build a feature vector that captures not only the physicochemical properties but also position-related information of the amino acids. An extremely randomized trees algorithm is applied in a supervised fashion to choose the optimal features, removing redundancy and dependence among components of the feature vector. Finally, a support vector machine classifier is used to label the amidation sites. When tested on an independent data set, the proposed method performs better than all the previous ones, with a prediction accuracy of 0.962, a Matthews correlation coefficient of 0.89, and an area under the curve of 0.964.

  3. PrDOS: prediction of disordered protein regions from amino acid sequence.

    PubMed

    Ishida, Takashi; Kinoshita, Kengo

    2007-07-01

    PrDOS is a server that predicts the disordered regions of a protein from its amino acid sequence (http://prdos.hgc.jp). The server accepts a single protein amino acid sequence, in either plain text or FASTA format. The prediction system is composed of two predictors: a predictor based on local amino acid sequence information and one based on template proteins. The server combines the results of the two predictors and returns a two-state prediction (order/disorder) and a disorder probability for each residue. The prediction results are sent by e-mail, and the server also provides a web-interface to check the results.

  4. SNBRFinder: A Sequence-Based Hybrid Algorithm for Enhanced Prediction of Nucleic Acid-Binding Residues

    PubMed Central

    Sun, Jun; Liu, Rong

    2015-01-01

    Protein-nucleic acid interactions are central to various fundamental biological processes. Automated methods capable of reliably identifying DNA- and RNA-binding residues in protein sequence are assuming ever-increasing importance. The majority of current algorithms rely on feature-based prediction, but their accuracy remains to be further improved. Here we propose a sequence-based hybrid algorithm SNBRFinder (Sequence-based Nucleic acid-Binding Residue Finder) by merging a feature predictor SNBRFinderF and a template predictor SNBRFinderT. SNBRFinderF was established using the support vector machine whose inputs include sequence profile and other complementary sequence descriptors, while SNBRFinderT was implemented with the sequence alignment algorithm based on profile hidden Markov models to capture the weakly homologous template of query sequence. Experimental results show that SNBRFinderF was clearly superior to the commonly used sequence profile-based predictor and SNBRFinderT can achieve comparable performance to the structure-based template methods. Leveraging the complementary relationship between these two predictors, SNBRFinder reasonably improved the performance of both DNA- and RNA-binding residue predictions. More importantly, the sequence-based hybrid prediction reached competitive performance relative to our previous structure-based counterpart. Our extensive and stringent comparisons show that SNBRFinder has obvious advantages over the existing sequence-based prediction algorithms. The value of our algorithm is highlighted by establishing an easy-to-use web server that is freely accessible at http://ibi.hzau.edu.cn/SNBRFinder. PMID:26176857

  5. Feature selection from short amino acid sequences in phosphorylation prediction problem

    NASA Astrophysics Data System (ADS)

    Wecławski, Jakub; Jankowski, Stanisław; Szymański, Zbigniew

    The paper describes a solution to the problem of feature selection from amino acid sequences in phosphorylation prediction. We show that even for short sequences, variable selection leads to better classification performance. Moreover, the final simplicity of the models allows for better data understanding and can be used by an expert for further analysis. The feature selection process is divided into two parts: i) a classification tree is used to find the most relevant positions in amino acid sequences, ii) a contrast pattern kernel is then applied for pattern selection. This work summarizes the research conducted on classification of short amino acid sequences. The results of the research allowed us to propose a general scheme of amino acid sequence analysis.

  6. Prediction of flexible/rigid regions from protein sequences using k-spaced amino acid pairs

    PubMed Central

    Chen, Ke; Kurgan, Lukasz A; Ruan, Jishou

    2007-01-01

    Background: Traditionally, it is believed that the native structure of a protein corresponds to a global minimum of its free energy. However, with the growing number of known tertiary (3D) protein structures, researchers have discovered that some proteins can alter their structures in response to a change in their surroundings or with the help of other proteins or ligands. Such structural shifts play a crucial role with respect to the protein function. To this end, we propose a machine learning method for the prediction of the flexible/rigid regions of proteins (referred to as FlexRP); the method is based on a novel sequence representation and feature selection. Knowledge of the flexible/rigid regions may provide insights into the protein folding process and the 3D structure prediction. Results: The flexible/rigid regions were defined based on a dataset, which includes protein sequences that have multiple experimental structures, and which was previously used to study the structural conservation of proteins. Sequences drawn from this dataset were represented based on feature sets that were proposed in prior research, such as PSI-BLAST profiles, composition vector and binary sequence encoding, and a newly proposed representation based on frequencies of k-spaced amino acid pairs. These representations were processed by feature selection to reduce the dimensionality. Several machine learning methods for the prediction of flexible/rigid regions and two recently proposed methods for the prediction of conformational changes and unstructured regions were compared with the proposed method. The FlexRP method, which applies Logistic Regression and collocation-based representation with 95 features, obtained 79.5% accuracy. The two runner-up methods, which apply the same sequence representation and Support Vector Machines (SVM) and Naïve Bayes classifiers, obtained 79.2% and 78.4% accuracy, respectively. The remaining considered methods are characterized by accuracies below 70
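
    As a rough illustration of the k-spaced amino acid pair representation mentioned above, the sketch below counts residue pairs separated by exactly k intervening residues and normalizes by the number of such pairs. The normalization and the function name are assumptions made for illustration; the abstract does not give the exact definition used by FlexRP.

```python
from collections import Counter


def k_spaced_pair_frequencies(sequence, k):
    """Frequencies of residue pairs separated by exactly k other residues.

    k = 0 gives adjacent pairs, k = 1 pairs with one residue between, etc.
    Frequencies are counts divided by the number of pairs at that spacing
    (an assumed normalization).
    """
    gap = k + 1
    total = len(sequence) - gap
    if total <= 0:
        return {}
    counts = Counter(sequence[i] + sequence[i + gap] for i in range(total))
    return {pair: n / total for pair, n in counts.items()}
```

    Concatenating these dictionaries for several values of k yields a sparse feature vector to which feature selection can then be applied.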

  7. Fast computational methods for predicting protein structure from primary amino acid sequence

    DOEpatents

    Agarwal, Pratul Kumar

    2011-07-19

    The present invention provides a method utilizing primary amino acid sequence of a protein, energy minimization, molecular dynamics and protein vibrational modes to predict three-dimensional structure of a protein. The present invention also determines possible intermediates in the protein folding pathway. The present invention has important applications to the design of novel drugs as well as protein engineering. The present invention predicts the three-dimensional structure of a protein independent of size of the protein, overcoming a significant limitation in the prior art.

  8. Analysis of protein function and its prediction from amino acid sequence.

    PubMed

    Clark, Wyatt T; Radivojac, Predrag

    2011-07-01

    Understanding protein function is one of the keys to understanding life at the molecular level. It is also important in the context of human disease because many conditions arise as a consequence of alterations of protein function. The recent availability of relatively inexpensive sequencing technology has resulted in thousands of complete or partially sequenced genomes with millions of functionally uncharacterized proteins. Such a large volume of data, combined with the lack of high-throughput experimental assays to functionally annotate proteins, contributes to the growing importance of automated function prediction. Here, we study proteins annotated by Gene Ontology (GO) terms and estimate the accuracy of functional transfer from protein sequence only. We find that the transfer of GO terms by pairwise sequence alignments is only moderately accurate, showing a surprisingly small influence of sequence identity (SID) in a broad range (30-100%). We developed and evaluated a new predictor of protein function, functional annotator (FANN), from amino acid sequence. The predictor exploits a multioutput neural network framework which is well suited to simultaneously modeling dependencies between functional terms. Experiments provide evidence that FANN-GO (predictor of GO terms; available from http://www.informatics.indiana.edu/predrag) outperforms standard methods such as transfer by global or local SID as well as GOtcha, a method that incorporates the structure of GO.

  9. The Use of Orthologous Sequences to Predict the Impact of Amino Acid Substitutions on Protein Function

    PubMed Central

    Rine, Jasper

    2010-01-01

    Computational predictions of the functional impact of genetic variation play a critical role in human genetics research. For nonsynonymous coding variants, most prediction algorithms make use of patterns of amino acid substitutions observed among homologous proteins at a given site. In particular, substitutions observed in orthologous proteins from other species are often assumed to be tolerated in the human protein as well. We examined this assumption by evaluating a panel of nonsynonymous mutants of a prototypical human enzyme, methylenetetrahydrofolate reductase (MTHFR), in a yeast cell-based functional assay. As expected, substitutions in human MTHFR at sites that are well-conserved across distant orthologs result in an impaired enzyme, while substitutions present in recently diverged sequences (including a 9-site mutant that “resurrects” the human-macaque ancestor) result in a functional enzyme. We also interrogated 30 sites with varying degrees of conservation by creating substitutions in the human enzyme that are accepted in at least one ortholog of MTHFR. Quite surprisingly, most of these substitutions were deleterious to the human enzyme. The results suggest that selective constraints vary between phylogenetic lineages such that inclusion of distant orthologs to infer selective pressures on the human enzyme may be misleading. We propose that homologous proteins are best used to reconstruct ancestral sequences and infer amino acid conservation among only direct lineal ancestors of a particular protein. We show that such an “ancestral site preservation” measure outperforms other prediction methods, not only in our selected set for MTHFR, but also in an exhaustive set of E. coli LacI mutants. PMID:20523748

  10. Gene sequence and predicted amino acid sequence of the motA protein, a membrane-associated protein required for flagellar rotation in Escherichia coli.

    PubMed Central

    Dean, G E; Macnab, R M; Stader, J; Matsumura, P; Burks, C

    1984-01-01

    The motA and motB gene products of Escherichia coli are integral membrane proteins necessary for flagellar rotation. We determined the DNA sequence of the region containing the motA gene and its promoter. Within this sequence, there is an open reading frame of 885 nucleotides, which with high probability (98% confidence level) meets criteria for a coding sequence. The 295-residue amino acid translation product had a molecular weight of 31,974, in good agreement with the value determined experimentally by gel electrophoresis. The amino acid sequence, which was quite hydrophobic, was subjected to a theoretical analysis designed to predict membrane-spanning alpha-helical segments of integral membrane proteins; four such hydrophobic helices were predicted by this treatment. Additional amphipathic helices may also be present. A remarkable feature of the sequence is the existence of two segments of high uncompensated charge density, one positive and the other negative. Possible organization of the protein in the membrane is discussed. Asymmetry in the amino acid composition of translated DNA sequences was used to distinguish between two possible initiation codons. The use of this method as a criterion for authentication of coding regions is described briefly in an Appendix. PMID:6090403

  11. Prediction of protein structural class for low-similarity sequences using Chou's pseudo amino acid composition and wavelet denoising.

    PubMed

    Yu, Bin; Lou, Lifeng; Li, Shan; Zhang, Yusen; Qiu, Wenying; Wu, Xue; Wang, Minghui; Tian, Baoguang

    2017-09-01

    Prediction of protein structural class plays an important role in protein structure and function analysis, drug design and many other biological applications. Prediction of protein structural class for low-similarity sequences is still a challenging task. Based on the theory of wavelet denoising, this paper presents a novel method of prediction of protein structural class for the first time. Firstly, the features of the protein sequence are extracted by using Chou's pseudo amino acid composition (PseAAC). Then the extracted feature information is denoised by two-dimensional (2D) wavelet. Finally, the optimal feature vectors are input to a support vector machine (SVM) classifier to predict protein structural classes. We obtained significant predictive results using the jackknife test on three low-similarity protein structural class datasets, 25PDB, 1189 and 640, and compared our method with previous methods. The results indicate that the method proposed in this paper can effectively improve the prediction accuracy of protein structural class, which will be a reliable tool for prediction of protein structural class, especially for low-similarity sequences.

  12. Predicting subcellular localization of proteins based on their N-terminal amino acid sequence.

    PubMed

    Emanuelsson, O; Nielsen, H; Brunak, S; von Heijne, G

    2000-07-21

    A neural network-based tool, TargetP, for large-scale subcellular location prediction of newly identified proteins has been developed. Using N-terminal sequence information only, it discriminates between proteins destined for the mitochondrion, the chloroplast, the secretory pathway, and "other" localizations with a success rate of 85% (plant) or 90% (non-plant) on redundancy-reduced test sets. From a TargetP analysis of the recently sequenced Arabidopsis thaliana chromosomes 2 and 4 and the Ensembl Homo sapiens protein set, we estimate that 10% of all plant proteins are mitochondrial and 14% chloroplastic, and that the abundance of secretory proteins, in both Arabidopsis and Homo, is around 10%. TargetP also predicts cleavage sites with levels of correctly predicted sites ranging from approximately 40% to 50% (chloroplastic and mitochondrial presequences) to above 70% (secretory signal peptides). TargetP is available as a web-server at http://www.cbs.dtu.dk/services/TargetP/.

  13. Subcellular location prediction of proteins using support vector machines with alignment of block sequences utilizing amino acid composition.

    PubMed

    Tamura, Takeyuki; Akutsu, Tatsuya

    2007-11-30

    Subcellular location prediction of proteins is an important and well-studied problem in bioinformatics. This is a problem of predicting which part of a cell a given protein is transported to, given the amino acid sequence of the protein as input. This problem is becoming more important since information on subcellular location is helpful for annotation of proteins and genes and the number of complete genomes is rapidly increasing. Since existing predictors are based on various heuristics, it is important to develop a simple method with high prediction accuracies. In this paper, we propose a novel and general prediction method by combining techniques for sequence alignment and feature vectors based on amino acid composition. We implemented this method with support vector machines on plant data sets extracted from the TargetP database. Through fivefold cross validation tests, the obtained overall accuracies and average MCC were 0.9096 and 0.8655 respectively. We also applied our method to other datasets including that of WoLF PSORT. Although there is a predictor which uses the information of gene ontology and yields higher accuracy than ours, our accuracies are higher than those of existing predictors which use only sequence information. Since such information as gene ontology can be obtained only for known proteins, our predictor is considered to be useful for subcellular location prediction of newly-discovered proteins. Furthermore, the idea of combining alignment and amino acid frequency is novel and general so that it may be applied to other problems in bioinformatics. Our method for plant proteins is also implemented as a web-system and available on http://sunflower.kuicr.kyoto-u.ac.jp/~tamura/slpfa.html.
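
    The feature vectors based on amino acid composition can be sketched as follows. This is only an illustration under stated assumptions: the fixed-size block splitting, the 20-letter alphabet order and the function names are hypothetical, and the alignment of block sequences described in the abstract is not reproduced here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residues


def composition_vector(sequence):
    """20-dimensional vector of residue frequencies in AMINO_ACIDS order.

    Characters outside the 20 standard residues are ignored.
    """
    sequence = sequence.upper()
    counts = [sequence.count(aa) for aa in AMINO_ACIDS]
    length = sum(counts)
    if length == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [count / length for count in counts]


def block_composition_vectors(sequence, block_size):
    """Composition vector for each consecutive block of `block_size` residues."""
    return [composition_vector(sequence[i:i + block_size])
            for i in range(0, len(sequence), block_size)]
```

    Per-block vectors of this kind could be fed to a support vector machine; how the blocks are chosen and aligned is where the published method differs from this sketch.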

  14. Predicting Protein–Protein Interaction Sites Using Sequence Descriptors and Site Propensity of Neighboring Amino Acids

    PubMed Central

    Kuo, Tzu-Hao; Li, Kuo-Bin

    2016-01-01

    Information about the interface sites of Protein–Protein Interactions (PPIs) is useful for much biological research. However, despite the advancement of experimental techniques, the identification of PPI sites remains a challenging task. Using a statistical learning technique, we proposed a computational tool for predicting PPI sites. As an alternative to similar approaches requiring structural information, the proposed method takes all of its input from protein sequences. In addition to typical sequence features, our method takes into consideration that interaction sites are not randomly distributed over the protein sequence. We characterized this positional preference using protein complexes with known structures, proposed a numerical index to estimate the propensity and then incorporated the index into a learning system. The resulting predictor, without using structural information, yields an area under the ROC curve (AUC) of 0.675, recall of 0.597, precision of 0.311 and accuracy of 0.583 on a ten-fold cross-validation experiment. This performance is comparable to the previous approach in which structural information was used. Upon introducing the B-factor data to our predictor, we demonstrated that the AUC can be further improved to 0.750. The tool is accessible at http://bsaltools.ym.edu.tw/predppis. PMID:27792167
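
    For readers comparing the recall, precision and accuracy figures quoted above, the standard definitions can be written as a small Python helper. The function name is hypothetical and the counts used in any call are the caller's own; nothing here is taken from the paper's data.

```python
def classification_metrics(tp, fp, tn, fn):
    """Recall, precision and accuracy from confusion-matrix counts.

    tp/fp/tn/fn are true positives, false positives, true negatives and
    false negatives; a metric with an empty denominator is reported as 0.0.
    """
    total = tp + fp + tn + fn
    return {
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
    }
```

    When interface residues are a small minority of all residues, precision can stay low even at moderate recall, which is one reason these metrics are usually reported together with the AUC.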

  15. Prediction of posttranslational modification sites from amino acid sequences with kernel methods.

    PubMed

    Xu, Yan; Wang, Xiaobo; Wang, Yongcui; Tian, Yingjie; Shao, Xiaojian; Wu, Ling-Yun; Deng, Naiyang

    2014-03-07

    Post-translational modification (PTM) is the chemical modification of a protein after its translation and one of the later steps in protein biosynthesis for many proteins. It plays an important role in modifying the end product of gene expression and contributes to biological processes and disease conditions. However, the experimental methods for identifying PTM sites are both costly and time-consuming. Hence computational methods are highly desired. In this work, a novel encoding method, PSPM (position-specific propensity matrices), is developed. Then a support vector machine (SVM) with the kernel matrix computed by PSPM is applied to predict the PTM sites. The experimental results indicate that the performance of the new method is better than or comparable to that of the existing methods. Therefore, the new method is a useful computational resource for the identification of PTM sites. A unified standalone software, PTMPred, is developed. It can be used to predict all types of PTM sites if the user provides the training datasets. The software can be freely downloaded from http://www.aporc.org/doc/wiki/PTMPred.

  16. Hybridization properties of long nucleic acid probes for detection of variable target sequences, and development of a hybridization prediction algorithm.

    PubMed

    Ohrmalm, Christina; Jobs, Magnus; Eriksson, Ronnie; Golbob, Sultan; Elfaitouri, Amal; Benachenhou, Farid; Strømme, Maria; Blomberg, Jonas

    2010-11-01

    One of the main problems in nucleic acid-based techniques for detection of infectious agents, such as influenza viruses, is that of nucleic acid sequence variation. DNA probes, 70-nt long, some including the nucleotide analog deoxyribose-Inosine (dInosine), were analyzed for hybridization tolerance to different amounts and distributions of mismatching bases, e.g. synonymous mutations, in target DNA. Microsphere-linked 70-mer probes were hybridized in 3M TMAC buffer to biotinylated single-stranded (ss) DNA for subsequent analysis in a Luminex® system. When mismatches interrupted contiguous matching stretches of 6 nt or longer, it had a strong impact on hybridization. Contiguous matching stretches are more important than the same number of matching nucleotides separated by mismatches into several regions. dInosine, but not 5-nitroindole, substitutions at mismatching positions stabilized hybridization remarkably well, comparable to N (4-fold) wobbles in the same positions. In contrast to shorter probes, 70-nt probes with judiciously placed dInosine substitutions and/or wobble positions were remarkably mismatch tolerant, with preserved specificity. An algorithm, NucZip, was constructed to model the nucleation and zipping phases of hybridization, integrating both local and distant binding contributions. It predicted hybridization more exactly than previous algorithms, and has the potential to guide the design of variation-tolerant yet specific probes.
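
    The observation that contiguous matching stretches matter more than the raw number of matching nucleotides can be made concrete with a small helper that lists the lengths of uninterrupted matching runs. This is not the NucZip algorithm; it is a minimal sketch that assumes the probe and the target region are already aligned, of equal length, and written in the same sense, and the function name is hypothetical.

```python
def matching_stretches(probe, target):
    """Lengths of contiguous matching runs between two pre-aligned sequences."""
    if len(probe) != len(target):
        raise ValueError("probe and target must be aligned to equal length")
    runs, current = [], 0
    for p, t in zip(probe.upper(), target.upper()):
        if p == t:
            current += 1
        else:
            # A mismatch closes the current run, if there is one.
            if current:
                runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs
```

    Two targets with the same total number of matches can give very different run lists, for example one long stretch versus many short ones, which is the distinction the abstract draws.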

  17. Hybridization properties of long nucleic acid probes for detection of variable target sequences, and development of a hybridization prediction algorithm

    PubMed Central

    Öhrmalm, Christina; Jobs, Magnus; Eriksson, Ronnie; Golbob, Sultan; Elfaitouri, Amal; Benachenhou, Farid; Strømme, Maria; Blomberg, Jonas

    2010-01-01

    One of the main problems in nucleic acid-based techniques for detection of infectious agents, such as influenza viruses, is that of nucleic acid sequence variation. DNA probes, 70-nt long, some including the nucleotide analog deoxyribose-Inosine (dInosine), were analyzed for hybridization tolerance to different amounts and distributions of mismatching bases, e.g. synonymous mutations, in target DNA. Microsphere-linked 70-mer probes were hybridized in 3M TMAC buffer to biotinylated single-stranded (ss) DNA for subsequent analysis in a Luminex® system. When mismatches interrupted contiguous matching stretches of 6 nt or longer, it had a strong impact on hybridization. Contiguous matching stretches are more important than the same number of matching nucleotides separated by mismatches into several regions. dInosine, but not 5-nitroindole, substitutions at mismatching positions stabilized hybridization remarkably well, comparable to N (4-fold) wobbles in the same positions. In contrast to shorter probes, 70-nt probes with judiciously placed dInosine substitutions and/or wobble positions were remarkably mismatch tolerant, with preserved specificity. An algorithm, NucZip, was constructed to model the nucleation and zipping phases of hybridization, integrating both local and distant binding contributions. It predicted hybridization more exactly than previous algorithms, and has the potential to guide the design of variation-tolerant yet specific probes. PMID:20864443

  18. Immunoreactivity of polyclonal antibodies generated against the carboxy terminus of the predicted amino acid sequence of the Huntington disease gene

    SciTech Connect

    Alkatib, G.; Graham, R.; Pelmear-Telenius, A.

    1994-09-01

    A cDNA fragment spanning the 3′-end of the Huntington disease gene (from 8052 to 9252) was cloned into a prokaryotic expression vector containing the E. coli lac promoter and a portion of the coding sequence for β-galactosidase. The truncated β-galactosidase gene was cleaved with BamHI and fused in frame to the BamHI fragment of the Huntington disease gene 3′-end. Expression analysis of proteins made in E. coli revealed that 20-30% of the total cellular proteins was represented by the β-galactosidase-huntingtin fusion protein. The identity of the Huntington disease protein amino acid sequences was confirmed by protein sequence analysis. Affinity chromatography was used to purify large quantities of the fusion protein from bacterial cell lysates. Affinity-purified proteins were used to immunize New Zealand white rabbits for antibody production. The generated polyclonal antibodies were used to immunoprecipitate the Huntington disease gene product expressed in a neuroblastoma cell line. In this cell line the antibodies precipitated two protein bands of apparent gel migrations of 200 and 150 kd, which together correspond to the calculated molecular weight of the Huntington disease gene product (350 kd). Immunoblotting experiments revealed the presence of a large precursor protein in the range of 350-750 kd, which is in agreement with the predicted molecular weight of the protein without post-translational modifications. These results indicate that the huntingtin protein is cleaved into two subunits in this neuroblastoma cell line and imply that cleavage of a large precursor protein may contribute to its biological activity. Experiments are ongoing to determine the precursor-product relationship, to examine the synthesis of the huntingtin protein in freshly isolated rat brains, and to determine the cellular and subcellular distribution of the gene product.

  19. Nucleotide sequence of the Klebsiella pneumoniae nifD gene and predicted amino acid sequence of the alpha-subunit of nitrogenase MoFe protein.

    PubMed Central

    Ioannidis, I; Buck, M

    1987-01-01

    The nucleotide sequence of the Klebsiella pneumoniae nifD gene is presented and together with the accompanying paper [Holland, Zilberstein, Zamir & Sussman (1987) Biochem. J. 247, 277-285] completes the sequence of the nifHDK genes encoding the nitrogenase polypeptides. The K. pneumoniae nifD gene encodes the 483-amino acid-residue nitrogenase alpha-subunit polypeptide of Mr 54156. The alpha-subunit has five strongly conserved cysteine residues at positions 63, 89, 155, 184 and 275, some occurring in a region showing both primary sequence and potential structural homology to the K. pneumoniae nitrogenase beta-subunit. A comparison with six other alpha-subunit amino acid sequences has been made, which indicates a number of potentially important domains within alpha-subunits. PMID:3322262

  20. Coronavirus genome: prediction of putative functional domains in the non-structural polyprotein by comparative amino acid sequence analysis.

    PubMed Central

    Gorbalenya, A E; Koonin, E V; Donchenko, A P; Blinov, V M

    1989-01-01

    Amino acid sequences of 2 giant non-structural polyproteins (F1 and F2) of infectious bronchitis virus (IBV), a member of Coronaviridae, were compared, by computer-assisted methods, to sequences of a number of other positive strand RNA viral and cellular proteins. By this approach, juxtaposed putative RNA-dependent RNA polymerase, nucleic acid binding ("finger"-like) and RNA helicase domains were identified in F2. Together, these domains might constitute the core of the protein complex involved in the primer-dependent transcription, replication and recombination of coronaviruses. In F1, two cysteine protease-like domains and a growth factor-like one were revealed. One of the putative proteases of IBV is similar to 3C proteases of picornaviruses and related enzymes of como- nepo- and potyviruses. Search of IBV F1 and F2 sequences for sites similar to those cleaved by the latter proteases and intercomparison of the surrounding sequence stretches revealed 13 dipeptides Q/S(G) which are probably cleaved by the coronavirus 3C-like protease. Based on these observations, a partial tentative scheme for the functional organization and expression strategy of the non-structural polyproteins of IBV was proposed. It implies that, despite the general similarity to other positive strand RNA viruses, and particularly to potyviruses, coronaviruses possess a number of unique structural and functional features. PMID:2526320
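
    The search for Q/S(G) dipeptides can be illustrated with a short regular-expression sketch. It only locates a glutamine immediately followed by serine or glycine; the comparison of surrounding sequence stretches described in the abstract is not modeled, so real candidate lists would be narrower. The function name is hypothetical.

```python
import re


def candidate_cleavage_sites(sequence):
    """0-based indices of Q residues immediately followed by S or G."""
    # The lookahead keeps the match to the Q itself, so the reported
    # index is the position of the residue before the putative cut.
    return [m.start() for m in re.finditer(r"Q(?=[SG])", sequence.upper())]
```

    Each returned index marks the residue on the N-terminal side of a putative cut between Q and the following S or G.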

  1. Composition for nucleic acid sequencing

    DOEpatents

    Korlach, Jonas; Webb, Watt W.; Levene, Michael; Turner, Stephen; Craighead, Harold G.; Foquet, Mathieu

    2008-08-26

    The present invention is directed to a method of sequencing a target nucleic acid molecule having a plurality of bases. In its principle, the temporal order of base additions during the polymerization reaction is measured on a molecule of nucleic acid, i.e. the activity of a nucleic acid polymerizing enzyme on the template nucleic acid molecule to be sequenced is followed in real time. The sequence is deduced by identifying which base is being incorporated into the growing complementary strand of the target nucleic acid by the catalytic activity of the nucleic acid polymerizing enzyme at each step in the sequence of base additions. A polymerase on the target nucleic acid molecule complex is provided in a position suitable to move along the target nucleic acid molecule and extend the oligonucleotide primer at an active site. A plurality of labelled types of nucleotide analogs are provided proximate to the active site, with each distinguishable type of nucleotide analog being complementary to a different nucleotide in the target nucleic acid sequence. The growing nucleic acid strand is extended by using the polymerase to add a nucleotide analog to the nucleic acid strand at the active site, where the nucleotide analog being added is complementary to the nucleotide of the target nucleic acid at the active site. The nucleotide analog added to the oligonucleotide primer as a result of the polymerizing step is identified. The steps of providing labelled nucleotide analogs, polymerizing the growing nucleic acid strand, and identifying the added nucleotide analog are repeated so that the nucleic acid strand is further extended and the sequence of the target nucleic acid is determined.
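
    The step in which the sequence is deduced from the temporal order of base additions can be sketched in a few lines of Python. The sketch assumes a DNA template, error-free identification of each incorporated base, and synthesis of the growing strand in the 5'→3' direction; the function name is hypothetical and no detection chemistry is modeled.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def template_from_incorporations(incorporated):
    """Deduce the template strand (written 5'->3') from the bases added to
    the growing complementary strand, listed in the order they were added."""
    # The polymerase reads the template 3'->5', so complementing each
    # incorporated base gives the template in 3'->5' order ...
    template_3_to_5 = [COMPLEMENT[base] for base in incorporated.upper()]
    # ... and reversing it gives the conventional 5'->3' orientation.
    return "".join(reversed(template_3_to_5))
```

    For example, observing the additions A, T, G, C in that order implies a template region that reads GCAT in the 5'→3' direction.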

  2. Sequence Alignment to Predict Across Species Susceptibility ...

    EPA Pesticide Factsheets

    Conservation of a molecular target across species can be used as a line-of-evidence to predict the likelihood of chemical susceptibility. The web-based Sequence Alignment to Predict Across Species Susceptibility (SeqAPASS) tool was developed to simplify, streamline, and quantitatively assess protein sequence/structural similarity across taxonomic groups as a means to predict relative intrinsic susceptibility. The intent of the tool is to allow for evaluation of any potential protein target, so it is amenable to variable degrees of protein characterization, depending on available information about the chemical/protein interaction and the molecular target itself. To allow for flexibility in the analysis, a layered strategy was adopted for the tool. The first level of the SeqAPASS analysis compares primary amino acid sequences to a query sequence, calculating a metric for sequence similarity (including detection of candidate orthologs), the second level evaluates sequence similarity within selected domains (e.g., ligand-binding domain, DNA binding domain), and the third level of analysis compares individual amino acid residue positions identified as being of importance for protein conformation and/or ligand binding upon chemical perturbation. Each level of the SeqAPASS analysis provides increasing evidence to apply toward rapid, screening-level assessments of probable cross species susceptibility. Such analyses can support prioritization of chemicals for further ev

  3. Amino Acid Sequence Autocorrelation vectors and ensembles of Bayesian-Regularized Genetic Neural Networks for prediction of conformational stability of human lysozyme mutants.

    PubMed

    Caballero, Julio; Fernández, Leyden; Abreu, José Ignacio; Fernández, Michael

    2006-01-01

    Development of novel computational approaches for modeling protein properties from their primary structure is a main goal in applied proteomics. In this work, we report the extension of the autocorrelation vector formalism to amino acid sequences for encoding protein structural information for modeling purposes. Amino Acid Sequence Autocorrelation (AASA) vectors were calculated by measuring the autocorrelations at sequence lags ranging from 1 to 15 on the protein primary structure of 48 amino acid/residue properties selected from the AAindex database. A total of 720 AASA descriptors were tested for building predictive models of the thermal unfolding Gibbs free energy change of human lysozyme mutants. Ensembles of Bayesian-Regularized Genetic Neural Networks (BRGNNs) were used to obtain an optimum nonlinear model for the conformational stability. The ensemble predictor explained about 88% and 68% of the variance of the data in the training and test sets, respectively. Furthermore, the optimum AASA vector subset was shown not only to successfully model unfolding thermal stability but also to distribute wild-type and mutant lysozymes on a stability Self-organized Map (SOM) when used for unsupervised training of competitive neurons.
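
    The descriptor count in this abstract follows from 48 properties times 15 lags, giving 720 values. A rough sketch of how such a vector could be assembled is given below. The abstract does not state the exact autocorrelation formula, so the averaged product used here is an assumption, the function names are hypothetical, and the Kyte-Doolittle hydropathy scale stands in for a single AAindex property.

```python
# Kyte-Doolittle hydropathy, used only as a stand-in for one AAindex property.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def autocorrelation(values, lag):
    """Average product of property values for residues `lag` positions apart.

    Assumed form; the published AASA definition may normalize differently.
    """
    n = len(values)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(values) - 1")
    return sum(values[i] * values[i + lag] for i in range(n - lag)) / (n - lag)

def aasa_vector(sequence, scales, max_lag=15):
    """Concatenate one descriptor per (property scale, lag) pair."""
    vector = []
    for scale in scales:
        values = [scale[aa] for aa in sequence]
        vector.extend(autocorrelation(values, lag) for lag in range(1, max_lag + 1))
    return vector

# Hypothetical toy sequence; 48 copies of one scale just to show the dimension.
toy = "ACDEFGHIKLMNPQRSTVWYACDEFG"
print(len(aasa_vector(toy, [HYDROPATHY] * 48)))  # 720
```

    In practice each of the 48 scales would be a different property, and the resulting 720 descriptors would be the inputs from which the model selects a subset.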

  4. Prediction, sequences and the hippocampus

    PubMed Central

    Lisman, John; Redish, A.D.

    2009-01-01

    Recordings of rat hippocampal place cells have provided information about how the hippocampus retrieves memory sequences. One line of evidence has to do with phase precession, a process organized by theta and gamma oscillations. This precession can be interpreted as the cued prediction of the sequence of upcoming positions. In support of this interpretation, experiments in two-dimensional environments and on a cue-rich linear track demonstrate that many cells represent a position ahead of the animal and that this position is the same irrespective of which direction the rat is coming from. Other lines of investigation have demonstrated that such predictive processes also occur in the non-spatial domain and that retrieval can be internally or externally cued. The mechanism of sequence retrieval and the usefulness of this retrieval to guide behaviour are discussed. PMID:19528000

  5. Characterization of cDNA clones for human myeloperoxidase: predicted amino acid sequence and evidence for multiple mRNA species.

    PubMed Central

    Johnson, K R; Nauseef, W M; Care, A; Wheelock, M J; Shane, S; Hudson, S; Koeffler, H P; Selsted, M; Miller, C; Rovera, G

    1987-01-01

    Myeloperoxidase is a component of the microbicidal network of polymorphonuclear leukocytes. The enzyme is a tetramer consisting of two heavy and two light subunits. A large proportion of humans demonstrate genetic deficiencies in the production of myeloperoxidase. As a first step in analyzing these deficiencies in more detail, we have isolated cDNA clones for myeloperoxidase from an expression library of the HL-60 human promyelocytic leukemia cell line. Two overlapping plasmids (pMP02 and pMP062) were identified as myeloperoxidase cDNA clones based on the detection with myeloperoxidase antiserum of 70 kDa protein expressed in pMP02-containing bacteria and a 75 kDa polypeptide produced by hybridization selection and translation using pMP062 and HL-60 RNA. Formal identification of the clones was made by matching the predicted amino acid sequences with the amino terminal sequences of the heavy and light subunits. Both subunits are encoded by one mRNA in the following order: pre-pro-sequences--light subunit--heavy subunit. The molecular weight of the predicted primary translation product is 83.7 kDa. Northern blots reveal two size classes of hybridizing RNAs (approximately 3.0-3.3 and 3.5-4.0 kilobases) whose expression is restricted to cells of the granulocytic lineage and parallels the changes in enzymatic activity observed during differentiation. PMID:3031585

  6. High speed nucleic acid sequencing

    DOEpatents

    Korlach, Jonas [Ithaca, NY; Webb, Watt W [Ithaca, NY; Levene, Michael [Ithaca, NY; Turner, Stephen [Ithaca, NY; Craighead, Harold G [Ithaca, NY; Foquet, Mathieu [Ithaca, NY

    2011-05-17

    The present invention is directed to a method of sequencing a target nucleic acid molecule having a plurality of bases. In its principle, the temporal order of base additions during the polymerization reaction is measured on a molecule of nucleic acid. Each type of labeled nucleotide comprises an acceptor fluorophore attached to a phosphate portion of the nucleotide such that the fluorophore is removed upon incorporation into a growing strand. Fluorescent signal is emitted via fluorescent resonance energy transfer between the donor fluorophore and the acceptor fluorophore as each nucleotide is incorporated into the growing strand. The sequence is deduced by identifying which base is being incorporated into the growing strand.

  7. Amino acid sequence of the AhR1 ligand-binding domain predicts avian sensitivity to dioxin like compounds: in vivo verification in European starlings.

    PubMed

    Eng, Margaret L; Elliott, John E; Jones, Stephanie P; Williams, Tony D; Drouillard, Ken G; Kennedy, Sean W

    2014-12-01

    Research has demonstrated that the sensitivity of avian species to the embryotoxic effects of dioxin-like compounds can be predicted by the amino acid identities at two key sites within the ligand-binding domain of the aryl hydrocarbon receptor 1 (AhR1). The domestic chicken (Gallus gallus domesticus) has been established as a highly sensitive species to the toxic effects of dioxin-like compounds. Results from genotyping and in vitro assays predict that the European starling (Sturnus vulgaris) is also highly sensitive to dioxin-like compound toxicity. The objective of the present study was to test that prediction in vivo. To do this, we used egg injections in field nesting starlings with 3,3',4,4',5-pentachlorobiphenyl (PCB-126), a dioxin-like polychlorinated biphenyl. Eggs were dosed with either the vehicle control or 1 of 5 doses (1.4, 7.1, 15.9, 32.1, and 52.9 ng PCB-126/g egg). A dose-dependent increase in embryo mortality occurred, and the median lethal dose (LD50; 95% confidence interval [CI]) was 5.61 (2.33-9.08) ng/g. Hepatic CYP1A4/5 messenger RNA (mRNA) expression in hatchlings also increased in a dose-dependent manner, with CYP1A4 being more induced than CYP1A5. No effect of dose on morphological measures was seen, and we did not observe any overt malformations. These results indicate that, other than the chicken, the European starling is the most sensitive species to the effects of PCB-126 on avian embryo mortality reported to date, which supports the prediction of relative sensitivity to dioxin-like compounds based on amino acid sequence of the AhR1. © 2014 SETAC.

  8. Predict protein structural class for low-similarity sequences by evolutionary difference information into the general form of Chou's pseudo amino acid composition.

    PubMed

    Zhang, Lichao; Zhao, Xiqiang; Kong, Liang

    2014-08-21

    Knowledge of protein structural class plays an important role in characterizing the overall folding type of a given protein. It remains a challenge in computational biology to predict the structural class of low-similarity sequences using sequence information alone. In this study, a novel sequence representation method based on the position specific scoring matrix is proposed for protein structural class prediction. Using a defined evolutionary difference formula, proteins of varying length are expressed as vectors of uniform dimension, which represent evolutionary difference information between the adjacent residues of a given protein. To evaluate the proposed method, a support vector machine and jackknife tests are employed on three widely used datasets (25PDB, 1189 and 640) with sequence similarity lower than 25%, 40% and 25%, respectively. Comparison of our results with previous methods suggests that our approach is a promising way to predict protein structural class, especially for low-similarity sequences.

  9. Discrete sequence prediction and its applications

    NASA Technical Reports Server (NTRS)

    Laird, Philip

    1992-01-01

    Learning from experience to predict sequences of discrete symbols is a fundamental problem in machine learning with many applications. We apply sequence prediction using a simple and practical sequence-prediction algorithm, called TDAG. The TDAG algorithm is first tested by comparing its performance with some common data compression algorithms. Then it is adapted to the detailed requirements of dynamic program optimization, with excellent results.
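
    The abstract does not describe how TDAG works internally, so the sketch below is not TDAG. It is a generic fixed-order context-counting predictor, shown only to illustrate what predicting the next discrete symbol from experience can look like. The class name and the `order` parameter are hypothetical.

```python
from collections import Counter, defaultdict

class ContextPredictor:
    """Predict the next symbol from counts of what followed recent contexts."""

    def __init__(self, order=2):
        self.order = order
        self.counts = defaultdict(Counter)  # context tuple -> next-symbol counts
        self.history = []

    def _context(self, k):
        # The last k symbols seen so far (empty tuple when k == 0).
        return tuple(self.history[len(self.history) - k:])

    def update(self, symbol):
        # Record the symbol under every context of length 0..order.
        for k in range(min(self.order, len(self.history)) + 1):
            self.counts[self._context(k)][symbol] += 1
        self.history.append(symbol)

    def predict(self):
        # Prefer the longest context that has been observed; fall back to shorter ones.
        for k in range(min(self.order, len(self.history)), -1, -1):
            seen = self.counts.get(self._context(k))
            if seen:
                return seen.most_common(1)[0][0]
        return None  # nothing observed yet

predictor = ContextPredictor(order=2)
for symbol in "abcabcab":
    predictor.update(symbol)
print(predictor.predict())  # c
```

    The same counts can drive a compressor, since a symbol that is predicted well can be coded cheaply, which is one reason such predictors are often compared against data compression algorithms.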

  10. Chip-based sequencing nucleic acids

    DOEpatents

    Beer, Neil Reginald

    2014-08-26

    A system for fast DNA sequencing by amplification of genetic material within microreactors, denaturing, demulsifying, and then sequencing the material, while retaining it in a PCR/sequencing zone by a magnetic field. One embodiment includes sequencing nucleic acids on a microchip that includes a microchannel flow channel in the microchip. The nucleic acids are isolated and hybridized to magnetic nanoparticles or to magnetic polystyrene-coated beads. Microreactor droplets are formed in the microchannel flow channel. The microreactor droplets containing the nucleic acids and the magnetic nanoparticles are retained in a magnetic trap in the microchannel flow channel and sequenced.

  11. KM+, a mannose-binding lectin from Artocarpus integrifolia: amino acid sequence, predicted tertiary structure, carbohydrate recognition, and analysis of the beta-prism fold.

    PubMed Central

    Rosa, J. C.; De Oliveira, P. S.; Garratt, R.; Beltramini, L.; Resing, K.; Roque-Barreira, M. C.; Greene, L. J.

    1999-01-01

    The complete amino acid sequence of the lectin KM+ from Artocarpus integrifolia (jackfruit), which contains 149 residues/mol, is reported and compared to those of other members of the Moraceae family, particularly that of jacalin, also from jackfruit, with which it shares 52% sequence identity. KM+ presents an acetyl-blocked N-terminus and is not posttranslationally modified by proteolytic cleavage as is the case for jacalin. Rather, it possesses a short, glycine-rich linker that unites the regions homologous to the alpha- and beta-chains of jacalin. The results of homology modeling implicate the linker sequence in sterically impeding rotation of the side chain of Asp141 within the binding site pocket. As a consequence, the aspartic acid is locked into a conformation adequate only for the recognition of equatorial hydroxyl groups on the C4 epimeric center (alpha-D-mannose, alpha-D-glucose, and their derivatives). In contrast, the internal cleavage of the jacalin chain permits free rotation of the homologous aspartic acid, rendering it capable of accepting hydrogen bonds from both possible hydroxyl configurations on C4. We suggest that, together with direct recognition of epimeric hydroxyls and the steric exclusion of disfavored ligands, conformational restriction of the lectin should be considered to be a new mechanism by which selectivity may be built into carbohydrate binding sites. Jacalin and KM+ adopt the beta-prism fold already observed in two unrelated protein families. Despite presenting little or no sequence similarity, an analysis of the beta-prism reveals a canonical feature repeatedly present in all such structures, which is based on six largely hydrophobic residues within a beta-hairpin containing two classic-type beta-bulges. We suggest the term beta-prism motif to describe this feature. PMID:10210179

  12. Dipeptide Sequence Determination: Analyzing Phenylthiohydantoin Amino Acids by HPLC

    NASA Astrophysics Data System (ADS)

    Barton, Janice S.; Tang, Chung-Fei; Reed, Steven S.

    2000-02-01

    Amino acid composition and sequence determination, important techniques for characterizing peptides and proteins, are essential for predicting conformation and studying sequence alignment. This experiment presents improved, fundamental methods of sequence analysis for an upper-division biochemistry laboratory. Working in pairs, students use the Edman reagent to prepare phenylthiohydantoin derivatives of amino acids for determination of the sequence of an unknown dipeptide. With a single HPLC technique, students identify both the N-terminal amino acid and the composition of the dipeptide. This method yields good precision of retention times and allows use of a broad range of amino acids as components of the dipeptide. Students learn fundamental principles and techniques of sequence analysis and HPLC.

  13. Amino acid composition predicts prion activity.

    PubMed

    Afsar Minhas, Fayyaz Ul Amir; Ross, Eric D; Ben-Hur, Asa

    2017-04-10

    Many prion-forming proteins contain glutamine/asparagine (Q/N) rich domains, and there are conflicting opinions as to the role of primary sequence in their conversion to the prion form: is this phenomenon driven primarily by amino acid composition, or, as a recent computational analysis suggested, is it dependent on the presence of short sequence elements with high amyloid-forming potential? The argument for the importance of short sequence elements hinged on the relatively high accuracy obtained using a method that utilizes a collection of length-six sequence elements with known amyloid-forming potential. We weigh in on this question and demonstrate that when those sequence elements are permuted, even higher accuracy is obtained; we also propose a novel multiple-instance machine learning method that uses sequence composition alone, and achieves better accuracy than all existing prion prediction approaches. While we expect there to be elements of primary sequence that affect the process, our experiments suggest that sequence composition alone is sufficient for predicting protein sequences that are likely to form prions. A web-server for the proposed method is available at http://faculty.pieas.edu.pk/fayyaz/prank.html, and the code for reproducing our experiments is available at http://doi.org/10.5281/zenodo.167136.
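
    The distinction between composition and primary sequence can be made concrete with a small sketch. It computes a 20-dimensional composition vector and the Q/N fraction, and shows that both are unchanged when the residues are reordered. The function names and toy sequences are hypothetical, and this covers only the feature idea, not the multiple-instance learning method the authors propose.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(sequence):
    """Fraction of each of the 20 standard amino acids, in AMINO_ACIDS order."""
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]

def qn_fraction(sequence):
    """Combined fraction of glutamine (Q) and asparagine (N)."""
    comp = dict(zip(AMINO_ACIDS, composition(sequence)))
    return comp["Q"] + comp["N"]

# Hypothetical toy sequences: the second is a reordering of the first.
original = "QNQNAG"
permuted = "GANQNQ"
print(composition(original) == composition(permuted))  # True
print(round(qn_fraction(original), 3))                 # 0.667
```

    Any classifier trained on such vectors cannot, by construction, use residue order, which is what makes composition-only features a clean test of the question posed in the abstract.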

  14. Distinguishing Proteins From Arbitrary Amino Acid Sequences

    PubMed Central

    Yau, Stephen S.-T.; Mao, Wei-Guang; Benson, Max; He, Rong Lucy

    2015-01-01

    What kinds of amino acid sequences could possibly be protein sequences? From all existing databases that we can find, known proteins are only a small fraction of all possible combinations of amino acids. Beginning with Sanger's first detailed determination of a protein sequence in 1952, previous studies have focused on describing the structure of existing protein sequences in order to construct the protein universe. No one, however, has developed a criterion for determining whether an arbitrary amino acid sequence can be a protein. Here we show that when the collection of arbitrary amino acid sequences is viewed in an appropriate geometric context, the protein sequences cluster together. This leads to a new computational test, described here, that has proved to be remarkably accurate at determining whether an arbitrary amino acid sequence can be a protein. Even more, if the results of this test indicate that the sequence can be a protein, and it is indeed a protein sequence, then its identity as a protein sequence is uniquely defined. We anticipate our computational test will be useful for those who are attempting to complete the job of discovering all proteins, or constructing the protein universe. PMID:25609314

  15. The complete amino acid sequence of prochymosin.

    PubMed Central

    Foltmann, B; Pedersen, V B; Jacobsen, H; Kauffman, D; Wybrandt, G

    1977-01-01

    The total sequence of 365 amino acid residues in bovine prochymosin is presented. Alignment with the amino acid sequence of porcine pepsinogen shows that 204 amino acid residues are common to the two zymogens. Further comparison and alignment with the amino acid sequence of penicillopepsin shows that 66 residues are located at identical positions in all three proteases. The three enzymes belong to a large group of proteases with two aspartate residues in the active center. This group forms a family derived from one common ancestor. PMID:329280
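
    The residue counts reported above can be turned into percentages with a short calculation. The sketch below also shows how identical positions could be counted in a gapped pairwise alignment. Expressing the counts relative to the 365 residues of prochymosin is an assumption made here for illustration, and the function names and the toy alignment are hypothetical.

```python
def identical_positions(aligned_a, aligned_b, gap="-"):
    """Count alignment columns where both sequences carry the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    return sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap)

def percent_of_length(count, length):
    """Express a residue count as a percentage of a reference length."""
    return 100.0 * count / length

# Hypothetical toy alignment (not prochymosin or pepsinogen).
print(identical_positions("ACD-EF", "ACDKEY"))  # 4

# Counts from the abstract, relative to the 365 residues of prochymosin.
print(round(percent_of_length(204, 365), 1))  # 55.9
print(round(percent_of_length(66, 365), 1))   # 18.1
```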

  16. Method for sequencing nucleic acid molecules

    DOEpatents

    Korlach, Jonas; Webb, Watt W.; Levene, Michael; Turner, Stephen; Craighead, Harold G.; Foquet, Mathieu

    2006-06-06

    The present invention is directed to a method of sequencing a target nucleic acid molecule having a plurality of bases. In its principle, the temporal order of base additions during the polymerization reaction is measured on a molecule of nucleic acid, i.e. the activity of a nucleic acid polymerizing enzyme on the template nucleic acid molecule to be sequenced is followed in real time. The sequence is deduced by identifying which base is being incorporated into the growing complementary strand of the target nucleic acid by the catalytic activity of the nucleic acid polymerizing enzyme at each step in the sequence of base additions. A polymerase on the target nucleic acid molecule complex is provided in a position suitable to move along the target nucleic acid molecule and extend the oligonucleotide primer at an active site. A plurality of labelled types of nucleotide analogs are provided proximate to the active site, with each distinguishable type of nucleotide analog being complementary to a different nucleotide in the target nucleic acid sequence. The growing nucleic acid strand is extended by using the polymerase to add a nucleotide analog to the nucleic acid strand at the active site, where the nucleotide analog being added is complementary to the nucleotide of the target nucleic acid at the active site. The nucleotide analog added to the oligonucleotide primer as a result of the polymerizing step is identified. The steps of providing labelled nucleotide analogs, polymerizing the growing nucleic acid strand, and identifying the added nucleotide analog are repeated so that the nucleic acid strand is further extended and the sequence of the target nucleic acid is determined.

  17. Method for sequencing nucleic acid molecules

    DOEpatents

    Korlach, Jonas; Webb, Watt W.; Levene, Michael; Turner, Stephen; Craighead, Harold G.; Foquet, Mathieu

    2006-05-30

    The present invention is directed to a method of sequencing a target nucleic acid molecule having a plurality of bases. In its principle, the temporal order of base additions during the polymerization reaction is measured on a molecule of nucleic acid, i.e. the activity of a nucleic acid polymerizing enzyme on the template nucleic acid molecule to be sequenced is followed in real time. The sequence is deduced by identifying which base is being incorporated into the growing complementary strand of the target nucleic acid by the catalytic activity of the nucleic acid polymerizing enzyme at each step in the sequence of base additions. A polymerase on the target nucleic acid molecule complex is provided in a position suitable to move along the target nucleic acid molecule and extend the oligonucleotide primer at an active site. A plurality of labelled types of nucleotide analogs are provided proximate to the active site, with each distinguishable type of nucleotide analog being complementary to a different nucleotide in the target nucleic acid sequence. The growing nucleic acid strand is extended by using the polymerase to add a nucleotide analog to the nucleic acid strand at the active site, where the nucleotide analog being added is complementary to the nucleotide of the target nucleic acid at the active site. The nucleotide analog added to the oligonucleotide primer as a result of the polymerizing step is identified. The steps of providing labelled nucleotide analogs, polymerizing the growing nucleic acid strand, and identifying the added nucleotide analog are repeated so that the nucleic acid strand is further extended and the sequence of the target nucleic acid is determined.

  18. Method for sequencing nucleic acid molecules

    DOEpatents

    Korlach, Jonas; Webb, Watt W.; Levene, Michael; Turner, Stephen; Craighead, Harold G.; Foquet, Mathieu

    2006-06-06

    The present invention is directed to a method of sequencing a target nucleic acid molecule having a plurality of bases. In its principle, the temporal order of base additions during the polymerization reaction is measured on a molecule of nucleic acid, i.e. the activity of a nucleic acid polymerizing enzyme on the template nucleic acid molecule to be sequenced is followed in real time. The sequence is deduced by identifying which base is being incorporated into the growing complementary strand of the target nucleic acid by the catalytic activity of the nucleic acid polymerizing enzyme at each step in the sequence of base additions. A polymerase on the target nucleic acid molecule complex is provided in a position suitable to move along the target nucleic acid molecule and extend the oligonucleotide primer at an active site. A plurality of labelled types of nucleotide analogs are provided proximate to the active site, with each distinguishable type of nucleotide analog being complementary to a different nucleotide in the target nucleic acid sequence. The growing nucleic acid strand is extended by using the polymerase to add a nucleotide analog to the nucleic acid strand at the active site, where the nucleotide analog being added is complementary to the nucleotide of the target nucleic acid at the active site. The nucleotide analog added to the oligonucleotide primer as a result of the polymerizing step is identified. The steps of providing labelled nucleotide analogs, polymerizing the growing nucleic acid strand, and identifying the added nucleotide analog are repeated so that the nucleic acid strand is further extended and the sequence of the target nucleic acid is determined.

  19. Temporal and kinematic consistency predict sequence awareness.

    PubMed

    Jaynes, Molly J; Schieber, Marc H; Mink, Jonathan W

    2016-10-01

    Many human motor skills can be represented as a hierarchical series of movement patterns. Awareness of underlying patterns can improve performance and decrease cognitive load. Subjects (n = 30) tapped a finger sequence with changing stimulus-to-response mapping and a common movement sequence. Thirteen subjects (43%) became aware that they were tapping a familiar movement sequence during the experiment. Subjects who became aware of the underlying motor pattern tapped with greater kinematic and temporal consistency from task onset, but consistency was not sufficient for awareness. We found no effect of age, musical experience, tapping evenness, or inter-key-interval on awareness of the pattern in the motor response. We propose that temporal or kinematic consistency reinforces a pattern representation, but cognitive engagement with the contents of the sequence is necessary to bring the pattern to conscious awareness. These findings predict benefit for movement strategies that limit temporal and kinematic variability during motor learning.

  20. Protein structure prediction from sequence variation

    PubMed Central

    Marks, Debora S; Hopf, Thomas A; Sander, Chris

    2015-01-01

    Genomic sequences contain rich evolutionary information about functional constraints on macromolecules such as proteins. This information can be efficiently mined to detect evolutionary couplings between residues in proteins and address the long-standing challenge to compute protein three-dimensional structures from amino acid sequences. Substantial progress has recently been made on this problem owing to the explosive growth in available sequences and the application of global statistical methods. In addition to three-dimensional structure, the improved understanding of covariation may help identify functional residues involved in ligand binding, protein-complex formation and conformational changes. We expect computation of covariation patterns to complement experimental structural biology in elucidating the full spectrum of protein structures, their functional interactions and evolutionary dynamics. PMID:23138306

  1. Sequence memory for prediction, inference and behaviour

    PubMed Central

    Hawkins, Jeff; George, Dileep; Niemasik, Jamie

    2009-01-01

    In this paper, we propose a mechanism which the neocortex may use to store sequences of patterns. Storing and recalling sequences are necessary for making predictions, recognizing time-based patterns and generating behaviour. Since these tasks are major functions of the neocortex, the ability to store and recall time-based sequences is probably a key attribute of many, if not all, cortical areas. Previously, we have proposed that the neocortex can be modelled as a hierarchy of memory regions, each of which learns and recalls sequences. This paper proposes how each region of neocortex might learn the sequences necessary for this theory. The basis of the proposal is that all the cells in a cortical column share bottom-up receptive field properties, but individual cells in a column learn to represent unique incidences of the bottom-up receptive field property within different sequences. We discuss the proposal, the biological constraints that led to it and some results modelling it. PMID:19528001

  2. 77 FR 65537 - Requirements for Patent Applications Containing Nucleotide Sequence and/or Amino Acid Sequence...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2012-10-29

    ... Amino Acid Sequence Disclosures ACTION: Proposed collection; comment request. SUMMARY: The United States....'' SUPPLEMENTARY INFORMATION: I. Abstract Patent applications that contain nucleotide and/or amino acid sequence...

  3. Analysis and Annotation of Nucleic Acid Sequence

    SciTech Connect

    States, David J.

    2004-07-28

    The aims of this project were to develop improved methods for computational genome annotation and to apply these methods to improve the annotation of genomic sequence data with a specific focus on human genome sequencing. The project resulted in a substantial body of published work. Notable contributions of this project were the identification of basecalling and lane tracking as error processes in genome sequencing and contributions to improved methods for these steps in genome sequencing. This technology improved the accuracy and throughput of genome sequence analysis. Probabilistic methods for physical map construction were developed. Improved methods for sequence alignment, alternative splicing analysis, promoter identification and NF kappa B response gene prediction were also developed.

  4. Predicting pseudoknotted structures across two RNA sequences

    PubMed Central

    Sperschneider, Jana; Datta, Amitava; Wise, Michael J.

    2012-01-01

    Motivation: Laboratory RNA structure determination is demanding and costly and thus, computational structure prediction is an important task. Single sequence methods for RNA secondary structure prediction are limited by the accuracy of the underlying folding model; if a structure is supported by a family of evolutionarily related sequences, one can be more confident that the prediction is accurate. RNA pseudoknots are functional elements, which have highly conserved structures. However, few comparative structure prediction methods can handle pseudoknots due to the computational complexity. Results: A comparative pseudoknot prediction method called DotKnot-PW is introduced based on structural comparison of secondary structure elements and H-type pseudoknot candidates. DotKnot-PW outperforms other methods from the literature on a hand-curated test set of RNA structures with experimental support. Availability: DotKnot-PW and the RNA structure test set are available at the web site http://dotknot.csse.uwa.edu.au/pw. Contact: janaspe@csse.uwa.edu.au Supplementary information: Supplementary data are available at Bioinformatics online. PMID:23044552

  5. Lossless Video Sequence Compression Using Adaptive Prediction

    NASA Technical Reports Server (NTRS)

    Li, Ying; Sayood, Khalid

    2007-01-01

    We present an adaptive lossless video compression algorithm based on predictive coding. The proposed algorithm exploits temporal, spatial, and spectral redundancies in a backward adaptive fashion with extremely low side information. The computational complexity is further reduced by using a caching strategy. We also study the relationship between the operational domain for the coder (wavelet or spatial) and the amount of temporal and spatial redundancy in the sequence being encoded. Experimental results show that the proposed scheme provides significant improvements in compression efficiencies.
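
    The idea of backward-adaptive predictive coding can be illustrated with a toy example. It is not the authors' algorithm: it treats a frame as a one-dimensional row of integer pixels, switches between a temporal predictor (same pixel in the previous frame) and a spatial predictor (left neighbour) using only already-coded data, and omits the entropy coding of the residuals. All names are hypothetical.

```python
def _choose_and_predict(prev_frame, decoded_so_far, i, use_temporal):
    """Return (temporal, spatial, prediction) for pixel i."""
    temporal = prev_frame[i]
    spatial = decoded_so_far[i - 1] if i > 0 else temporal
    return temporal, spatial, (temporal if use_temporal else spatial)

def encode(prev_frame, frame):
    """Turn a frame into prediction residuals."""
    residuals, use_temporal = [], True
    for i, pixel in enumerate(frame):
        temporal, spatial, prediction = _choose_and_predict(prev_frame, frame, i, use_temporal)
        residuals.append(pixel - prediction)
        # Backward adaptation: the choice for the next pixel depends only on
        # values the decoder will also have, so no side information is sent.
        use_temporal = abs(pixel - temporal) <= abs(pixel - spatial)
    return residuals

def decode(prev_frame, residuals):
    """Rebuild the frame exactly from the residuals."""
    frame, use_temporal = [], True
    for i, residual in enumerate(residuals):
        temporal, spatial, prediction = _choose_and_predict(prev_frame, frame, i, use_temporal)
        pixel = prediction + residual
        frame.append(pixel)
        use_temporal = abs(pixel - temporal) <= abs(pixel - spatial)
    return frame

# Hypothetical toy frames.
previous = [10, 12, 14, 200, 201]
current = [11, 12, 15, 90, 92]
assert decode(previous, encode(previous, current)) == current
```

    Because encoder and decoder update the predictor choice from the same reconstructed pixels, the scheme stays lossless while the residuals tend to be smaller than the raw pixel values wherever one of the two predictors fits well.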

  6. Lossless Video Sequence Compression Using Adaptive Prediction

    NASA Technical Reports Server (NTRS)

    Li, Ying; Sayood, Khalid

    2007-01-01

    We present an adaptive lossless video compression algorithm based on predictive coding. The proposed algorithm exploits temporal, spatial, and spectral redundancies in a backward adaptive fashion with extremely low side information. The computational complexity is further reduced by using a caching strategy. We also study the relationship between the operational domain for the coder (wavelet or spatial) and the amount of temporal and spatial redundancy in the sequence being encoded. Experimental results show that the proposed scheme provides significant improvements in compression efficiencies.

  7. Phenolic acid esterases, coding sequences and methods

    DOEpatents

    Blum, David L.; Kataeva, Irina; Li, Xin-Liang; Ljungdahl, Lars G.

    2002-01-01

    Described herein are four phenolic acid esterases, three of which correspond to domains of previously unknown function within bacterial xylanases, from XynY and XynZ of Clostridium thermocellum and from a xylanase of Ruminococcus. The fourth specifically exemplified xylanase is a protein encoded within the genome of Orpinomyces PC-2. The amino acids of these polypeptides and nucleotide sequences encoding them are provided. Recombinant host cells, expression vectors and methods for the recombinant production of phenolic acid esterases are also provided.

  8. Computer analysis and structure prediction of nucleic acids and proteins.

    PubMed Central

    Kanehisa, M; Klein, P; Greif, P; DeLisi, C

    1984-01-01

    We have developed an integrated computer system for analysis of nucleic acid and protein sequences, which consists of sequence and structure databases, a relational database, and software for structural analysis. The system is potentially applicable to a number of problems in structural biology including predictive classification of the function and location of oncogene products. PMID:6546426

  9. Prediction of protein function from protein sequence and structure.

    PubMed

    Whisstock, James C; Lesk, Arthur M

    2003-08-01

    The sequence of a genome contains the plans of the possible life of an organism, but implementation of genetic information depends on the functions of the proteins and nucleic acids that it encodes. Many individual proteins of known sequence and structure present challenges to the understanding of their function. In particular, a number of genes responsible for diseases have been identified but their specific functions are unknown. Whole-genome sequencing projects are a major source of proteins of unknown function. Annotation of a genome involves assignment of functions to gene products, in most cases on the basis of amino-acid sequence alone. 3D structure can aid the assignment of function, motivating the challenge of structural genomics projects to make structural information available for novel uncharacterized proteins. Structure-based identification of homologues often succeeds where sequence-alone-based methods fail, because in many cases evolution retains the folding pattern long after sequence similarity becomes undetectable. Nevertheless, prediction of protein function from sequence and structure is a difficult problem, because homologous proteins often have different functions. Many methods of function prediction rely on identifying similarity in sequence and/or structure between a protein of unknown function and one or more well-understood proteins. Alternative methods include inferring conservation patterns in members of a functionally uncharacterized family for which many sequences and structures are known. However, these inferences are tenuous. Such methods provide reasonable guesses at function, but are far from foolproof. It is therefore fortunate that the development of whole-organism approaches and comparative genomics permits other approaches to function prediction when the data are available. These include the use of protein-protein interaction patterns, and correlations between occurrences of related proteins in different organisms, as

  10. Method for identifying and quantifying nucleic acid sequence aberrations

    DOEpatents

    Lucas, J.N.; Straume, T.; Bogen, K.T.

    1998-07-21

    A method is disclosed for detecting nucleic acid sequence aberrations by detecting nucleic acid sequences having both a first and a second nucleic acid sequence type, the presence of the first and second sequence type on the same nucleic acid sequence indicating the presence of a nucleic acid sequence aberration. The method uses a first hybridization probe which includes a nucleic acid sequence that is complementary to a first sequence type and a first complexing agent capable of attaching to a second complexing agent and a second hybridization probe which includes a nucleic acid sequence that selectively hybridizes to the second nucleic acid sequence type over the first sequence type and includes a detectable marker for detecting the second hybridization probe. 11 figs.

  11. Method for identifying and quantifying nucleic acid sequence aberrations

    DOEpatents

    Lucas, Joe N.; Straume, Tore; Bogen, Kenneth T.

    1998-01-01

    A method for detecting nucleic acid sequence aberrations by detecting nucleic acid sequences having both a first and a second nucleic acid sequence type, the presence of the first and second sequence type on the same nucleic acid sequence indicating the presence of a nucleic acid sequence aberration. The method uses a first hybridization probe which includes a nucleic acid sequence that is complementary to a first sequence type and a first complexing agent capable of attaching to a second complexing agent and a second hybridization probe which includes a nucleic acid sequence that selectively hybridizes to the second nucleic acid sequence type over the first sequence type and includes a detectable marker for detecting the second hybridization probe.

  12. Structural gene and complete amino acid sequence of Vibrio alginolyticus collagenase.

    PubMed Central

    Takeuchi, H; Shibano, Y; Morihara, K; Fukushima, J; Inami, S; Keil, B; Gilles, A M; Kawamoto, S; Okuda, K

    1992-01-01

    The DNA encoding the collagenase of Vibrio alginolyticus was cloned, and its complete nucleotide sequence was determined. When the cloned gene was ligated to pUC18, the Escherichia coli expression vector, bacteria carrying the gene exhibited both collagenase antigen and collagenase activity. The open reading frame from the ATG initiation codon was 2442 bp in length for the collagenase structural gene. The amino acid sequence, deduced from the nucleotide sequence, revealed that the mature collagenase consists of 739 amino acids with an Mr of 81875. The amino acid sequences of 20 polypeptide fragments were completely identical with the deduced amino acid sequences of the collagenase gene. The amino acid composition predicted from the DNA sequence was similar to the chemically determined composition of purified collagenase reported previously. The analyses of both the DNA and amino acid sequences of the collagenase gene were rigorously performed, but we could not detect any significant sequence similarity to other collagenases. Images Fig. 2. PMID:1311172

  13. Extensive amino acid sequence homologies between animal lectins

    SciTech Connect

    Paroutaud, P.; Levi, G.; Teichberg, V.I.; Strosberg, A.D.

    1987-09-01

    The authors have established the amino acid sequence of the β-D-galactoside binding lectin from the electric eel and the sequences of several peptides from a similar lectin isolated from human placenta. These sequences were compared with the published sequences of peptides derived from the β-D-galactoside binding lectin from human lung and with sequences deduced from cDNAs assigned to the β-D-galactoside binding lectins from chicken embryo skin and human hepatomas. Significant homologies were observed. One of the highly conserved regions that contains a tryptophan residue and two glutamic acid residues is probably part of the β-D-galactoside binding site, which, on the basis of spectroscopic studies of the electric eel lectin, is expected to contain such residues. The similarity of the hydropathy profiles and the predicted secondary structure of the lectins from chicken skin and electric eel, in spite of differences in their amino acid sequences, strongly suggests that these proteins have maintained structural homologies during evolution and together with the other β-D-galactoside binding lectins were derived from a common ancestor gene.

  14. Methods for analyzing nucleic acid sequences

    DOEpatents

    Korlach, Jonas; Webb, Watt W.; Levene, Michael; Turner, Stephen; Craighead, Harold G.; Foquet, Mathieu

    2011-05-17

    The present invention is directed to a method of sequencing a target nucleic acid. The method provides a complex comprising a polymerase enzyme, a target nucleic acid molecule, and a primer, wherein the complex is immobilized on a support. A fluorescent label is attached to a terminal phosphate group of the nucleotide or nucleotide analog. The growing nucleic acid strand is extended by using the polymerase to add a nucleotide analog to the nucleic acid strand. The nucleotide analog added to the oligonucleotide primer as a result of the polymerizing step is identified. The time duration of the signal from labeled nucleotides or nucleotide analogs that become incorporated is distinguished from freely diffusing labels by a longer retention in the observation volume for the nucleotides or nucleotide analogs that become incorporated than for the freely diffusing labels.

  15. Porcine proinsulin: characterization and amino acid sequence.

    PubMed

    Chance, R E; Ellis, R M; Bromer, W W

    1968-07-12

    Proinsulin in nearly homogeneous form has been isolated from a preparation of porcine insulin. A molecular weight close to 9100 was calculated from the amino acid composition and from sedimentation-equilibrium studies. Through the action of trypsin this single-chain protein is transformed to desalanine insulin by cleavage of a polypeptide chain connecting the carboxy-terminus of the B chain to the amino-terminus of the A chain of insulin. The amino acid sequence of this connecting peptide was found to be Arg-Arg-Glu-Ala-Gln-Asn-Pro-Gln-Ala-Gly-Ala-Val-Glu-Leu-Gly-Gly-Gly-Leu-Gly-Gly-Leu-Gln-Ala-Leu-Ala-Leu-Glu-Gly-Pro-Pro-Gln-Lys-Arg.

  16. Amino acid sequence and comparative antigenicity of chicken metallothionein.

    PubMed Central

    McCormick, C C; Fullmer, C S; Garvey, J S

    1988-01-01

    The complete amino acid sequence of metallothionein (MT) from chicken liver is reported. The primary structure was determined by automated sequence analysis of peptides produced by limited acid hydrolysis and by trypsin digestion. The comparative antigenicity of chicken MT was determined by radioimmunoassay using rabbit anti-rat MT polyclonal antibody. Chicken MT consists of 63 amino acids as compared to 61 found in MTs from mammals. One insertion (and two substitutions) occurs in the amino-terminal region, a region considered invariant among mammalian MTs. Eighteen of the 20 cysteines in chicken MT were aligned with cysteines from other mammalian sequences. Two cysteines near the carboxyl terminus are shifted by one residue due to the insertion of proline in that region. Overall, the chicken protein showed approximately 68% sequence identity in a comparison with various mammalian MTs. The affinity of the polyclonal antibody for chicken MT was decreased by 2 orders of magnitude in comparison to that of a mammalian MT (rat MT isoforms). This reduced affinity is attributed to major substitutions in chicken MT in the regions of the principal determinants of mammalian MTs. Theoretical analysis of the primary structure predicted the secondary structure to consist of reverse turns and random coils with no stable beta or helix conformations. There is no evidence that chicken MT differs functionally from mammalian MTs. PMID:2448773

  17. Inter-domain linker prediction using amino acid compositional index.

    PubMed

    Shatnawi, Maad; Zaki, Nazar

    2015-04-01

    Protein chains are generally long and consist of multiple domains. Domains are distinct structural units of a protein that can evolve and function independently. The accurate and reliable prediction of protein domain linkers and boundaries is often considered to be the initial step of protein tertiary structure and function predictions. In this paper, we introduce CISA as a method for predicting inter-domain linker regions solely from the amino acid sequence information. The method first computes the amino acid compositional index from the protein sequence dataset of domain-linker segments and the amino acid composition. A preference profile is then generated by calculating the average compositional index values along the amino acid sequence using a sliding window. Finally, the protein sequence is segmented into intervals and a simulated annealing algorithm is employed to enhance the prediction by finding the optimal threshold value for each segment that separates domains from inter-domain linkers. The method was tested on two standard protein datasets and showed considerable improvement over the state-of-the-art domain linker prediction methods.
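
    The sliding-window averaging step described in this abstract can be sketched in Python as follows. This is only an illustration: the per-residue index values, the window size, and the choice to shrink the window at the sequence ends are assumptions, not details given in the abstract, and the simulated annealing threshold search is not covered.

```python
def sliding_window_profile(values, window):
    """Average `values` over a centered window, shrinking it at the ends."""
    half = window // 2
    profile = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        segment = values[lo:hi]
        profile.append(sum(segment) / len(segment))
    return profile


# Hypothetical compositional index values for four residue types.
toy_index = {"A": 0.2, "G": -0.4, "P": -0.6, "L": 0.5}

sequence = "ALLGPPGLA"  # made-up sequence for illustration
scores = [toy_index[aa] for aa in sequence]
profile = sliding_window_profile(scores, window=3)
```

    Positions where the smoothed profile falls below some threshold would then be candidate linker residues; how that threshold is chosen per segment is the part the abstract assigns to simulated annealing.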

  18. High sequence homology between protein tyrosine acid phosphatase from boar seminal vesicles and human prostatic acid phosphatase.

    PubMed

    Wysocki, Paweł; Płucienniczak, Grazyna; Strzezek, Jerzy

    2009-01-01

    Boar seminal vesicle protein tyrosine acid phosphatase (PTAP) and human prostatic acid phosphatase (PAP) show high affinity for protein phosphotyrosine residues. The physico-chemical and kinetic properties of the boar and human enzymes are different. The main objective of this study was to establish the nucleotide sequence of cDNA encoding boar PTAP and compare it with that of human PAP cDNA. Also, the amino-acid sequence of boar PTAP was compared with the sequence of human PAP. PTAP was isolated from boar seminal vesicle fluid and sequenced. cDNA to boar seminal vesicle RNA was synthesized, amplified by PCR, cloned in E. coli and sequenced. The obtained N-terminal amino-acid sequence of boar PTAP showed 92% identity with the N-terminal amino-acid sequence of human PAP. The determined sequence of a 354 bp nucleotide fragment (GenBank accession number: GQ184596) showed 90% identity with the corresponding sequence of human PAP. On the basis of this sequence a 118 amino acid fragment of boar PTAP was predicted. This fragment showed 89% identity with the corresponding fragment of human PAP and had a similar hydropathy profile. The compared sequences differ in terms of their isoelectric points and amino-acid composition. This may explain the differences in substrate specificity and inhibitor resistance of boar PTAP and human PAP.
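
    Two pieces of arithmetic in this abstract can be checked with a short Python sketch: a 354 bp coding fragment spans 354 / 3 = 118 codons, and percent identity is the share of matching positions in an alignment. The function names and the toy peptide strings are hypothetical. The identity calculation assumes two pre-aligned, equal-length sequences without gaps, which may differ from the convention the authors used.

```python
def codon_count(nucleotide_length):
    """Number of whole codons in an in-frame fragment of the given length."""
    if nucleotide_length % 3 != 0:
        raise ValueError("length is not a multiple of 3")
    return nucleotide_length // 3


def percent_identity(seq_a, seq_b):
    """Percentage of positions that match in two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100.0 * matches / len(seq_a)


fragment_residues = codon_count(354)

# Made-up ten-residue peptides that differ at one position.
identity = percent_identity("ACDEFGHIKL", "ACDEFGHIKV")
```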

  19. Predictive uncertainty in auditory sequence processing

    PubMed Central

    Hansen, Niels Chr.; Pearce, Marcus T.

    2014-01-01

    Previous studies of auditory expectation have focused on the expectedness perceived by listeners retrospectively in response to events. In contrast, this research examines predictive uncertainty—a property of listeners' prospective state of expectation prior to the onset of an event. We examine the information-theoretic concept of Shannon entropy as a model of predictive uncertainty in music cognition. This is motivated by the Statistical Learning Hypothesis, which proposes that schematic expectations reflect probabilistic relationships between sensory events learned implicitly through exposure. Using probability estimates from an unsupervised, variable-order Markov model, 12 melodic contexts high in entropy and 12 melodic contexts low in entropy were selected from two musical repertoires differing in structural complexity (simple and complex). Musicians and non-musicians listened to the stimuli and provided explicit judgments of perceived uncertainty (explicit uncertainty). We also examined an indirect measure of uncertainty computed as the entropy of expectedness distributions obtained using a classical probe-tone paradigm where listeners rated the perceived expectedness of the final note in a melodic sequence (inferred uncertainty). Finally, we simulate listeners' perception of expectedness and uncertainty using computational models of auditory expectation. A detailed model comparison indicates which model parameters maximize fit to the data and how they compare to existing models in the literature. The results show that listeners experience greater uncertainty in high-entropy musical contexts than low-entropy contexts. This effect is particularly apparent for inferred uncertainty and is stronger in musicians than non-musicians. Consistent with the Statistical Learning Hypothesis, the results suggest that increased domain-relevant training is associated with an increasingly accurate cognitive model of probabilistic structure in music. PMID:25295018
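
    Shannon entropy, used in this study as a model of predictive uncertainty, is H = -Σ p_i log2(p_i) over the probabilities of the possible next events. The Python sketch below computes it for two made-up distributions over four candidate notes; the distributions are illustrative assumptions and are not output from the variable-order Markov model mentioned in the abstract.

```python
import math


def shannon_entropy(probabilities):
    """Entropy in bits of a discrete distribution; zero-probability terms are skipped."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# Hypothetical next-note distributions for two melodic contexts.
high_entropy_context = [0.25, 0.25, 0.25, 0.25]  # every continuation equally likely
low_entropy_context = [0.97, 0.01, 0.01, 0.01]   # one continuation dominates

h_high = shannon_entropy(high_entropy_context)
h_low = shannon_entropy(low_entropy_context)
```

    A uniform distribution over four outcomes gives 2 bits, the maximum for four outcomes, while the peaked distribution gives a much smaller value, matching the intuition that a strongly expected continuation leaves little uncertainty.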

  20. SeqAPASS: Sequence alignment to predict across-species ...

    EPA Pesticide Factsheets

    Efforts to shift the toxicity testing paradigm from whole organism studies to those focused on the initiation of toxicity and relevant pathways have led to increased utilization of in vitro and in silico methods. Hence the emergence of high-throughput screening (HTS) programs, such as U.S. EPA ToxCast, and application of the adverse outcome pathway (AOP) framework for identifying and defining biological key events triggered upon perturbation of molecular initiating events and leading to adverse outcomes occurring at a level of organization relevant for risk assessment [1]. With these recent initiatives to harness the power of “the pathway” in describing and evaluating toxicity comes the need to extrapolate data beyond the model species. Sequence alignment to predict across-species susceptibility (SeqAPASS) is a web-based tool that allows the user to begin to understand how broadly HTS data or AOP constructs may plausibly be extrapolated across species, while describing the relative intrinsic susceptibility of different taxa to chemicals with known modes of action (e.g., pharmaceuticals and pesticides). The tool rapidly and strategically assesses available molecular target information to describe protein sequence similarity at the primary amino acid sequence, conserved domain, and individual amino acid residue levels. This in silico approach to species extrapolation was designed to automate and streamline the relatively complex and time-consuming process of co

  1. MSACompro: improving multiple protein sequence alignment by predicted structural features.

    PubMed

    Deng, Xin; Cheng, Jianlin

    2014-01-01

    Multiple Sequence Alignment (MSA) is an essential tool in protein structure modeling, gene and protein function prediction, DNA motif recognition, phylogenetic analysis, and many other bioinformatics tasks. Therefore, improving the accuracy of multiple sequence alignment is an important long-term objective in bioinformatics. We designed and developed a new method, MSACompro, to incorporate predicted secondary structure, relative solvent accessibility, and residue-residue contact information into the currently most accurate posterior probability-based MSA methods to improve the accuracy of multiple sequence alignments. Unlike multiple sequence alignment methods that use the tertiary structure information of some sequences, our method uses structural information predicted purely from sequences. In this chapter, we first introduce some background and related techniques in the field of multiple sequence alignment. Then, we describe the detailed algorithm of MSACompro. Finally, we show that integrating predicted protein structural information improved the multiple sequence alignment accuracy.

  2. Detection of nucleic acid sequences by invader-directed cleavage

    DOEpatents

    Brow, Mary Ann D.; Hall, Jeff Steven Grotelueschen; Lyamichev, Victor; Olive, David Michael; Prudent, James Robert

    1999-01-01

    The present invention relates to means for the detection and characterization of nucleic acid sequences, as well as variations in nucleic acid sequences. The present invention also relates to methods for forming a nucleic acid cleavage structure on a target sequence and cleaving the nucleic acid cleavage structure in a site-specific manner. The 5' nuclease activity of a variety of enzymes is used to cleave the target-dependent cleavage structure, thereby indicating the presence of specific nucleic acid sequences or specific variations thereof. The present invention further relates to methods and devices for the separation of nucleic acid molecules based on charge.

  3. Sequence-Based Prediction of Type III Secreted Proteins

    PubMed Central

    Arnold, Roland; Brandmaier, Stefan; Kleine, Frederick; Tischler, Patrick; Heinz, Eva; Behrens, Sebastian; Niinikoski, Antti; Mewes, Hans-Werner; Horn, Matthias; Rattei, Thomas

    2009-01-01

    The type III secretion system (TTSS) is a key mechanism for host cell interaction used by a variety of bacterial pathogens and symbionts of plants and animals including humans. The TTSS represents a molecular syringe with which the bacteria deliver effector proteins directly into the host cell cytosol. Despite the importance of the TTSS for bacterial pathogenesis, recognition and targeting of type III secreted proteins has up until now been poorly understood. Several hypotheses are discussed, including an mRNA-based signal, a chaperone-mediated process, or an N-terminal signal peptide. In this study, we systematically analyzed the amino acid composition and secondary structure of N-termini of 100 experimentally verified effector proteins. Based on this, we developed a machine-learning approach for the prediction of TTSS effector proteins, taking into account N-terminal sequence features such as frequencies of amino acids, short peptides, or residues with certain physico-chemical properties. The resulting computational model revealed a strong type III secretion signal in the N-terminus that can be used to detect effectors with sensitivity of ∼71% and selectivity of ∼85%. This signal seems to be taxonomically universal and conserved among animal pathogens and plant symbionts, since we could successfully detect effector proteins if the respective group was excluded from training. The application of our prediction approach to 739 complete bacterial and archaeal genome sequences resulted in the identification of between 0% and 12% putative TTSS effector proteins. Comparison of effector proteins with orthologs that are not secreted by the TTSS showed no clear pattern of signal acquisition by fusion, suggesting convergent evolutionary processes shaping the type III secretion signal. The newly developed program EffectiveT3 (http://www.chlamydiaedb.org) is the first universal in silico prediction program for the identification of novel TTSS effectors. Our findings will

  4. Hybridization and sequencing of nucleic acids using base pair mismatches

    DOEpatents

    Fodor, Stephen P. A.; Lipshutz, Robert J.; Huang, Xiaohua

    2001-01-01

    Devices and techniques for hybridization of nucleic acids and for determining the sequence of nucleic acids. Arrays of nucleic acids are formed by techniques, preferably high resolution, light-directed techniques. Positions of hybridization of a target nucleic acid are determined by, e.g., epifluorescence microscopy. Devices and techniques are proposed to determine the sequence of a target nucleic acid more efficiently and more quickly through such synthesis and detection techniques.

  5. Quantitative assessment of protein function prediction from metagenomics shotgun sequences.

    PubMed

    Harrington, E D; Singh, A H; Doerks, T; Letunic, I; von Mering, C; Jensen, L J; Raes, J; Bork, P

    2007-08-28

    To assess the potential of protein function prediction in environmental genomics data, we analyzed shotgun sequences from four diverse and complex habitats. Using homology searches as well as customized gene neighborhood methods that incorporate intergenic and evolutionary distances, we inferred specific functions for 76% of the 1.4 million predicted ORFs in these samples (83% when nonspecific functions are considered). Surprisingly, these fractions are only slightly smaller than the corresponding ones in completely sequenced genomes (83% and 86%, respectively, by using the same methodology) and considerably higher than previously thought. For as many as 75,448 ORFs (5% of the total), only neighborhood methods can assign functions, illustrated here by a previously undescribed gene associated with the well characterized heme biosynthesis operon and a potential transcription factor that might regulate a coupling between fatty acid biosynthesis and degradation. Our results further suggest that, although functions can be inferred for most proteins on earth, many functions remain to be discovered in numerous small, rare protein families.

  6. Selecting sequence variants to improve genomic predictions for dairy cattle

    USDA-ARS's Scientific Manuscript database

    Millions of genetic variants have been identified by population-scale sequencing projects, but subsets are needed for routine genomic predictions or to include on genotyping arrays. Methods of selecting sequence variants were compared using both simulated sequence genotypes and actual data from run ...

  7. Gene and translation initiation site prediction in metagenomic sequences

    SciTech Connect

    Hyatt, Philip Douglas; LoCascio, Philip F; Hauser, Loren John; Uberbacher, Edward C

    2012-01-01

    Gene prediction in metagenomic sequences remains a difficult problem. Current sequencing technologies do not achieve sufficient coverage to assemble the individual genomes in a typical sample; consequently, sequencing runs produce a large number of short sequences whose exact origin is unknown. Since these sequences are usually smaller than the average length of a gene, algorithms must make predictions based on very little data. We present MetaProdigal, a metagenomic version of the gene prediction program Prodigal, that can identify genes in short, anonymous coding sequences with a high degree of accuracy. The novel value of the method consists of enhanced translation initiation site identification, ability to identify sequences that use alternate genetic codes and confidence values for each gene call. We compare the results of MetaProdigal with other methods and conclude with a discussion of future improvements.

  8. The effect of sequencing errors on metagenomic gene prediction.

    PubMed

    Hoff, Katharina J

    2009-11-12

    Gene prediction is an essential step in the annotation of metagenomic sequencing reads. Since most metagenomic reads cannot be assembled into long contigs, specialized statistical gene prediction tools have been developed for short and anonymous DNA fragments, e.g. MetaGeneAnnotator and Orphelia. While conventional gene prediction methods have been subject to a benchmark study on real sequencing reads with typical errors, such a comparison has not yet been conducted for specialized tools. Their gene prediction accuracy was mostly measured on error-free DNA fragments. In this study, Sanger and pyrosequencing reads were simulated on the basis of models that take all types of sequencing errors into account. All metagenomic gene prediction tools showed decreasing accuracy with increasing sequencing error rates. Performance results on an established metagenomic benchmark dataset are also reported. In addition, we demonstrate that ESTScan, a tool for sequencing error compensation in eukaryotic expressed sequence tags, outperforms some metagenomic gene prediction tools on reads with high error rates although it was not designed for the task at hand. This study fills an important gap in metagenomic gene prediction research. Specialized methods are evaluated and compared with respect to sequencing error robustness. Results indicate that the integration of error-compensating methods into metagenomic gene prediction tools would be beneficial to improve metagenome annotation quality.

  9. Predictability affects the perception of audiovisual synchrony in complex sequences.

    PubMed

    Cook, Laura A; Van Valkenburg, David L; Badcock, David R

    2011-10-01

    The ability to make accurate audiovisual synchrony judgments is affected by the "complexity" of the stimuli: We are much better at making judgments when matching single beeps or flashes as opposed to video recordings of speech or music. In the present study, we investigated whether the predictability of sequences affects whether participants report that auditory and visual sequences appear to be temporally coincident. When we reduced their ability to predict both the next pitch in the sequence and the temporal pattern, we found that participants were increasingly likely to report that the audiovisual sequences were synchronous. However, when we manipulated pitch and temporal predictability independently, the same effect did not occur. By altering the temporal density (items per second) of the sequences, we further determined that the predictability effect occurred only in temporally dense sequences: If the sequences were slow, participants' responses did not change as a function of predictability. We propose that reduced predictability affects synchrony judgments by reducing the effective pitch and temporal acuity in perception of the sequences.

  10. Draft Genome Sequence of Gephyronic Acid Producer Cystobacter violaceus Strain Cb vi76

    PubMed Central

    Stevens, D. Cole; Young, Jeanette; Carmichael, Rory; Tan, John

    2014-01-01

    A draft genome sequence of Cystobacter violaceus strain Cb vi76, which produces the eukaryotic protein synthesis inhibitor gephyronic acid, has been obtained. The genome contains numerous predicted secondary metabolite clusters, including the gephyronic acid biosynthetic pathway. This genome will contribute to the investigation of secondary metabolism in other Cystobacter strains. PMID:25502681

  11. Draft Genome Sequence of Cyanobacterium sp. Strain IPPAS B-1200 with a Unique Fatty Acid Composition

    PubMed Central

    Starikov, Alexander Y.; Usserbaeva, Aizhan A.; Sinetova, Maria A.; Sarsekeyeva, Fariza K.; Zayadan, Bolatkhan K.; Ustinova, Vera V.; Kupriyanova, Elena V.; Los, Dmitry A.

    2016-01-01

    Here, we report the draft genome of Cyanobacterium sp. IPPAS strain B-1200, isolated from Lake Balkhash, Kazakhstan, and characterized by the unique fatty acid composition of its membrane lipids, which are enriched with myristic and myristoleic acids. The approximate genome size is 3.4 Mb, and the predicted number of coding sequences is 3,119. PMID:27856596

  12. Functional region prediction with a set of appropriate homologous sequences-an index for sequence selection by integrating structure and sequence information with spatial statistics

    PubMed Central

    2012-01-01

    Background: The detection of conserved residue clusters on a protein structure is one of the effective strategies for the prediction of functional protein regions. Various methods, such as Evolutionary Trace, have been developed based on this strategy. In such approaches, the conserved residues are identified through comparisons of homologous amino acid sequences. Therefore, the selection of homologous sequences is a critical step. It is empirically known that a certain degree of sequence divergence in the set of homologous sequences is required for the identification of conserved residues. However, the development of a method to select homologous sequences appropriate for the identification of conserved residues has not been sufficiently addressed. An objective and general method to select appropriate homologous sequences is desired for the efficient prediction of functional regions. Results: We have developed a novel index to select the sequences appropriate for the identification of conserved residues, and implemented the index within our method to predict the functional regions of a protein. The implementation of the index improved the performance of the functional region prediction. The index represents the degree of conserved residue clustering on the tertiary structure of the protein. For this purpose, the structure and sequence information were integrated within the index by the application of spatial statistics. Spatial statistics is a field of statistics in which not only the attributes but also the geometrical coordinates of the data are considered simultaneously. Higher degrees of clustering generate larger index scores. We adopted the set of homologous sequences with the highest index score, under the assumption that the best prediction accuracy is obtained when the degree of clustering is the maximum. The set of sequences selected by the index led to higher functional region prediction performance than the sets of sequences selected by other sequence

  13. Functional region prediction with a set of appropriate homologous sequences--an index for sequence selection by integrating structure and sequence information with spatial statistics.

    PubMed

    Nemoto, Wataru; Toh, Hiroyuki

    2012-05-29

    The detection of conserved residue clusters on a protein structure is one of the effective strategies for the prediction of functional protein regions. Various methods, such as Evolutionary Trace, have been developed based on this strategy. In such approaches, the conserved residues are identified through comparisons of homologous amino acid sequences. Therefore, the selection of homologous sequences is a critical step. It is empirically known that a certain degree of sequence divergence in the set of homologous sequences is required for the identification of conserved residues. However, the development of a method to select homologous sequences appropriate for the identification of conserved residues has not been sufficiently addressed. An objective and general method to select appropriate homologous sequences is desired for the efficient prediction of functional regions. We have developed a novel index to select the sequences appropriate for the identification of conserved residues, and implemented the index within our method to predict the functional regions of a protein. The implementation of the index improved the performance of the functional region prediction. The index represents the degree of conserved residue clustering on the tertiary structure of the protein. For this purpose, the structure and sequence information were integrated within the index by the application of spatial statistics. Spatial statistics is a field of statistics in which not only the attributes but also the geometrical coordinates of the data are considered simultaneously. Higher degrees of clustering generate larger index scores. We adopted the set of homologous sequences with the highest index score, under the assumption that the best prediction accuracy is obtained when the degree of clustering is the maximum. The set of sequences selected by the index led to higher functional region prediction performance than the sets of sequences selected by other sequence-based methods

  14. Prediction of glycolipid-binding domains from the amino acid sequence of lipid raft-associated proteins: application to HpaA, a protein involved in the adhesion of Helicobacter pylori to gastrointestinal cells.

    PubMed

    Fantini, Jacques; Garmy, Nicolas; Yahi, Nouara

    2006-09-12

    Protein-glycolipid interactions mediate the attachment of various pathogens to the host cell surface as well as the association of numerous cellular proteins with lipid rafts. Thus, it is of primary importance to identify the protein domains involved in glycolipid recognition. Using structure similarity searches, we could identify a common glycolipid-binding domain in the three-dimensional structure of several proteins known to interact with lipid rafts. Yet the three-dimensional structure of most raft-targeted proteins is still unknown. In the present study, we have identified a glycolipid-binding domain in the amino acid sequence of a bacterial adhesin (Helicobacter pylori adhesin A, HpaA). The prediction was based on the major properties of the glycolipid-binding domains previously characterized by structural searches. A short (15-mer) synthetic peptide corresponding to this putative glycolipid-binding domain was synthesized, and we studied its interaction with glycolipid monolayers at the air-water interface. The synthetic HpaA peptide recognized LacCer but not Gb3. This glycolipid specificity was in line with that of the whole bacterium. Molecular modeling studies gave some insights into this high selectivity of interaction. It also suggested that Phe147 in HpaA played a key role in LacCer recognition, through sugar-aromatic CH-pi stacking interactions with the hydrophobic side of the galactose ring of LacCer. Correspondingly, the replacement of Phe147 with Ala strongly affected LacCer recognition, whereas substitution with Trp did not. Our method could be used to identify glycolipid-binding domains in microbial and cellular proteins interacting with lipid shells, rafts, and other specialized membrane microdomains.

  15. Methods and compositions for efficient nucleic acid sequencing

    DOEpatents

    Drmanac, Radoje

    2002-01-01

    Disclosed are novel methods and compositions for rapid and highly efficient nucleic acid sequencing based upon hybridization with two sets of small oligonucleotide probes of known sequences. Extremely large nucleic acid molecules, including chromosomes and non-amplified RNA, may be sequenced without prior cloning or subcloning steps. The methods of the invention also solve various current problems associated with sequencing technology such as, for example, high noise to signal ratios and difficult discrimination, attaching many nucleic acid fragments to a surface, preparing many, longer or more complex probes and labelling more species.

  16. Methods and compositions for efficient nucleic acid sequencing

    DOEpatents

    Drmanac, Radoje

    2006-07-04

    Disclosed are novel methods and compositions for rapid and highly efficient nucleic acid sequencing based upon hybridization with two sets of small oligonucleotide probes of known sequences. Extremely large nucleic acid molecules, including chromosomes and non-amplified RNA, may be sequenced without prior cloning or subcloning steps. The methods of the invention also solve various current problems associated with sequencing technology such as, for example, high noise to signal ratios and difficult discrimination, attaching many nucleic acid fragments to a surface, preparing many, longer or more complex probes and labelling more species.

  17. Kit for detecting nucleic acid sequences using competitive hybridization probes

    DOEpatents

    Lucas, Joe N.; Straume, Tore; Bogen, Kenneth T.

    2001-01-01

    A kit is provided for detecting a target nucleic acid sequence in a sample, the kit comprising: a first hybridization probe which includes a nucleic acid sequence that is sufficiently complementary to selectively hybridize to a first portion of the target sequence, the first hybridization probe including a first complexing agent for forming a binding pair with a second complexing agent; and a second hybridization probe which includes a nucleic acid sequence that is sufficiently complementary to selectively hybridize to a second portion of the target sequence to which the first hybridization probe does not selectively hybridize, the second hybridization probe including a detectable marker; a third hybridization probe which includes a nucleic acid sequence that is sufficiently complementary to selectively hybridize to a first portion of the target sequence, the third hybridization probe including the same detectable marker as the second hybridization probe; and a fourth hybridization probe which includes a nucleic acid sequence that is sufficiently complementary to selectively hybridize to a second portion of the target sequence to which the third hybridization probe does not selectively hybridize, the fourth hybridization probe including the first complexing agent for forming a binding pair with the second complexing agent; wherein the first and second hybridization probes are capable of simultaneously hybridizing to the target sequence and the third and fourth hybridization probes are capable of simultaneously hybridizing to the target sequence, the detectable marker is not present on the first or fourth hybridization probes and the first, second, third, and fourth hybridization probes each include a competitive nucleic acid sequence which is sufficiently complementary to a third portion of the target sequence that the competitive sequences of the first, second, third, and fourth hybridization probes compete with each other to hybridize to the third portion of the

  18. The amino acid sequence of wood duck lysozyme.

    PubMed

    Araki, T; Torikata, T

    1999-01-01

    The amino acid sequence of wood duck (Aix sponsa) lysozyme was analyzed. Carboxymethylated lysozyme was digested with trypsin and the resulting peptides were sequenced. The established amino acid sequence had the highest similarity to duck III lysozyme with four amino acid substitutions, and had eighteen amino acid substitutions from chicken lysozyme. The valine at position 75 was newly detected in chicken-type lysozymes. In the active site, Tyr34 and Glu57 were found at subsites F and D, respectively, when compared with chicken lysozyme.

  19. Solid phase sequencing of double-stranded nucleic acids

    DOEpatents

    Fu, Dong-Jing; Cantor, Charles R.; Koster, Hubert; Smith, Cassandra L.

    2002-01-01

    This invention relates to methods for detecting and sequencing target double-stranded nucleic acid sequences, to nucleic acid probes and arrays of probes useful in these methods, and to kits and systems which contain these probes. Useful methods involve hybridizing the nucleic acids or nucleic acids which represent complementary or homologous sequences of the target to an array of nucleic acid probes. These probes comprise a single-stranded portion, an optional double-stranded portion and a variable sequence within the single-stranded portion. The molecular weights of the hybridized nucleic acids of the set can be determined by mass spectroscopy, and the sequence of the target determined from the molecular weights of the fragments. Nucleic acids whose sequences can be determined include nucleic acids in biological samples such as patient biopsies and environmental samples. Probes may be fixed to a solid support such as a hybridization chip to facilitate automated determination of molecular weights and identification of the target sequence.

  20. Selection of sequence variants to improve dairy cattle genomic predictions

    USDA-ARS's Scientific Manuscript database

    Genomic prediction reliabilities improved when adding selected sequence variants from run 5 of the 1,000 bull genomes project. High density (HD) imputed genotypes for 26,970 progeny tested Holstein bulls were combined with sequence variants for 444 Holstein animals. The first test included 481,904 c...

  1. The complete amino acid sequence of yeast phosphoglycerate kinase.

    PubMed Central

    Perkins, R E; Conroy, S C; Dunbar, B; Fothergill, L A; Tuite, M F; Dobson, M J; Kingsman, S M; Kingsman, A J

    1983-01-01

    The complete amino acid sequence of yeast phosphoglycerate kinase, comprising 415 residues, was determined. The sequence of residues 1-173 was deduced mainly from nucleotide sequence analysis of a series of overlapping fragments derived from the relevant portion of a 2.95-kilobase endonuclease-HindIII-digest fragment containing the yeast phosphoglycerate kinase gene. The sequence of residues 174-415 was deduced mainly from amino acid sequence analysis of three CNBr-cleavage fragments, and from peptides derived from these fragments after digestion by a number of proteolytic enzymes. Cleavage at the two tryptophan residues with o-iodosobenzoic acid was also used to isolate fragments suitable for amino acid sequence analysis. Determination of the complete sequence now allows a detailed interpretation of the existing high-resolution X-ray-crystallographic structure. The sequence -Ile-Ile-Gly-Gly-Gly- occurs twice in distant parts of the linear sequence (residues 232-236 and 367-371). Both these regions contribute to the nucleoside phosphate-binding site. A comparison of the sequence of yeast phosphoglycerate kinase reported here with the sequences of phosphoglycerate kinase from horse muscle and human erythrocytes shows that the yeast enzyme is 64% identical with the mammalian enzymes. The yeast has strikingly fewer methionine, cysteine and tryptophan residues. PMID:6347186

  2. Soil amino acid composition across a boreal forest successional sequence

    Treesearch

    Nancy R. Werdin-Pfisterer; Knut Kielland; Richard D. Boone

    2009-01-01

    Soil amino acids are important sources of organic nitrogen for plant nutrition, yet few studies have examined which amino acids are most prevalent in the soil. In this study, we examined the composition, concentration, and seasonal patterns of soil amino acids across a primary successional sequence encompassing a natural gradient of plant productivity and soil...

  3. Analysis of cloned cDNA and genomic sequences for phytochrome: complete amino acid sequences for two gene products expressed in etiolated Avena.

    PubMed Central

    Hershey, H P; Barker, R F; Idler, K B; Lissemore, J L; Quail, P H

    1985-01-01

    Cloned cDNA and genomic sequences have been analyzed to deduce the amino acid sequence of phytochrome from etiolated Avena. Restriction endonuclease site polymorphism between clones indicates that at least four phytochrome genes are expressed in this tissue. Sequence analysis of two complete and one partial coding region shows approximately 98% homology at both the nucleotide and amino acid levels, with the majority of amino acid changes being conservative. High sequence homology is also found in the 5'-untranslated region but significant divergence occurs in the 3'-untranslated region. The phytochrome polypeptides are 1128 amino acid residues long corresponding to a molecular mass of 125 kdaltons. The known protein sequence at the chromophore attachment site occurs only once in the polypeptide, establishing that phytochrome has a single chromophore per monomer covalently linked to Cys-321. Computer analyses of the amino acid sequences have provided predictions regarding a number of structural features of the phytochrome molecule. PMID:3001642
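
    The stated correspondence between chain length and molecular mass can be checked with quick arithmetic. The sketch below assumes the common rule-of-thumb average of about 110 Da per amino acid residue; that constant and the function name are illustrative assumptions, not values taken from the abstract.

```python
# Rule-of-thumb average residue mass in daltons (an assumption, not from the abstract).
AVG_RESIDUE_MASS_DA = 110.0


def estimate_mass_kda(n_residues: int) -> float:
    """Rough polypeptide mass in kilodaltons from its residue count."""
    return n_residues * AVG_RESIDUE_MASS_DA / 1000.0


if __name__ == "__main__":
    # 1128 residues is the phytochrome polypeptide length reported in the abstract.
    print(f"{estimate_mass_kda(1128):.1f} kDa")
```

    Under that assumption the estimate comes out near 124 kDa, in line with the reported 125 kdaltons.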

  4. An Integrated Sequence-Structure Database incorporating matching mRNA sequence, amino acid sequence and protein three-dimensional structure data.

    PubMed Central

    Adzhubei, I A; Adzhubei, A A; Neidle, S

    1998-01-01

    We have constructed a non-homologous database, termed the Integrated Sequence-Structure Database (ISSD) which comprises the coding sequences of genes, amino acid sequences of the corresponding proteins, their secondary structure and straight phi,psi angles assignments, and polypeptide backbone coordinates. Each protein entry in the database holds the alignment of nucleotide sequence, amino acid sequence and the PDB three-dimensional structure data. The nucleotide and amino acid sequences for each entry are selected on the basis of exact matches of the source organism and cell environment. The current version 1.0 of ISSD is available on the WWW at http://www.protein.bio.msu.su/issd/ and includes 107 non-homologous mammalian proteins, of which 80 are human proteins. The database has been used by us for the analysis of synonymous codon usage patterns in mRNA sequences showing their correlation with the three-dimensional structure features in the encoded proteins. Possible ISSD applications include optimisation of protein expression, improvement of the protein structure prediction accuracy, and analysis of evolutionary aspects of the nucleotide sequence-protein structure relationship. PMID:9399866

  5. A graphic representation of protein sequence and predicting the subcellular locations of prokaryotic proteins.

    PubMed

    Feng, Zhi-Ping; Zhang, Chun-Ting

    2002-03-01

    The Zp curve, a three-dimensional space curve representation of protein primary sequence based on the hydrophobicity and charge properties of amino acid residues along the primary sequence, is proposed. Relying on the Zp parameters extracted from the three components of the Zp curve and the Bayes discriminant algorithm, the subcellular locations of prokaryotic proteins were predicted. An accuracy of 81.5% in the cross-validation test was achieved using 13 parameters extracted from the curve for the database of 997 prokaryotic proteins. The result is slightly better than that of the neural network method (80.9%) based on the amino acid composition for the same database. By combining the amino acid composition and the Zp parameters, an overall predictive accuracy of 89.6% can be achieved. It is about 3% higher than that of the Bayes discriminant algorithm based merely on the amino acid composition for the same database. The prediction is also performed with a larger dataset derived from version 39 of the SWISS-PROT databank and two datasets with different sequence similarity. Even for the dataset without sequence similarity, the improvement can reach 4.4% in the cross-validation test. The results indicate that the Zp parameters are effective in representing the information within a protein primary sequence. The method of extracting information from the primary structure may be useful for other areas of protein studies.

  6. The influence of visual training on predicting complex action sequences.

    PubMed

    Cross, Emily S; Stadler, Waltraud; Parkinson, Jim; Schütz-Bosbach, Simone; Prinz, Wolfgang

    2013-02-01

    Linking observed and executable actions appears to be achieved by an action observation network (AON), comprising parietal, premotor, and occipitotemporal cortical regions of the human brain. AON engagement during action observation is thought to aid in effortless, efficient prediction of ongoing movements to support action understanding. Here, we investigate how the AON responds when observing and predicting actions we cannot readily reproduce before and after visual training. During pre- and posttraining neuroimaging sessions, participants watched gymnasts and wind-up toys moving behind an occluder and pressed a button when they expected each agent to reappear. Between scanning sessions, participants visually trained to predict when a subset of stimuli would reappear. Posttraining scanning revealed activation of inferior parietal, superior temporal, and cerebellar cortices when predicting occluded actions compared to perceiving them. Greater activity emerged when predicting untrained compared to trained sequences in occipitotemporal cortices and to a lesser degree, premotor cortices. The occipitotemporal responses when predicting untrained agents showed further specialization, with greater responses within body-processing regions when predicting gymnasts' movements and in object-selective cortex when predicting toys' movements. The results suggest that (1) select portions of the AON are recruited to predict the complex movements not easily mapped onto the observer's body and (2) greater recruitment of these AON regions supports prediction of less familiar sequences. We suggest that the findings inform both the premotor model of action prediction and the predictive coding account of AON function.

  7. Can computationally designed protein sequences improve secondary structure prediction?

    PubMed

    Bondugula, Rajkumar; Wallqvist, Anders; Lee, Michael S

    2011-05-01

    Computational sequence design methods are used to engineer proteins with desired properties such as increased thermal stability and novel function. In addition, these algorithms can be used to identify an envelope of sequences that may be compatible with a particular protein fold topology. In this regard, we hypothesized that sequence-property prediction, specifically secondary structure, could be significantly enhanced by using a large database of computationally designed sequences. We performed a large-scale test of this hypothesis with 6511 diverse protein domains and 50 designed sequences per domain. After analysis of the inherent accuracy of the designed sequences database, we realized that it was necessary to put constraints on what fraction of the native sequence should be allowed to change. With mutational constraints, accuracy was improved vs. no constraints, but the diversity of designed sequences, and hence effective size of the database, was moderately reduced. Overall, the best three-state prediction accuracy (Q(3)) that we achieved was nearly a percentage point improved over using a natural sequence database alone, well below the theoretical possibility for improvement of 8-10 percentage points. Furthermore, our nascent method was used to augment the state-of-the-art PSIPRED program by a percentage point.
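
    For readers unfamiliar with the Q(3) measure mentioned above, it is conventionally the fraction of residues whose predicted three-state label (helix H, strand E, coil C) matches the observed label. The following is a minimal sketch under that standard definition; the function name and the example strings are hypothetical and not taken from the study.

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Fraction of residues whose predicted state (H, E or C) matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("state strings must have equal length")
    if not observed:
        raise ValueError("state strings must not be empty")
    # Count positions where the predicted and observed states agree.
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


if __name__ == "__main__":
    # Made-up nine-residue example with one disagreement.
    print(q3_accuracy("HHHEEECCC", "HHHEECCCC"))
```

    A one-percentage-point gain in Q(3), as reported, means roughly one more residue in every hundred is assigned the correct state.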

  8. Secondary Structure Predictions for Long RNA Sequences Based on Inversion Excursions and MapReduce.

    PubMed

    Yehdego, Daniel T; Zhang, Boyu; Kodimala, Vikram K R; Johnson, Kyle L; Taufer, Michela; Leung, Ming-Ying

    2013-05-01

    Secondary structures of ribonucleic acid (RNA) molecules play important roles in many biological processes including gene expression and regulation. Experimental observations and computing limitations suggest that we can approach the secondary structure prediction problem for long RNA sequences by segmenting them into shorter chunks, predicting the secondary structures of each chunk individually using existing prediction programs, and then assembling the results to give the structure of the original sequence. The selection of cutting points is a crucial component of the segmenting step. Noting that stem-loops and pseudoknots always contain an inversion, i.e., a stretch of nucleotides followed closely by its inverse complementary sequence, we developed two cutting methods for segmenting long RNA sequences based on inversion excursions: the centered and optimized method. Each step of searching for inversions, chunking, and predictions can be performed in parallel. In this paper we use a MapReduce framework, i.e., Hadoop, to extensively explore meaningful inversion stem lengths and gap sizes for the segmentation and identify correlations between chunking methods and prediction accuracy. We show that for a set of long RNA sequences in the RFAM database, whose secondary structures are known to contain pseudoknots, our approach predicts secondary structures more accurately than methods that do not segment the sequence, when the latter predictions are possible computationally. We also show that, as sequences exceed certain lengths, some programs cannot computationally predict pseudoknots while our chunking methods can. Overall, our predicted structures still retain the accuracy level of the original prediction programs when compared with known experimental secondary structure.
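
    The notion of an inversion used above (a stretch of nucleotides followed closely by its inverse complementary sequence) can be illustrated with a brute-force search. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, only Watson-Crick pairs are considered (G-U wobble pairs are ignored), and the stem length and maximum gap are the two parameters the abstract says were explored.

```python
# Watson-Crick complements only; wobble pairs are deliberately ignored (assumption).
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna: str) -> str:
    """Return the inverse complementary sequence of an RNA string."""
    return rna.translate(COMPLEMENT)[::-1]


def find_inversions(rna: str, stem_len: int, max_gap: int) -> list:
    """Return (start, gap) pairs where a stem of length stem_len is followed,
    after exactly `gap` nucleotides (gap <= max_gap), by its inverse complement."""
    n = len(rna)
    hits = []
    for start in range(n - 2 * stem_len + 1):
        target = reverse_complement(rna[start:start + stem_len])
        for gap in range(max_gap + 1):
            right = start + stem_len + gap
            if right + stem_len > n:
                break  # the right arm would run past the end of the sequence
            if rna[right:right + stem_len] == target:
                hits.append((start, gap))
    return hits


if __name__ == "__main__":
    # Toy hairpin: GGG ... CCC with a short loop in between.
    print(find_inversions("GGGAAAUCCC", stem_len=3, max_gap=4))
```

    Each start position is independent of the others, which is the property that makes the search easy to distribute across MapReduce workers as the abstract describes.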

  9. Amino acid sequence of mouse submaxillary gland renin.

    PubMed Central

    Misono, K S; Chang, J J; Inagami, T

    1982-01-01

    The complete amino acid sequences of the heavy chain and light chain of mouse submaxillary gland renin have been determined. The heavy chain consists of 288 amino acid residues having a Mr of 31,036 calculated from the sequence. The light chain contains 48 amino acid residues with a Mr of 5,458. The sequence of the heavy chain was determined by automated Edman degradations of the cyanogen bromide peptides and tryptic peptides generated after citraconylation, as well as other peptides generated therefrom. The sequence of the light chain was derived from sequence analyses of the peptides generated by cyanogen bromide cleavage or by digestion with Staphylococcus aureus protease. The sequences in the active site regions in renin containing two catalytically essential aspartyl residues 32 and 215 were found identical with those in pepsin, chymosin, and penicillopepsin. Comparison of the amino acid sequence of renin with that of porcine pepsin indicated a 42% sequence identity of the heavy chain with the amino-terminal and middle regions and a 46% identity of the light chain with the carboxyl-terminal region of the porcine pepsin sequence. Residues identical in renin and pepsin are distributed throughout the length of the molecules, suggesting a similarity in their overall structures. PMID:6812055

  10. Bovine testis acylphosphatase: purification and amino acid sequence.

    PubMed

    Pazzagli, L; Cappugi, G; Camici, G; Manao, G; Ramponi, G

    1993-10-01

    Two acylphosphatase molecular forms have been isolated from bovine testis. Their amino acid sequence was determined. One (ACY1) consists of 98 amino acid residues, while the other one (ACY2) consists of 100 amino acid residues. Both molecular forms are N-acetylated and differ only in the amino terminus. ACY2 has an additional Ser-Met tail with respect to ACY1. Both ACY1 and ACY2 are organ-common type isoenzymes and thus differ for about half of the amino acid positions from the previously sequenced bovine muscle isoenzyme.

  11. Learning predictive statistics from temporal sequences: Dynamics and strategies.

    PubMed

    Wang, Rui; Shen, Yuan; Tino, Peter; Welchman, Andrew E; Kourtzi, Zoe

    2017-10-01

    Human behavior is guided by our expectations about the future. Often, we make predictions by monitoring how event sequences unfold, even though such sequences may appear incomprehensible. Event structures in the natural environment typically vary in complexity, from simple repetition to complex probabilistic combinations. How do we learn these structures? Here we investigate the dynamics of structure learning by tracking human responses to temporal sequences that change in structure unbeknownst to the participants. Participants were asked to predict the upcoming item following a probabilistic sequence of symbols. Using a Markov process, we created a family of sequences, from simple frequency statistics (e.g., some symbols are more probable than others) to context-based statistics (e.g., symbol probability is contingent on preceding symbols). We demonstrate the dynamics with which individuals adapt to changes in the environment's statistics; that is, they extract the behaviorally relevant structures to make predictions about upcoming events. Further, we show that this structure learning relates to individual decision strategy; faster learning of complex structures relates to selection of the most probable outcome in a given context (maximizing) rather than matching of the exact sequence statistics. Our findings provide evidence for alternate routes to learning of behaviorally relevant statistics that facilitate our ability to predict future events in variable environments.
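
    The contrast between maximizing and matching can be made concrete with a small simulation. The sketch below assumes a first-order Markov process (each symbol depends only on the one before it) and a made-up two-symbol transition table; neither the table nor the function names come from the study.

```python
import random


def generate_sequence(transitions: dict, start: str, length: int, rng: random.Random) -> list:
    """Sample a symbol sequence from a first-order Markov process."""
    seq = [start]
    for _ in range(length - 1):
        symbols, probs = zip(*transitions[seq[-1]].items())
        seq.append(rng.choices(symbols, weights=probs, k=1)[0])
    return seq


def predict_maximizing(transitions: dict, context: str) -> str:
    """Always choose the most probable next symbol for the given context."""
    options = transitions[context]
    return max(options, key=options.get)


def predict_matching(transitions: dict, context: str, rng: random.Random) -> str:
    """Choose the next symbol in proportion to its probability (probability matching)."""
    symbols, probs = zip(*transitions[context].items())
    return rng.choices(symbols, weights=probs, k=1)[0]


if __name__ == "__main__":
    # Hypothetical context-based statistics: the next symbol depends on the current one.
    table = {"A": {"A": 0.2, "B": 0.8}, "B": {"A": 0.8, "B": 0.2}}
    rng = random.Random(0)
    seq = generate_sequence(table, "A", 1000, rng)
    pairs = list(zip(seq, seq[1:]))
    hit_max = sum(predict_maximizing(table, c) == nxt for c, nxt in pairs)
    hit_match = sum(predict_matching(table, c, rng) == nxt for c, nxt in pairs)
    print(hit_max / len(pairs), hit_match / len(pairs))
```

    With a known transition table, a maximizer is expected to be correct more often than a matcher, since it always picks the likeliest continuation rather than reproducing the observed proportions.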

  12. Sequence-based prediction of DNA-binding residues in proteins with conservation and correlation information.

    PubMed

    Ma, Xin; Guo, Jing; Liu, Hong-De; Xie, Jian-Ming; Sun, Xiao

    2012-01-01

    The recognition of DNA-binding residues in proteins is critical to our understanding of the mechanisms of DNA-protein interactions and gene expression, and for guiding drug design. Therefore, a prediction method, DNABR (DNA Binding Residues), is proposed for predicting DNA-binding residues in protein sequences using the random forest (RF) classifier with sequence-based features. Two types of novel sequence features are proposed in this study, which reflect the information about the conservation of physicochemical properties of the amino acids, and the correlation of amino acids between different sequence positions in terms of physicochemical properties. The first type of feature uses the evolutionary information combined with the conservation of physicochemical properties of the amino acids, while the second reflects the dependency effect of amino acids with regard to polarity charge and hydrophobic properties in the protein sequences. Those two features and an orthogonal binary vector which reflects the characteristics of the 20 types of amino acids are used to build DNABR, a model to predict DNA-binding residues in proteins. The DNABR model achieves a value of 0.6586 for the Matthews correlation coefficient (MCC) and 93.04 percent overall accuracy (ACC), with 68.47 percent sensitivity (SE) and 98.16 percent specificity (SP). The comparisons with each feature demonstrate that these two novel features contribute most to the improvement in predictive ability. Furthermore, performance comparisons with other approaches clearly show that DNABR has an excellent prediction performance for detecting binding residues in putative DNA-binding proteins. The DNABR web-server system is freely available at http://www.cbi.seu.edu.cn/DNABR/.

  13. Delineation of modular proteins: domain boundary prediction from sequence information.

    PubMed

    Kong, Lesheng; Ranganathan, Shoba

    2004-06-01

    The delineation of domain boundaries of a given sequence in the absence of known 3D structures or detectable sequence homology to known domains benefits many areas in protein science, such as protein engineering, protein 3D structure determination and protein structure prediction. With the exponential growth of newly determined sequences, our ability to predict domain boundaries rapidly and accurately from sequence information alone is both essential and critical from the viewpoint of gene function annotation. Anyone attempting to predict domain boundaries for a single protein sequence is invariably confronted with a plethora of databases that contain boundary information available from the internet and a variety of methods for domain boundary prediction. How are these derived and how well do they work? What definition of 'domain' do they use? We will first clarify the different definitions of protein domains, and then describe the available public databases with domain boundary information. Finally, we will review existing domain boundary prediction methods and discuss their strengths and weaknesses.

  14. Amino Acid Sequence of Human Cholinesterase

    DTIC Science & Technology

    1985-10-01

    liquid chromatography (HPLC). Activity testing of the aged, DFP-labeled cholinesterase showed that 99.8% of the active sites had been labeled, since...acids were quantitated by ninhydrin at the AAA Labs, or by derivatization with phenylisothiocyanate at the University of Michigan. The latter method

  15. Prediction of fine-tuned promoter activity from DNA sequence

    PubMed Central

    Siwo, Geoffrey; Rider, Andrew; Tan, Asako; Pinapati, Richard; Emrich, Scott; Chawla, Nitesh; Ferdig, Michael

    2016-01-01

    The quantitative prediction of transcriptional activity of genes using promoter sequence is fundamental to the engineering of biological systems for industrial purposes and understanding the natural variation in gene expression. To catalyze the development of new algorithms for this purpose, the Dialogue on Reverse Engineering Assessment and Methods (DREAM) organized a community challenge seeking predictive models of promoter activity given normalized promoter activity data for 90 ribosomal protein promoters driving expression of a fluorescent reporter gene. By developing an unbiased modeling approach that performs an iterative search for predictive DNA sequence features using the frequencies of various k-mers, inferred DNA mechanical properties and spatial positions of promoter sequences, we achieved the best performer status in this challenge. The specific predictive features used in the model included the frequency of the nucleotide G, the length of polymeric tracts of T and TA, the frequencies of 6 distinct trinucleotides and 12 tetranucleotides, and the predicted protein deformability of the DNA sequence. Our method accurately predicted the activity of 20 natural variants of ribosomal protein promoters (Spearman correlation r = 0.73) as compared to 33 laboratory-mutated variants of the promoters (r = 0.57) in a test set that was hidden from participants. Notably, our model differed substantially from the rest in 2 main ways: i) it did not explicitly utilize transcription factor binding information implying that subtle DNA sequence features are highly associated with gene expression, and ii) it was entirely based on features extracted exclusively from the 100 bp region upstream from the translational start site demonstrating that this region encodes much of the overall promoter activity. The findings from this study have important implications for the engineering of predictable gene expression systems and the evolution of gene expression in naturally occurring
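
    Several of the sequence features named in this abstract (single-nucleotide frequency, k-mer frequencies, homopolymer tract length) are simple to compute. The sketch below shows one plausible way to extract them; the function names and the feature keys are hypothetical, and it does not reproduce the authors' feature selection or model.

```python
from collections import Counter
from itertools import groupby


def kmer_frequencies(seq: str, k: int) -> dict:
    """Relative frequency of each overlapping k-mer in seq."""
    total = len(seq) - k + 1
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: count / total for kmer, count in counts.items()}


def longest_run(seq: str, base: str) -> int:
    """Length of the longest uninterrupted tract of a single base."""
    return max((len(list(group)) for b, group in groupby(seq) if b == base), default=0)


def promoter_features(seq: str) -> dict:
    """Assemble a small feature dictionary for one promoter sequence."""
    feats = {
        "freq_G": seq.count("G") / len(seq),
        "longest_T_tract": longest_run(seq, "T"),
    }
    # Trinucleotide frequencies, prefixed so the keys cannot clash with the others.
    feats.update({f"tri_{kmer}": f for kmer, f in kmer_frequencies(seq, 3).items()})
    return feats


if __name__ == "__main__":
    # Made-up sequence fragment, for illustration only.
    print(promoter_features("GGTTTATTTTAGC"))
```

    A feature table built this way, one row per promoter, could then be fed to any regression model and scored by rank correlation against measured activity.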

  16. Prediction of fine-tuned promoter activity from DNA sequence.

    PubMed

    Siwo, Geoffrey; Rider, Andrew; Tan, Asako; Pinapati, Richard; Emrich, Scott; Chawla, Nitesh; Ferdig, Michael

    2016-01-01

    The quantitative prediction of transcriptional activity of genes using promoter sequence is fundamental to the engineering of biological systems for industrial purposes and understanding the natural variation in gene expression. To catalyze the development of new algorithms for this purpose, the Dialogue on Reverse Engineering Assessment and Methods (DREAM) organized a community challenge seeking predictive models of promoter activity given normalized promoter activity data for 90 ribosomal protein promoters driving expression of a fluorescent reporter gene. By developing an unbiased modeling approach that performs an iterative search for predictive DNA sequence features using the frequencies of various k-mers, inferred DNA mechanical properties and spatial positions of promoter sequences, we achieved the best performer status in this challenge. The specific predictive features used in the model included the frequency of the nucleotide G, the length of polymeric tracts of T and TA, the frequencies of 6 distinct trinucleotides and 12 tetranucleotides, and the predicted protein deformability of the DNA sequence. Our method accurately predicted the activity of 20 natural variants of ribosomal protein promoters (Spearman correlation r = 0.73) as compared to 33 laboratory-mutated variants of the promoters (r = 0.57) in a test set that was hidden from participants. Notably, our model differed substantially from the rest in 2 main ways: i) it did not explicitly utilize transcription factor binding information implying that subtle DNA sequence features are highly associated with gene expression, and ii) it was entirely based on features extracted exclusively from the 100 bp region upstream from the translational start site demonstrating that this region encodes much of the overall promoter activity. The findings from this study have important implications for the engineering of predictable gene expression systems and the evolution of gene expression in naturally occurring
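
    The sequence features named in this abstract (k-mer frequencies and the length of poly-T and poly-TA tracts) can be illustrated with a minimal Python sketch. The function names and the example sequence are hypothetical, and nothing here reproduces the authors' iterative feature search or their model; it only shows how such features might be computed from a promoter string.

```python
from collections import Counter


def kmer_frequencies(seq, k):
    """Relative frequency of every overlapping k-mer observed in seq."""
    seq = seq.upper()
    total = len(seq) - k + 1
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: n / total for kmer, n in counts.items()}


def longest_tract(seq, unit):
    """Length in bases of the longest uninterrupted repeat of unit (e.g. 'T' or 'TA')."""
    seq, unit = seq.upper(), unit.upper()
    best = 0
    for start in range(len(seq)):
        run = 0
        # Extend the tract one repeat unit at a time from this start position.
        while seq.startswith(unit, start + run):
            run += len(unit)
        best = max(best, run)
    return best


# Hypothetical toy sequence, not taken from the study.
promoter = "GGTTTTATATAC"
g_frequency = kmer_frequencies(promoter, 1).get("G", 0.0)
trinucleotides = kmer_frequencies(promoter, 3)
poly_t = longest_tract(promoter, "T")
poly_ta = longest_tract(promoter, "TA")
```

    In a setting like the one described, such values would be computed over the 100 bp upstream region of each promoter and passed to a regression model; the choice of model is left open here.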

  17. Learned spatiotemporal sequence recognition and prediction in primary visual cortex

    PubMed Central

    Gavornik, Jeffrey P.; Bear, Mark F.

    2014-01-01

    Learning to recognize and predict temporal sequences is fundamental to sensory perception, and is impaired in several neuropsychiatric disorders, but little is known about where and how this occurs in the brain. We discovered that repeated presentations of a visual sequence over a course of days cause evoked response potentiation in mouse V1 that is highly specific for stimulus order and timing. Remarkably, after V1 is trained to recognize a sequence, cortical activity regenerates the full sequence even when individual stimulus elements are omitted. This novel neurophysiological report of sequence learning advances the understanding of how the brain makes “intelligent guesses” based on limited information to form visual percepts and suggests that it is possible to study the mechanistic basis of this high-level cognitive ability by studying low-level sensory systems. PMID:24657967

  18. Cystatin. Amino acid sequence and possible secondary structure.

    PubMed Central

    Schwabe, C; Anastasi, A; Crow, H; McDonald, J K; Barrett, A J

    1984-01-01

    The amino acid sequence of cystatin, the protein from chicken egg-white that is a tight-binding inhibitor of many cysteine proteinases, is reported. Cystatin is composed of 116 amino acid residues, and the Mr is calculated to be 13 143. No striking similarity to any other known sequence has been detected. The results of computer analysis of the sequence and c.d. spectrometry indicate that the secondary structure includes relatively little alpha-helix (about 20%) and that the remainder is mainly beta-structure. PMID:6712597

  19. Prediction of carbohydrate-binding proteins from sequences using support vector machines.

    PubMed

    Someya, Seizi; Kakuta, Masanori; Morita, Mizuki; Sumikoshi, Kazuya; Cao, Wei; Ge, Zhenyi; Hirose, Osamu; Nakamura, Shugo; Terada, Tohru; Shimizu, Kentaro

    2010-01-01

    Carbohydrate-binding proteins are proteins that can interact with sugar chains but do not modify them. They are involved in many physiological functions, and we have developed a method for predicting them from their amino acid sequences. Our method is based on support vector machines (SVMs). We first clarified the definition of carbohydrate-binding proteins and then constructed positive and negative datasets with which the SVMs were trained. In a leave-one-out test on these datasets, our method achieved an area under the receiver operating characteristic (ROC) curve of 0.92. We also examined two amino acid grouping methods that enable effective learning of sequence patterns and evaluated the performance of these methods. When we applied our method in combination with the homology-based prediction method to the annotated human genome database, H-invDB, we found that the true positive rate of prediction was improved.

  20. Mouse Vk gene classification by nucleic acid sequence similarity.

    PubMed

    Strohal, R; Helmberg, A; Kroemer, G; Kofler, R

    1989-01-01

    Analyses of immunoglobulin (Ig) variable (V) region gene usage in the immune response, estimates of V gene germline complexity, and other nucleic acid hybridization-based studies depend on the extent to which such genes are related (i.e., sequence similarity) and their organization in gene families. While mouse Igh heavy chain V region (VH) gene families are relatively well-established, a corresponding systematic classification of Igk light chain V region (Vk) genes has not been reported. The present analysis, in the course of which we reviewed the known extent of the Vk germline gene repertoire and Vk gene usage in a variety of responses to foreign and self antigens, provides a classification of mouse Vk genes in gene families composed of members with greater than 80% overall nucleic acid sequence similarity. This classification differed in several aspects from that of VH genes: only some Vk gene families were as clearly separated (by greater than 25% sequence dissimilarity) as typical VH gene families; most Vk gene families were closely related and, in several instances, members from different families were very similar (greater than 80%) over large sequence portions; frequently, classification by nucleic acid sequence similarity diverged from existing classifications based on amino-terminal protein sequence similarity. Our data have implications for Vk gene analyses by nucleic acid hybridization and describe potentially important differences in sequence organization between VH and Vk genes.

  1. Prediction and prioritization of neoantigens: integration of RNA sequencing data with whole-exome sequencing.

    PubMed

    Karasaki, Takahiro; Nagayama, Kazuhiro; Kuwano, Hideki; Nitadori, Jun-Ichi; Sato, Masaaki; Anraku, Masaki; Hosoi, Akihiro; Matsushita, Hirokazu; Takazawa, Masaki; Ohara, Osamu; Nakajima, Jun; Kakimi, Kazuhiro

    2017-02-01

    The importance of neoantigens for cancer immunity is now well-acknowledged. However, there are diverse strategies for predicting and prioritizing candidate neoantigens, and thus reported neoantigen loads vary a great deal. To clarify this issue, we compared the numbers of neoantigen candidates predicted by four currently utilized strategies. Whole-exome sequencing and RNA sequencing (RNA-Seq) of four non-small-cell lung cancer patients was carried out. We identified 361 somatic missense mutations from which 224 candidate neoantigens were predicted using MHC class I binding affinity prediction software (strategy I). Of these, 207 exceeded the set threshold of gene expression (fragments per kilobase of transcript per million fragments mapped ≥1), resulting in 124 candidate neoantigens (strategy II). To verify mutant mRNA expression, sequencing of amplicons from tumor cDNA including each mutation was undertaken; 204 of the 207 mutations were successfully sequenced, yielding 121 mutant mRNA sequences, resulting in 75 candidate neoantigens (strategy III). Sequence information was extracted from RNA-Seq to confirm the presence of mutated mRNA. Variant allele frequencies ≥0.04 in RNA-Seq were found for 117 of the 207 mutations and regarded as expressed in the tumor, and finally, 72 candidate neoantigens were predicted (strategy IV). Without additional amplicon sequencing of cDNA, strategy IV was comparable to strategy III. We therefore propose strategy IV as a practical and appropriate strategy to predict candidate neoantigens fully utilizing currently available information. It is of note that different neoantigen loads were deduced from the same tumors depending on the strategies applied.

  2. Predicting promoter activities of primary human DNA sequences

    PubMed Central

    Irie, Takuma; Park, Sung-Joon; Yamashita, Riu; Seki, Masahide; Yada, Tetsushi; Sugano, Sumio; Nakai, Kenta; Suzuki, Yutaka

    2011-01-01

    We developed a computer program that can predict the intrinsic promoter activities of primary human DNA sequences. We observed promoter activity using a quantitative luciferase assay and generated a prediction model using multiple linear regression. Our program achieved a prediction accuracy correlation coefficient of 0.87 between the predicted and observed promoter activities. We evaluated the prediction accuracy of the program using massive sequencing analysis of transcriptional start sites in vivo. We found that it is still difficult to predict transcript levels in a strictly quantitative manner in vivo; however, it was possible to select active promoters in a given cell from the other silent promoters. Using this program, we analyzed the transcriptional landscape of the entire human genome. We demonstrate that many human genomic regions have potential promoter activity, and the expression of some previously uncharacterized putatively non-protein-coding transcripts can be explained by our prediction model. Furthermore, we found that nucleosomes occasionally formed open chromatin structures with RNA polymerase II recruitment where the program predicted significant promoter activities, although no transcripts were observed. PMID:21486745

  3. Amino acid sequence of toxin III from Anemonia sulcata.

    PubMed

    Bĕress, L; Wunderer, G; Wachter, E

    1977-08-01

    Toxin III, the smallest toxin component of the poison of the sea anemone Anemonia sulcata, is a polypeptide with 27 amino acids. Its structure is stabilized by three disulfide bridges. The amino acid sequence was determined by solid-phase Edman degradation of the aminoethylated derivative. The peptide was coupled to the carrier, porous glass, by thiourea bridges between the alpha-amino group of arginine-1 and the epsilon-amino group of lysine-26 and the isothiocyanate groups of the carrier. Another fraction of the polypeptide was bound by an acid-amide condensation of the C-terminal valine-27 with the aminopropyl group of the carrier. The sequence of toxin III has no regions homologous to the 47-residue toxin II. Comparison with the known partial sequence of toxin I, which contains 46 amino acids (Wunderer, G. & Eulitz, M., in preparation) also fails to reveal homologies.

  4. Amino acid sequence of fibrolase, a direct-acting fibrinolytic enzyme from Agkistrodon contortrix contortrix venom.

    PubMed Central

    Randolph, A.; Chamberlain, S. H.; Chu, H. L.; Retzios, A. D.; Markland, F. S.; Masiarz, F. R.

    1992-01-01

    The complete amino acid sequence of fibrolase, a fibrinolytic enzyme from southern copperhead (Agkistrodon contortrix contortrix) venom, has been determined. This is the first report of the sequence of a direct-acting, nonhemorrhagic fibrinolytic enzyme found in snake venom. The majority of the sequence was established by automated Edman degradation of overlapping peptides generated by a variety of selective cleavage procedures. The amino-terminus is blocked by a cyclized glutamine (pyroglutamic acid) residue, and the sequence of this region of the molecule was determined by mass spectrometry. Fibrolase is composed of 203 residues in a single polypeptide chain with a molecular weight of 22,891, as determined by the sequence. Its sequence is homologous to the sequence of the hemorrhagic toxin Ht-d of Crotalus atrox venom and with the sequences of two metalloproteinases from Trimeresurus flavoviridis venom. Microheterogeneity in the sequence was found at both the amino-terminus and at residues 189 and 192. All six cysteine residues in fibrolase are involved in disulfide bonds. A disulfide bond between cysteine-118 and cysteine-198 has been established and bonds between cysteines-158/165 and between cysteines-160/192 are inferred from the homology to Ht-d. Secondary structure prediction reveals a very low percentage of alpha-helix (4%), but much greater beta-structure (39.5%). Analysis of the sequence reveals the absence of asparagine-linked glycosylation sites defined by the consensus sequence: asparagine-X-serine/threonine. PMID:1304358
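
    The consensus check mentioned at the end of this abstract (asparagine-X-serine/threonine) amounts to a simple pattern scan. The sketch below is illustrative only: the function name and the test peptide are hypothetical, and X is taken to be any residue, as the abstract states it. Many tools additionally exclude proline at X, which is not applied here.

```python
import re

# N-X-S/T consensus. The lookahead lets overlapping sites all be reported.
SEQUON = re.compile(r"(?=N[A-Z][ST])")


def find_sequons(protein):
    """Return 1-based positions of asparagines that start an N-X-S/T motif."""
    return [match.start() + 1 for match in SEQUON.finditer(protein.upper())]


# Hypothetical peptide, not the fibrolase sequence.
sites = find_sequons("MNGSANNTTQ")
```

    An empty result for a full-length sequence would correspond to the absence of asparagine-linked glycosylation sites reported above.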

  5. Draft Genome Sequence of Bacillus coagulans NL01, a Wonderful l-Lactic Acid Producer

    PubMed Central

    Zheng, Zhaojuan; Jiang, Ting; Lin, Xi; Zhou, Jie

    2015-01-01

    Here, we report the draft genome sequence of Bacillus coagulans NL01, which could produce high optically pure l-lactic acid using xylose as a sole carbon source. The draft genome is 3,505,081 bp, with 144 contigs. About 3,903 protein-coding genes and 92 rRNAs are predicted from this assembly. PMID:26089419

  6. Shark myelin basic protein: amino acid sequence, secondary structure, and self-association.

    PubMed

    Milne, T J; Atkins, A R; Warren, J A; Auton, W P; Smith, R

    1990-09-01

    Myelin basic protein (MBP) from the Whaler shark (Carcharhinus obscurus) has been purified from acid extracts of a chloroform/methanol pellet from whole brains. The amino acid sequence of the majority of the protein has been determined and compared with the sequences of other MBPs. The shark protein has only 44% homology with the bovine protein, but, in common with other MBPs, it has basic residues distributed throughout the sequence and no extensive segments that are predicted to have an ordered secondary structure in solution. Shark MBP lacks the triproline sequence previously postulated to form a hairpin bend in the molecule. The region containing the putative consensus sequence for encephalitogenicity in the guinea pig contains several substitutions, thus accounting for the lack of activity of the shark protein. Studies of the secondary structure and self-association have shown that shark MBP possesses solution properties similar to those of the bovine protein, despite the extensive differences in primary structure.

  7. A Fast Algorithm for Exonic Regions Prediction in DNA Sequences

    PubMed Central

    Saberkari, Hamidreza; Shamsi, Mousa; Heravi, Hamed; Sedaaghi, Mohammad Hossein

    2013-01-01

    The main purpose of this paper is to introduce a fast method for gene prediction in DNA sequences based on the period-3 property in exons. First, the symbolic DNA sequences were converted to a digital signal using the electron ion interaction potential method. Then, to reduce the effect of background noise in the period-3 spectrum, we applied the discrete wavelet transform at three levels to the input digital signal. Finally, the Goertzel algorithm was used to extract period-3 components in the filtered DNA sequence. The proposed algorithm decreases the computational complexity and hence increases the speed of the process. Accurate detection of small exons in DNA sequences is another advantage of the algorithm. The ability of the proposed algorithm in exon prediction was compared with several existing methods at the nucleotide level using: (i) specificity and sensitivity values; (ii) receiver operating characteristic (ROC) curves; and (iii) area under the ROC curve. Simulation results confirmed that the proposed method can be used as a promising tool for exon prediction in DNA sequences. PMID:24672762
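
    Two of the steps described here, the electron ion interaction potential (EIIP) mapping and the Goertzel extraction of the period-3 component, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the EIIP values are the ones commonly quoted in the literature, the wavelet denoising step is omitted entirely, and the function names, window length, and step size are hypothetical.

```python
import math

# Commonly quoted EIIP values for the four nucleotides (assumed, not given in the abstract).
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}


def eiip_signal(seq):
    """Map a DNA string to its numeric EIIP sequence."""
    return [EIIP[base] for base in seq.upper()]


def goertzel_power(samples, k):
    """Squared magnitude of DFT bin k of samples, via the Goertzel recursion."""
    n = len(samples)
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


def period3_profile(seq, window=90, step=3):
    """Period-3 power (DFT bin window/3) in sliding windows along seq."""
    if window % 3:
        raise ValueError("window must be a multiple of 3")
    signal = eiip_signal(seq)
    k = window // 3
    return [
        goertzel_power(signal[i:i + window], k)
        for i in range(0, len(signal) - window + 1, step)
    ]
```

    Because only the single bin at N/3 is needed, the Goertzel recursion costs one multiplication per sample per window rather than a full transform, which is the source of the speed advantage the abstract refers to. Peaks in the resulting profile would mark candidate exonic regions.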

  8. Amino acid sequence repertoire of the bacterial proteome and the occurrence of untranslatable sequences

    PubMed Central

    Navon, Sharon Penias; Kornberg, Guy; Chen, Jin; Schwartzman, Tali; Tsai, Albert; Puglisi, Elisabetta Viani; Puglisi, Joseph D.; Adir, Noam

    2016-01-01

    Bioinformatic analysis of Escherichia coli proteomes revealed that all possible amino acid triplet sequences occur at their expected frequencies, with four exceptions. Two of the four underrepresented sequences (URSs) were shown to interfere with translation in vivo and in vitro. Enlarging the URS by a single amino acid resulted in increased translational inhibition. Single-molecule methods revealed stalling of translation at the entrance of the peptide exit tunnel of the ribosome, adjacent to ribosomal nucleotides A2062 and U2585. Interaction with these same ribosomal residues is involved in regulation of translation by longer, naturally occurring protein sequences. The E. coli exit tunnel has evidently evolved to minimize interaction with the exit tunnel and maximize the sequence diversity of the proteome, although allowing some interactions for regulatory purposes. Bioinformatic analysis of the human proteome revealed no underrepresented triplet sequences, possibly reflecting an absence of regulation by interaction with the exit tunnel. PMID:27307442
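
    The comparison of observed and expected triplet frequencies can be illustrated with a short Python sketch. The abstract does not state the null model, so the expectation used here (residues drawn independently at their overall proteome frequencies) is an assumption, and the function name and toy input are hypothetical.

```python
from collections import Counter
from itertools import product


def triplet_observed_expected(proteins):
    """Map each amino acid triplet to (observed count, expected count).

    Expected counts assume independent residues at their overall
    frequencies; this null model is an assumption for illustration.
    """
    residue_counts = Counter()
    triplet_counts = Counter()
    for protein in proteins:
        residue_counts.update(protein)
        triplet_counts.update(protein[i:i + 3] for i in range(len(protein) - 2))
    n_residues = sum(residue_counts.values())
    n_triplets = sum(triplet_counts.values())
    freq = {aa: count / n_residues for aa, count in residue_counts.items()}
    table = {}
    # Enumerate every possible triplet so that never-observed ones are included.
    for a, b, c in product(sorted(freq), repeat=3):
        expected = n_triplets * freq[a] * freq[b] * freq[c]
        table[a + b + c] = (triplet_counts[a + b + c], expected)
    return table


# Toy two-letter "proteome", purely for demonstration.
table = triplet_observed_expected(["AAAA", "AB"])
```

    Triplets whose observed count falls far below the expected count would be candidates for the underrepresented sequences discussed above; a real analysis would also need a significance test.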

  9. Amino acid sequence repertoire of the bacterial proteome and the occurrence of untranslatable sequences.

    PubMed

    Navon, Sharon Penias; Kornberg, Guy; Chen, Jin; Schwartzman, Tali; Tsai, Albert; Puglisi, Elisabetta Viani; Puglisi, Joseph D; Adir, Noam

    2016-06-28

    Bioinformatic analysis of Escherichia coli proteomes revealed that all possible amino acid triplet sequences occur at their expected frequencies, with four exceptions. Two of the four underrepresented sequences (URSs) were shown to interfere with translation in vivo and in vitro. Enlarging the URS by a single amino acid resulted in increased translational inhibition. Single-molecule methods revealed stalling of translation at the entrance of the peptide exit tunnel of the ribosome, adjacent to ribosomal nucleotides A2062 and U2585. Interaction with these same ribosomal residues is involved in regulation of translation by longer, naturally occurring protein sequences. The E. coli exit tunnel has evidently evolved to minimize interaction with the exit tunnel and maximize the sequence diversity of the proteome, although allowing some interactions for regulatory purposes. Bioinformatic analysis of the human proteome revealed no underrepresented triplet sequences, possibly reflecting an absence of regulation by interaction with the exit tunnel.

  10. Predicting terrorist actions using sequence learning and past events

    NASA Astrophysics Data System (ADS)

    Ruda, Harald; Das, Subrata K.; Zacharias, Greg L.

    2003-09-01

    This paper describes the application of sequence learning to the domain of terrorist group actions. The goal is to make accurate predictions of future events based on learning from past history. The past history of the group is represented as a sequence of events. Well-established sequence learning approaches are used to generate temporal rules from the event sequence. In order to represent all the possible events involving a terrorist group's activities, an event taxonomy has been created that organizes the events into a hierarchical structure. The event taxonomy is applied when events are extracted, and the hierarchical form of the taxonomy is especially useful when only scant information is available about an event. The taxonomy can also be used to generate temporal rules at various levels of abstraction. The generated temporal rules are used to generate predictions that can be compared to actual events for evaluation. The approach was tested on events collected for a four-year period from relevant newspaper articles and other open-source literature. Temporal rules were generated based on the first half of the data, and predictions were generated for the second half of the data. Evaluation yielded a high hit rate and a moderate false-alarm rate.

  11. Amino acid sequences of proteins from Leptospira serovar pomona.

    PubMed

    Alves, S F; Lefebvre, R B; Probert, W

    2000-01-01

    This report describes partial amino acid sequences from three putative outer envelope proteins from Leptospira serovar pomona. In order to obtain internal fragments for protein sequencing, enzymatic and chemical digestion was performed. The enzyme clostripain was used to digest the 32 and 45 kDa proteins. In situ digestion of the 40 kDa protein was accomplished using cyanogen bromide. The 32 kDa protein generated two fragments, one of 21 kDa and another of 10 kDa that yielded five residues. A fragment of 24 kDa that yielded nineteen amino acid residues was obtained from the 45 kDa protein. A fragment with a molecular weight of 20 kDa, yielding a twenty-amino-acid sequence, was obtained from the 40 kDa protein.

  12. QGRS-H Predictor: a web server for predicting homologous quadruplex forming G-rich sequence motifs in nucleotide sequences

    PubMed Central

    Menendez, Camille; Frees, Scott; Bagga, Paramjeet S.

    2012-01-01

    Naturally occurring G-quadruplex structural motifs, formed by guanine-rich nucleic acids, have been reported in telomeric, promoter and transcribed regions of mammalian genomes. G-quadruplex structures have received significant attention because of growing evidence for their role in important biological processes, human disease and as therapeutic targets. Lately, there has been much interest in the potential roles of RNA G-quadruplexes as cis-regulatory elements of post-transcriptional gene expression. Large-scale computational genomics studies on G-quadruplexes have difficulty validating their predictions without laborious testing in ‘wet’ labs. We have developed a bioinformatics tool, QGRS-H Predictor that can map and analyze conserved putative Quadruplex forming 'G'-Rich Sequences (QGRS) in mRNAs, ncRNAs and other nucleotide sequences, e.g. promoter, telomeric and gene flanking regions. Identifying conserved regulatory motifs helps validate computations and enhances accuracy of predictions. The QGRS-H Predictor is particularly useful for mapping homologous G-quadruplex forming sequences as cis-regulatory elements in the context of 5′- and 3′-untranslated regions, and CDS sections of aligned mRNA sequences. QGRS-H Predictor features highly interactive graphic representation of the data. It is a unique and user-friendly application that provides many options for defining and studying G-quadruplexes. The QGRS-H Predictor can be freely accessed at: http://quadruplex.ramapo.edu/qgrs/app/start. PMID:22576365

  13. Amino acid sequence of porcine spleen cathepsin D.

    PubMed Central

    Shewale, J G; Tang, J

    1984-01-01

    The amino acid sequence of porcine spleen cathepsin D heavy chain has been determined and, hence, the complete structure of this enzyme is now known. The sequence of heavy chain was constructed by aligning the structures of peptides generated by cyanogen bromide, trypsin, and endo-proteinase Lys C cleavages. The structure of the light chain has been published previously. The cathepsin D molecule contains 339 amino acid residues in two polypeptide chains: a 97-residue light chain and a 242-residue heavy chain, with a combined Mr of 36,779 (without carbohydrate). There are two carbohydrate units linked to asparagine residues 70 and 192. The disulfide bond arrangement in cathepsin D is probably similar to that of pepsin, because the positions of six half-cystine residues are conserved. The active site aspartyl residues, corresponding to aspartic acid-32 and -215 of pepsin, are located at residues 33 and 224 in the cathepsin D molecule. The amino acid sequence around these aspartyl residues is strongly conserved. Cathepsin D shows a strong homology with other acid proteases. When the sequence of cathepsin D, renin, and pepsin are aligned, 32.7% of the residues are identical. The homology is observed throughout the length of the molecules, indicating that three-dimensional structures of all three molecules are similar. PMID:6587385

  14. Amino acid sequences of bacterial cytochromes c' and c-556.

    PubMed Central

    Ambler, R P; Bartsch, R G; Daniel, M; Kamen, M D; McLellan, L; Meyer, T E; Van Beeumen, J

    1981-01-01

    The cytochromes c' are electron transport proteins widely distributed in photosynthetic and aerobic bacteria. We report the amino acid sequences of the proteins from 12 different bacterial species, and we show by sequences that the cytochromes c-556 from 2 different bacteria are structurally related to the cytochromes c'. Unlike the mitochondrial cytochromes c, the heme binding site in the cytochromes c' and c-556 is near the COOH terminus. The cytochromes c-556 probably have a methionine sixth heme ligand located near the NH2 terminus, whereas the cytochromes c' may be pentacoordinate. Quantitative comparison of cytochrome c' and c-556 sequences indicates a relatively low 28% average identity. PMID:6273892

  15. Sequencing and computational analysis of complete genome sequences of Citrus yellow mosaic badna virus from acid lime and pummelo.

    PubMed

    Borah, Basanta K; Johnson, A M Anthony; Sai Gopal, D V R; Dasgupta, Indranil

    2009-08-01

    Citrus yellow mosaic badna virus (CMBV), a member of the Family Caulimoviridae, Genus Badnavirus, is the causative agent of Citrus mosaic disease in India. Although the virus has been detected in several citrus species, only two full-length genomes, one each from Sweet orange and Rangpur lime, are available in publicly accessible databases. In order to obtain a better understanding of the genetic variability of the virus in other citrus mosaic-affected citrus species, we performed the cloning and sequence analysis of complete genomes of CMBV from two additional citrus species, Acid lime and Pummelo. We show that CMBV genomes from the two hosts share high homology with previously reported CMBV sequences and hence conclude that the new isolates represent variants of the virus present in these species. Based on in silico sequence analysis, we predict the possible function of the protein encoded by one of the five ORFs.

  16. Prediction of uridine modifications in tRNA sequences.

    PubMed

    Panwar, Bharat; Raghava, Gajendra P S

    2014-10-02

    In the past, a number of methods have been developed for predicting post-translational modifications in proteins. In contrast, few attempts have been made to understand post-transcriptional modifications. Recently it has been shown that tRNA modifications play a direct role in genome structure and codon usage. This study is an attempt to understand kingdom-wise tRNA modifications, particularly uridine modifications (UMs), as the majority of modifications are uridine-derived. A three-step strategy has been applied to develop an efficient method for the prediction of UMs. In the first step, we developed a common prediction model for all the kingdoms using a dataset from MODOMICS-2008. Support Vector Machine (SVM) based prediction models were developed and evaluated by a five-fold cross-validation technique. Different approaches were applied, and a hybrid approach of binary and structural information was found to achieve the highest area under the curve (AUC) of 0.936. In the second step, we used newly added tRNA sequences of MODOMICS-2012 as an independent dataset to evaluate the kingdom-wise prediction performance of the common model developed in the first step, and achieved AUCs between 0.910 and 0.949. In the third and last step, we used different datasets from MODOMICS-2012 to develop individual kingdom-wise prediction models and achieved AUCs between 0.915 and 0.987. The hybrid approach is efficient not only at predicting kingdom-wise modifications but also at classifying them into the two most prominent UMs: Pseudouridine (Y) and Dihydrouridine (D). A webserver called tRNAmod (http://crdd.osdd.net/raghava/trnamod/) has been developed, which predicts UMs from both tRNA sequences and whole genomes.

  17. Prediction of Functional Class of Proteins and Peptides Irrespective of Sequence Homology by Support Vector Machines

    PubMed Central

    Tang, Zhi Qun; Lin, Hong Huang; Zhang, Hai Lei; Han, Lian Yi; Chen, Xin; Chen, Yu Zong

    2007-01-01

    Various computational methods have been used for the prediction of protein and peptide function based on their sequences. A particular challenge is to derive functional properties from sequences that show low or no homology to proteins of known function. Recently, a machine learning method, support vector machines (SVM), has been explored for predicting the functional class of proteins and peptides from amino acid sequence-derived properties independent of sequence similarity, and has shown promising potential for a wide spectrum of protein and peptide classes including some of the low- and non-homologous proteins. This method can thus be explored as a potential tool to complement alignment-based, clustering-based, and structure-based methods for predicting protein function. This article reviews the strategies, current progress, and underlying difficulties in using SVM for predicting the functional class of proteins. The relevant software and web-servers are described. The reported prediction performances in the application of these methods are also presented. PMID:20066123

  18. Proteus mirabilis fimbriae: N-terminal amino acid sequence of a major fimbrial subunit and nucleotide sequences of the genes from two strains.

    PubMed

    Bahrani, F K; Cook, S; Hull, R A; Massad, G; Mobley, H L

    1993-03-01

    Proteus mirabilis, a common cause of urinary tract infection in hospitalized and catheterized patients, produces mannose-resistant/klebsiella-like (MR/K) and mannose-resistant/proteus-like (MR/P) hemagglutinins. The gene encoding the major structural subunit of a fimbria, possibly MR/K, was identified in two strains. A degenerate oligonucleotide probe based on the N terminus of the Proteus uroepithelial cell adhesin and antiserum raised against the denatured polypeptide were used to screen a cosmid gene bank of strain HU1069. A cosmid clone that reacted with the probe and antiserum was identified, and a fimbria-like open reading frame was determined by nucleotide sequencing. The predicted N-terminal amino acid sequence of the processed polypeptide, ENETPAPKVSSTKGEIQLKG (residues 23 to 42), did not match the uroepithelial cell adhesin N terminus but, rather, matched exactly the N-terminal amino acid sequence of a polypeptide with an apparent molecular size of 19.5 kDa isolated by sodium dodecyl sulfate-polyacrylamide gel electrophoresis of a fimbrial preparation from strain HI4320 expressing MR/K hemagglutinin. By using an oligonucleotide from the HU1069 open reading frame, the fimbrial gene was isolated and sequenced from a cosmid gene bank clone of strain HI4320. A 552-bp open reading frame predicts a 184-amino-acid polypeptide including a 22-amino-acid hydrophobic leader sequence. The unprocessed polypeptide is predicted to be 18,921 Da; the processed polypeptide is predicted to be 16,749 Da. The predicted amino acid sequence of the polypeptide encoded by the gene, designated pmfA, displayed 36% exact matches with the mannose-resistant fimbrial subunit encoded by smfA of Serratia marcescens but only 15% exact matches with the predicted sequence encoded by mrkA of Klebsiella pneumoniae.

  19. AcalPred: a sequence-based tool for discriminating between acidic and alkaline enzymes.

    PubMed

    Lin, Hao; Chen, Wei; Ding, Hui

    2013-01-01

    The structure and activity of enzymes are influenced by the pH of their surroundings. Although many enzymes work well in the pH range from 6 to 8, some enzymes are efficient only in acidic (pH<5) or alkaline (pH>9) solution. Studies have demonstrated that the activities of enzymes correlate with their primary sequences. Judging an enzyme's adaptation to an acidic or alkaline environment from its amino acid sequence is crucial for clarifying molecular mechanisms and for designing highly efficient enzymes. In this study, we developed a sequence-based method to discriminate acidic enzymes from alkaline enzymes. Analysis of variance was used to choose the optimal discriminating features derived from g-gap dipeptide compositions, and a support vector machine was used to build the prediction model. In rigorous jackknife cross-validation, an overall accuracy of 96.7% was achieved. The method correctly predicts 96.3% of acidic and 97.1% of alkaline enzymes. Comparison with previous methods shows that the proposed method is more accurate. On the basis of this method, we have built a freely accessible online web server called AcalPred (http://lin.uestc.edu.cn/server/AcalPred). We believe that AcalPred will become a powerful tool for studying enzyme adaptation to acidic or alkaline environments.
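
    The g-gap dipeptide composition named in this abstract can be illustrated with a short sketch. It assumes the common definition in which a g-gap dipeptide is a pair of residues separated by g intervening positions, with each of the 400 possible pairs reported as a relative frequency. The function name and the handling of non-standard residues are illustrative choices, not details taken from AcalPred.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def g_gap_dipeptide_composition(sequence, g=0):
    """Return the relative frequency of all 400 g-gap dipeptides.

    A g-gap dipeptide pairs residue i with residue i + g + 1, so g=0
    gives ordinary adjacent dipeptides. Pairs that contain a
    non-standard residue are ignored (an assumption of this sketch).
    """
    sequence = sequence.upper()
    step = g + 1
    pairs = [
        sequence[i] + sequence[i + step]
        for i in range(len(sequence) - step)
        if sequence[i] in AMINO_ACIDS and sequence[i + step] in AMINO_ACIDS
    ]
    counts = Counter(pairs)
    total = len(pairs)
    return {
        a + b: (counts[a + b] / total if total else 0.0)
        for a, b in product(AMINO_ACIDS, repeat=2)
    }
```

    Each sequence then becomes a 400-dimensional vector per value of g, from which a feature-selection step such as the analysis of variance described above can pick the most discriminating components.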

  20. A new hybrid fractal algorithm for predicting thermophilic nucleotide sequences.

    PubMed

    Lu, Jin-Long; Hu, Xue-Hai; Hu, Dong-Gang

    2012-01-21

    Knowledge of the thermophilic mechanisms of organisms whose optimum growth temperature (OGT) ranges from 50 to 80 °C plays a major role in helping to design stable proteins. How to predict whether a DNA sequence is thermophilic is a long-standing problem that is not yet fully resolved. Chaos game representation (CGR) can investigate the patterns hidden in DNA sequences and can visually reveal previously unknown structure. Fractal dimensions are good tools for measuring the sizes of complex, highly irregular geometric objects. In this paper, we convert every DNA sequence into a high-dimensional vector using the CGR algorithm and fractal dimension, and then predict the thermostability of the DNA sequence from these fractal features with a support vector machine (SVM). We conducted experiments on three groups: 17-dimensional, 65-dimensional, and 257-dimensional vectors. Each group was evaluated by 10-fold cross-validation. The 257-dimensional group gives the best results: the average accuracy is 0.9456 and the average MCC is 0.8878. The results are also compared with previous work using CGR features alone. The comparison shows the high effectiveness of the new hybrid fractal algorithm.
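
    For readers unfamiliar with chaos game representation, the following is a minimal sketch of the standard construction: each nucleotide is assigned a corner of the unit square, and every base moves the current point halfway toward its corner. The corner assignment, the function names and the grid-counting step are assumptions made for illustration; the abstract does not give the authors' exact feature construction, and the fractal-dimension feature is not computed here.

```python
# Corner assignment is one common convention, assumed here for illustration.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(dna):
    """Return the CGR trajectory of a DNA sequence as (x, y) points."""
    x, y = 0.5, 0.5  # start at the centre of the unit square
    points = []
    for base in dna.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N (sketch-level choice)
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2  # move halfway to the corner
        points.append((x, y))
    return points


def cgr_grid(dna, k=2):
    """Count CGR points in a 2^k x 2^k grid, giving 4^k cell counts."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for x, y in cgr_points(dna):
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        grid[row][col] += 1
    return grid
```

    Flattening the grid gives a fixed-length vector of 4^k counts regardless of sequence length, which is what makes the representation convenient as classifier input.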

  1. Predicting Motor Sequence Learning in Individuals With Chronic Stroke.

    PubMed

    Wadden, Katie P; Asis, Kristopher De; Mang, Cameron S; Neva, Jason L; Peters, Sue; Lakhani, Bimal; Boyd, Lara A

    2017-01-01

    Conventionally, change in motor performance is quantified with discrete measures of behavior taken pre- and postpractice. As a high degree of movement variability exists in motor performance after stroke, pre- and posttesting of motor skill may lack sensitivity to predict potential for motor recovery. The objective was to evaluate the use of predictive models of motor learning based on individual performance curves and clinical characteristics of motor function in individuals with stroke. Ten healthy individuals and fourteen individuals with chronic stroke performed a continuous joystick-based tracking task over 6 days, and at a 24-hour delayed retention test, to assess implicit motor sequence learning. Individuals with chronic stroke demonstrated significantly slower rates of improvement in implicit sequence-specific motor performance compared with a healthy control (HC) group when root mean squared error performance data were fit to an exponential function. The HC group showed a positive relationship between a faster rate of change in implicit sequence-specific motor performance during practice and superior performance at the delayed retention test. The same relationship was shown for individuals with stroke only after accounting for overall motor function by including Wolf Motor Function Test rate in our model. Nonlinear information extracted from multiple time points across practice, specifically the rate of motor skill acquisition during practice, relates strongly with changes in motor behavior at the retention test following practice and could be used to predict optimal doses of practice on an individual basis. © The Author(s) 2016.

  2. Translating HIV sequences into quantitative fitness landscapes predicts viral vulnerabilities for rational immunogen design.

    PubMed

    Ferguson, Andrew L; Mann, Jaclyn K; Omarjee, Saleha; Ndung'u, Thumbi; Walker, Bruce D; Chakraborty, Arup K

    2013-03-21

    A prophylactic or therapeutic vaccine offers the best hope to curb the HIV-AIDS epidemic gripping sub-Saharan Africa, but it remains elusive. A major challenge is the extreme viral sequence variability among strains. Systematic means to guide immunogen design for highly variable pathogens like HIV are not available. Using computational models, we have developed an approach to translate available viral sequence data into quantitative landscapes of viral fitness as a function of the amino acid sequences of its constituent proteins. Predictions emerging from our computationally defined landscapes for the proteins of HIV-1 clade B Gag were positively tested against new in vitro fitness measurements and were consistent with previously defined in vitro measurements and clinical observations. These landscapes chart the peaks and valleys of viral fitness as protein sequences change and inform the design of immunogens and therapies that can target regions of the virus most vulnerable to selection pressure.

  3. Translating HIV sequences into quantitative fitness landscapes predicts viral vulnerabilities for rational immunogen design

    PubMed Central

    Ferguson, Andrew L.; Mann, Jaclyn K.; Omarjee, Saleha; Ndung’u, Thumbi; Walker, Bruce D.; Chakraborty, Arup K.

    2013-01-01

    Summary: A prophylactic or therapeutic vaccine offers the best hope to curb the HIV-AIDS epidemic gripping sub-Saharan Africa, but remains elusive. A major challenge is the extreme viral sequence variability among strains. Systematic means to guide immunogen design for highly variable pathogens like HIV are not available. Using computational models, we have developed an approach to translate available viral sequence data into quantitative landscapes of viral fitness as a function of the amino acid sequences of its constituent proteins. Predictions emerging from our computationally defined landscapes for the proteins of HIV-1 clade B Gag were positively tested against new in vitro fitness measurements, and were consistent with previously defined in vitro measurements and clinical observations. These landscapes chart the peaks and valleys of viral fitness as protein sequences change, and inform the design of immunogens and therapies that can target regions of the virus most vulnerable to selection pressure. PMID:23521886

  4. Learning to predict: Exposure to temporal sequences facilitates prediction of future events

    PubMed Central

    Baker, Rosalind; Dexter, Matthew; Hardwicke, Tom E.; Goldstone, Aimee; Kourtzi, Zoe

    2014-01-01

    Previous experience is thought to facilitate our ability to extract spatial and temporal regularities from cluttered scenes. However, little is known about how we may use this knowledge to predict future events. Here we test whether exposure to temporal sequences facilitates the visual recognition of upcoming stimuli. We presented observers with a sequence of leftwards and rightwards oriented gratings that was interrupted by a test stimulus. Observers were asked to indicate whether the orientation of the test stimulus matched their expectation based on the preceding sequence. Our results demonstrate that exposure to temporal sequences without feedback facilitates our ability to predict an upcoming stimulus. In particular, observers’ performance improved following exposure to structured but not random sequences. Improved performance lasted for a prolonged period and generalized to untrained stimulus orientations rather than sequences of different global structure, suggesting that observers acquire knowledge of the sequence structure rather than its items. Further, this learning was compromised when observers performed a dual task resulting in increased attentional load. These findings suggest that exposure to temporal regularities in a scene allows us to accumulate knowledge about its global structure and predict future events. PMID:24231115

  5. Improved nucleic acid descriptors for siRNA efficacy prediction

    PubMed Central

    Sciabola, Simone; Cao, Qing; Orozco, Modesto; Faustino, Ignacio; Stanton, Robert V.

    2013-01-01

    Although considerable progress has been made recently in understanding how gene silencing is mediated by the RNAi pathway, the rational design of effective sequences is still a challenging task. In this article, we demonstrate that including three-dimensional descriptors improved the discrimination between active and inactive small interfering RNAs (siRNAs) in a statistical model. Five descriptor types were used: (i) nucleotide position along the siRNA sequence, (ii) nucleotide composition in terms of presence/absence of specific combinations of di- and trinucleotides, (iii) nucleotide interactions by means of a modified auto- and cross-covariance function, (iv) nucleotide thermodynamic stability derived by the nearest neighbor model representation and (v) nucleic acid structure flexibility. The duplex flexibility descriptors are derived from extended molecular dynamics simulations, which are able to describe the sequence-dependent elastic properties of RNA duplexes, even for non-standard oligonucleotides. The matrix of descriptors was analysed using three statistical packages in R (partial least squares, random forest, and support vector machine), and the most predictive model was implemented in a modeling tool we have made publicly available through SourceForge. Our implementation of new RNA descriptors coupled with appropriate statistical algorithms resulted in improved model performance for the selection of siRNA candidates when compared with publicly available siRNA prediction tools and previously published test sets. Additional validation studies based on in-house RNA interference projects confirmed the robustness of the scoring procedure in prospective studies. PMID:23241392

  6. SVM-PB-Pred: SVM based protein block prediction method using sequence profiles and secondary structures.

    PubMed

    Suresh, V; Parthasarathy, S

    2014-01-01

    We developed a support vector machine based web server called SVM-PB-Pred to predict the Protein Block for any given amino acid sequence. The input features of SVM-PB-Pred include (i) sequence profiles (PSSM) and (ii) actual secondary structures (SS) from the DSSP method or predicted secondary structures from the NPS@ and GOR4 methods. Three combined input features, PSSM+SS(DSSP), PSSM+SS(NPS@) and PSSM+SS(GOR4), were used to train and test the SVM models. Similarly, four datasets, RS90, DB433, LI1264 and SP1577, were used to develop the SVM models. The four SVM models were tested using three different benchmarking tests, namely (i) a self-consistency test, (ii) a seven-fold cross-validation test and (iii) an independent case test. The maximum prediction accuracy of ~70% was observed in the self-consistency test for the SVM models of both the LI1264 and SP1577 datasets, where PSSM+SS(DSSP) input features were used for testing. The prediction accuracies were reduced to ~53% for PSSM+SS(NPS@) and ~43% for PSSM+SS(GOR4) in the independent case test for the SVM models of the same two datasets. Using our method, it is possible to predict the protein block letters for any query protein sequence with ~53% accuracy when the SP1577 dataset and predicted secondary structure from the NPS@ server are used. The SVM-PB-Pred server can be freely accessed through http://bioinfo.bdu.ac.in/~svmpbpred.

  7. Active site amino acid sequence of human factor D.

    PubMed

    Davis, A E

    1980-08-01

    Factor D was isolated from human plasma by chromatography on CM-Sephadex C50, Sephadex G-75, and hydroxylapatite. Digestion of reduced, S-carboxymethylated factor D with cyanogen bromide resulted in three peptides which were isolated by chromatography on Sephadex G-75 (superfine) equilibrated in 20% formic acid. NH2-Terminal sequences were determined by automated Edman degradation with a Beckman 890C sequencer using a 0.1 M Quadrol program. The smallest peptide (CNBr III) consisted of the NH2-terminal 14 amino acids. The other two peptides had molecular weights of 17,000 (CNBr I) and 7000 (CNBr II). Overlap of the NH2-terminal sequence of factor D with the NH2-terminal sequence of CNBr I established the order of the peptides. The NH2-terminal 53 residues of factor D are somewhat more homologous with the group-specific protease of rat intestine than with other serine proteases. The NH2-terminal sequence of CNBr II revealed the active site serine of factor D. The typical serine protease active site sequence (Gly-Asp-Ser-Gly-Gly-Pro) was found at residues 12-17. The region surrounding the active site serine does not appear to be more highly homologous with any one of the other serine proteases. The structural data obtained point out the similarities between factor D and the other proteases. However, complete definition of the degree of relationship between factor D and other proteases will require determination of the remainder of the primary structure.
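
    Locating such an active-site motif in a peptide sequence is a simple substring search. The sketch below uses the one-letter form of Gly-Asp-Ser-Gly-Gly-Pro (GDSGGP) and reports 1-based residue ranges. The function name is illustrative, and the example peptide in the usage note is a made-up placeholder, not the CNBr II sequence.

```python
def find_motif(sequence, motif="GDSGGP"):
    """Return 1-based (start, end) residue ranges of every motif occurrence."""
    hits = []
    start = sequence.find(motif)
    while start != -1:
        # Convert the 0-based string index to 1-based residue numbering.
        hits.append((start + 1, start + len(motif)))
        start = sequence.find(motif, start + 1)  # allow overlapping matches
    return hits
```

    With a placeholder peptide of eleven alanines followed by the motif, find_motif("A" * 11 + "GDSGGP" + "AAA") reports the range (12, 17), the same numbering style as the residues 12-17 quoted in the abstract.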

  8. Prediction of ribosome footprint profile shapes from transcript sequences

    PubMed Central

    Liu, Tzu-Yu; Song, Yun S.

    2016-01-01

    Motivation: Ribosome profiling is a useful technique for studying translational dynamics and quantifying protein synthesis. Applications of this technique have shown that ribosomes are not uniformly distributed along mRNA transcripts. Understanding how each transcript-specific distribution arises is important for unraveling the translation mechanism. Results: Here, we apply kernel smoothing to construct predictive features and build a sparse model to predict the shape of ribosome footprint profiles from transcript sequences alone. Our results on Saccharomyces cerevisiae data show that the marginal ribosome densities can be predicted with high accuracy. The proposed novel method has a wide range of applications, including inferring isoform-specific ribosome footprints, designing transcripts with fast translation speeds and discovering unknown modulation during translation. Availability and implementation: A software package called riboShape is freely available at https://sourceforge.net/projects/riboshape Contact: yss@berkeley.edu PMID:27307616

  9. Computational prediction of B cell epitopes from antigen sequences.

    PubMed

    Gao, Jianzhao; Kurgan, Lukasz

    2014-01-01

    Computational identification of B-cell epitopes from antigen chains is a difficult and actively pursued research topic. Efforts towards the development of methods for the prediction of linear epitopes span the last three decades, while only recently have several predictors of conformational epitopes been released. We review a comprehensive set of 13 recent approaches that predict linear and 4 methods that predict conformational B-cell epitopes from the antigen sequences. We introduce several databases of B-cell epitopes, since the availability of the corresponding data is at the heart of the development and validation of computational predictors. We also offer practical insights concerning the use and availability of these B-cell epitope predictors, and motivate and discuss future research in this area.

  10. Predicting sequences and structures of MHC-binding peptides: a computational combinatorial approach

    NASA Astrophysics Data System (ADS)

    Zeng, Jun; Treutlein, Herbert R.; Rudy, George B.

    2001-06-01

    Peptides bound to MHC molecules on the surface of cells convey critical information about the cellular milieu to immune system T cells. Predicting which peptides can bind an MHC molecule, and understanding their modes of binding, are important in order to design better diagnostic and therapeutic agents for infectious and autoimmune diseases. Due to the difficulty of obtaining sufficient experimental binding data for each human MHC molecule, computational modeling of MHC peptide-binding properties is necessary. This paper describes a computational combinatorial design approach to the prediction of peptides that bind an MHC molecule of known X-ray crystallographic or NMR-determined structure. The procedure uses chemical fragments as models for amino acid residues and produces a set of sequences for peptides predicted to bind in the MHC peptide-binding groove. The probabilities for specific amino acids occurring at each position of the peptide are calculated based on these sequences, and these probabilities show a good agreement with amino acid distributions derived from a MHC-binding peptide database. The method also enables prediction of the three-dimensional structure of MHC-peptide complexes. Docking, linking, and optimization procedures were performed with the XPLOR program [1].
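
    The per-position probability calculation described above can be sketched as a simple frequency profile over a set of equal-length peptides. This is a generic illustration, not the authors' procedure: the function name is hypothetical, and the peptides in the usage note are invented placeholders rather than predicted MHC binders.

```python
from collections import Counter


def position_probabilities(peptides):
    """Return, for each position, the observed probability of each residue.

    All peptides must have the same length; the result is a list with one
    {residue: probability} dictionary per position.
    """
    if not peptides:
        raise ValueError("at least one peptide is required")
    length = len(peptides[0])
    if any(len(p) != length for p in peptides):
        raise ValueError("all peptides must have the same length")
    profile = []
    for pos in range(length):
        counts = Counter(p[pos] for p in peptides)
        profile.append({aa: n / len(peptides) for aa, n in counts.items()})
    return profile
```

    For the placeholder set ["ALK", "ALR", "GLK", "ALK"], the first position yields A with probability 0.75 and G with 0.25. A profile of this kind is what would be compared against residue distributions taken from a database of known binding peptides.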

  11. The amino acid sequence of iguana (Iguana iguana) pancreatic ribonuclease.

    PubMed

    Zhao, W; Beintema, J J; Hofsteenge, J

    1994-01-15

    The pyrimidine-specific ribonuclease superfamily constitutes a group of homologous proteins so far found only in higher vertebrates. Four separate families are found in mammals, which have resulted from gene duplications in mammalian ancestors. To learn more about the evolutionary history of this superfamily, the primary structure and other characteristics of the pancreatic enzyme from iguana (Iguana iguana), a herbivorous lizard species belonging to the reptiles, have been determined. The polypeptide chain consists of 119 amino acid residues. The positions of insertions and deletions in the sequence are identical to those in the enzyme from snapping turtle. However, the two enzymes differ at 54% of the amino acid positions. Iguana ribonuclease contains no carbohydrate, although the enzyme possesses three recognition sites for carbohydrate attachment, and has a high number of acidic residues in a localized part of the sequence.

  12. Polymer sequencing by molecular machines: a framework for predicting the resolving power of a sliding contact force spectroscopy sequencing method.

    PubMed

    Dunlop, Alex; Bowman, Kate; Aarstad, Olav; Skjåk-Bræk, Gudmund; Stokke, Bjørn T; Round, Andrew N

    2017-10-02

    We evaluate an AFM-based single molecule force spectroscopy method for mapping sequences in otherwise difficult to sequence heteropolymers, including glycosylated proteins and glycans. The sliding contact force spectroscopy (SCFS) method exploits a sliding contact made between a nanopore threaded over a polymer axle and an AFM probe. We find that for sliding α- and β-cyclodextrin nanopores over a wide range of hydrophilic monomers, the free energy of sliding is proportional to the sum of two dimensionless, easily calculable parameters representing the relative partitioning of the monomer inside the nanopore or in the aqueous phase, and the friction arising from sliding the nanopore over the monomer. Using this relationship we calculate sliding energies for nucleic acids, amino acids, glycan and synthetic monomers and predict on the basis of these calculations that SCFS will detect N- and O-glycosylation of proteins and patterns of sidechains in glycans. For these applications, SCFS offers an alternative to sequence mapping by mass spectrometry or newly-emerging nanopore technologies that may be easily implemented using a standard AFM.

  13. Prediction of protein disorder on amino acid substitutions.

    PubMed

    Anoosha, P; Sakthivel, R; Gromiha, M Michael

    2015-12-15

    Intrinsically disordered regions of proteins are known to have many functional roles in cell signaling and regulatory pathways. The altered expression of these proteins due to mutations is associated with various diseases. Currently, most of the available methods focus on predicting the disordered proteins or the disordered regions in a protein. On the other hand, methods developed for predicting protein disorder on mutation showed a poor performance with a maximum accuracy of 70%. Hence, in this work, we have developed a novel method to classify the disorder-related amino acid substitutions using amino acid properties, substitution matrices, and the effect of neighboring residues that showed an accuracy of 90.0% with a sensitivity and specificity of 94.9 and 80.6%, respectively, in 10-fold cross-validation. The method was evaluated with a test set of 20% data using 10 iterations, which showed an average accuracy of 88.9%. Furthermore, we systematically analyzed the features responsible for the better performance of our method and observed that neighboring residues play an important role in defining the disorder of a given residue in a protein sequence. We have developed a prediction server to identify disorder-related mutations, and it is available at http://www.iitm.ac.in/bioinfo/DIM_Pred/. Copyright © 2015 Elsevier Inc. All rights reserved.
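
    The three figures of merit quoted in this abstract follow the usual confusion-matrix definitions, sketched below. The function name is illustrative and the counts in the usage note are hypothetical; the abstract does not report the underlying counts.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return accuracy, sensitivity and specificity from confusion counts.

    tp/fn are positives predicted correctly/incorrectly;
    tn/fp are negatives predicted correctly/incorrectly.
    """
    sensitivity = tp / (tp + fn)  # fraction of true positives recovered
    specificity = tn / (tn + fp)  # fraction of true negatives recovered
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
    }
```

    With hypothetical counts tp=90, fn=10, tn=80, fp=20, the sensitivity is 0.9, the specificity is 0.8 and the accuracy is 0.85. Accuracy depends on the class balance, which is why sensitivity and specificity are reported alongside it.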

  14. Amino acid sequence of bovine gamma E (IVa) lens crystallin.

    PubMed Central

    Kilby, G. W.; Sheil, M. M.; Shaw, D.; Harding, J. J.; Truscott, R. J.

    1997-01-01

    When electrospray ionization mass spectrometry (ES-MS) was used to analyze purified bovine gamma E (gamma IVa)-crystallin, it yielded a relative molecular mass (M(r)) of 20,955 +/- 5. This mass is significantly different from that calculated from the published sequence (M(r) 20,894) (White HE et al., 1989, J Mol Biol 207:217-235). Further, ES-MS analysis of the protein after it had been reduced and carboxymethylated indicated the presence of five cysteine residues, whereas the published sequence contains six (Kilby GW et al., 1995, Eur Mass Spectrom 1:203-208). The entire protein sequence of gamma E crystallin has therefore been studied via a combination of ES-MS, ES-MS/MS, and Edman amino acid sequencing. The corrected sequence gives an M(r) of 20,955.3, which matches that obtained by ES-MS analysis of the purified native protein. The corrected sequence is also in agreement with a recent cDNA sequence obtained for a bovine gamma-crystallin by R. Hay (pers. comm.). PMID:9098901

  15. Amino acid sequence of bovine gamma E (IVa) lens crystallin.

    PubMed

    Kilby, G W; Sheil, M M; Shaw, D; Harding, J J; Truscott, R J

    1997-04-01

    When electrospray ionization mass spectrometry (ES-MS) was used to analyze purified bovine gamma E (gamma IVa)-crystallin, it yielded a relative molecular mass (M(r)) of 20,955 +/- 5. This mass is significantly different from that calculated from the published sequence (M(r) 20,894) (White HE et al., 1989, J Mol Biol 207:217-235). Further, ES-MS analysis of the protein after it had been reduced and carboxymethylated indicated the presence of five cysteine residues, whereas the published sequence contains six (Kilby GW et al., 1995, Eur Mass Spectrom 1:203-208). The entire protein sequence of gamma E crystallin has therefore been studied via a combination of ES-MS, ES-MS/MS, and Edman amino acid sequencing. The corrected sequence gives an M(r) of 20,955.3, which matches that obtained by ES-MS analysis of the purified native protein. The corrected sequence is also in agreement with a recent cDNA sequence obtained for a bovine gamma-crystallin by R. Hay (pers. comm.).

  16. Amino acid sequence of bovine heart coupling factor 6.

    PubMed Central

    Fang, J K; Jacobs, J W; Kanner, B I; Racker, E; Bradshaw, R A

    1984-01-01

    The amino acid sequence of bovine heart mitochondrial coupling factor 6 (F6) has been determined by automated Edman degradation of the whole protein and derived peptides. Preparations based on heat precipitation and ethanol extraction showed allotypic variation at three positions, while material further purified by HPLC yielded only one sequence that also differed by a Phe-Thr replacement at residue 62. The mature protein contains 76 amino acids with a calculated molecular weight of 9006 and a pI of approximately 5, in good agreement with experimentally measured values. The charged amino acids are mainly clustered at the termini and in one section in the middle; these three polar segments are separated by two segments relatively rich in nonpolar residues. Chou-Fasman analysis suggests three stretches of alpha-helix coinciding with (or within) the high-charge-density sequences, with a single beta-turn at the first polar-nonpolar junction. Comparison of the F6 sequence with those of other proteins did not reveal any homologous structures. PMID:6149548

  17. Nucleotide and deduced amino acid sequences of Torpedo californica acetylcholine receptor gamma subunit.

    PubMed Central

    Claudio, T; Ballivet, M; Patrick, J; Heinemann, S

    1983-01-01

    The nucleotide sequence has been determined of a cDNA clone that codes for the 60,000-dalton gamma subunit of Torpedo californica acetylcholine receptor. The length of the cDNA clone is 2,010 base pairs. The 5' and 3' untranslated regions have respective lengths of 31 and 461 base pairs. Data suggest that the putative polyadenylylation consensus sequence A-A-T-A-A-A may not be required for polyadenylylation of the mRNA corresponding to the cDNA clone described in this study. From the DNA sequence data, the amino acid sequence of the gamma subunit was deduced. The subunit is composed of 489 amino acids giving a molecular mass of 56,600 daltons. The deduced amino acid sequence data also indicate the presence of a 17-amino acid extension or signal peptide on this subunit. From these data, structural predictions for the gamma subunit are made such as potential membrane-spanning regions, possible asparagine-linked glycosylation sites, and the assignment of regions of the protein to the extracellular, internal, and cytoplasmic domains of the lipid bilayer. PMID:6573658

  18. RNA-RNA interaction prediction based on multiple sequence alignments.

    PubMed

    Li, Andrew X; Marz, Manja; Qin, Jing; Reidys, Christian M

    2011-02-15

    Many computerized methods for RNA-RNA interaction structure prediction have been developed. Recently, O(N^6) time and O(N^4) space dynamic programming algorithms have become available that compute the partition function of RNA-RNA interaction complexes. However, few of these methods incorporate knowledge of related sequences, so relevant evolutionary information is often neglected in the structure determination. It is therefore of considerable practical interest to introduce a method that takes into consideration both thermodynamic stability and sequence/structure covariation. We present the a priori folding algorithm ripalign, whose input consists of two (given) multiple sequence alignments (MSA). ripalign outputs (i) the partition function, (ii) base pairing probabilities, (iii) hybrid probabilities and (iv) a set of Boltzmann-sampled suboptimal structures consisting of canonical joint structures that are compatible with the alignments. Compared to the single sequence-pair folding algorithm rip, ripalign requires negligible additional memory but offers much better sensitivity and specificity once alignments of suitable quality are given. ripalign additionally allows structure constraints to be incorporated as input parameters. The algorithm described here is implemented in C as part of the rip package.

  19. Constrained Multistate Sequence Design for Nucleic Acid Reaction Pathway Engineering.

    PubMed

    Wolfe, Brian R; Porubsky, Nicholas J; Zadeh, Joseph N; Dirks, Robert M; Pierce, Niles A

    2017-03-01

    We describe a framework for designing the sequences of multiple nucleic acid strands intended to hybridize in solution via a prescribed reaction pathway. Sequence design is formulated as a multistate optimization problem using a set of target test tubes to represent reactant, intermediate, and product states of the system, as well as to model crosstalk between components. Each target test tube contains a set of desired "on-target" complexes, each with a target secondary structure and target concentration, and a set of undesired "off-target" complexes, each with vanishing target concentration. Optimization of the equilibrium ensemble properties of the target test tubes implements both a positive design paradigm, explicitly designing for on-pathway elementary steps, and a negative design paradigm, explicitly designing against off-pathway crosstalk. Sequence design is performed subject to diverse user-specified sequence constraints including composition constraints, complementarity constraints, pattern prevention constraints, and biological constraints. Constrained multistate sequence design facilitates nucleic acid reaction pathway engineering for diverse applications in molecular programming and synthetic biology. Design jobs can be run online via the NUPACK web application.

  20. Improved therapy-success prediction with GSS estimated from clinical HIV-1 sequences.

    PubMed

    Pironti, Alejandro; Pfeifer, Nico; Kaiser, Rolf; Walter, Hauke; Lengauer, Thomas

    2014-01-01

    Rules-based HIV-1 drug-resistance interpretation (DRI) systems disregard many amino-acid positions of the drug's target protein. The aims of this study are (1) the development of a drug-resistance interpretation system that is based on HIV-1 sequences from clinical practice rather than hard-to-get phenotypes, and (2) the assessment of the benefit of taking all available amino-acid positions into account for DRI. A dataset containing 34,934 therapy-naïve and 30,520 drug-exposed HIV-1 pol sequences with treatment history was extracted from the EuResist database and the Los Alamos National Laboratory database. 2,550 therapy-change-episode baseline sequences (TCEB) were assigned to test set A. Test set B contains 1,084 TCEB from the HIVdb TCE repository. Sequences from patients absent in the test sets were used to train three linear support vector machines to produce scores that predict drug exposure pertaining to each of 20 antiretrovirals: the first one uses the full amino-acid sequences (DEfull), the second one only considers IAS drug-resistance positions (DEonlyIAS), and the third one disregards IAS drug-resistance positions (DEnoIAS). For performance comparison, test sets A and B were evaluated with DEfull, DEnoIAS, DEonlyIAS, geno2pheno[resistance], HIVdb, ANRS, HIV-GRADE, and REGA. Clinically-validated cut-offs were used to convert the continuous output of the first four methods into susceptible-intermediate-resistant (SIR) predictions. With each method, a genetic susceptibility score (GSS) was calculated for each therapy episode in each test set by converting the SIR prediction for its compounds to integer: S=2, I=1, and R=0. The GSS were used to predict therapy success as defined by the EuResist standard datum definition. Statistical significance was assessed using a Wilcoxon signed-rank test. 
    A comparison of the therapy-success prediction performances among the different interpretation systems for test set A can be found in Table 1, while those for test set
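
    The scoring step in this record can be sketched in a few lines. The mapping S=2, I=1, R=0 is taken from the abstract as given here; summing the per-compound scores into one GSS per therapy episode is an assumption of this sketch, as are the function name and the placeholder compound names in the usage note.

```python
# Score assigned to each SIR category, as stated in the abstract above.
SIR_TO_SCORE = {"S": 2, "I": 1, "R": 0}


def genetic_susceptibility_score(sir_predictions):
    """Sum the per-compound scores of one therapy episode.

    sir_predictions maps each compound in the regimen to "S", "I" or "R".
    Summation over compounds is an assumption of this sketch.
    """
    return sum(SIR_TO_SCORE[call] for call in sir_predictions.values())
```

    For a hypothetical three-compound regimen with calls S, I and R, the score is 2 + 1 + 0 = 3. A higher score indicates that more compounds in the regimen are predicted to remain active.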

  1. The complete amino acid sequence of chicken skeletal-muscle enolase.

    PubMed Central

    Russell, G A; Dunbar, B; Fothergill-Gilmore, L A

    1986-01-01

    The complete amino acid sequence of chicken skeletal-muscle enolase, comprising 433 residues, was determined. The sequence was deduced by automated sequencing of hydroxylamine-cleavage, CNBr-cleavage, o-iodosobenzoic acid-cleavage, clostripain-digest and staphylococcal-proteinase-digest fragments. The presence of several acid-labile peptide bonds and the tenacious aggregation of most CNBr-cleavage fragments meant that a commonly used sequencing strategy involving initial CNBr cleavage was unproductive. Cleavage at the single Asn-Gly peptide bond with hydroxylamine proved to be particularly useful. Comparison of the sequence of chicken enolase with the two yeast enolase isoenzyme sequences shows that the enzyme is strongly conserved, with 60% of the residues identical. The histidine and arginine residues implicated as being important for the activity of yeast enolase are conserved in the chicken enzyme. Secondary-structure predictions are analysed in an accompanying paper [Sawyer, Fothergill-Gilmore & Russell (1986) Biochem. J. 236, 127-130]. PMID:3539098

  2. Sequences Of Amino Acids For Human Serum Albumin

    NASA Technical Reports Server (NTRS)

    Carter, Daniel C.

    1992-01-01

    Sequences of amino acids defined for use in making polypeptides one-third to one-sixth as large as parent human serum albumin molecule. Smaller, chemically stable peptides have diverse applications including service as artificial human serum and as active components of biosensors and chromatographic matrices. In applications involving production of artificial sera from new sequences, little or no concern about viral contaminants. Smaller genetically engineered polypeptides more easily expressed and produced in large quantities, making commercial isolation and production more feasible and profitable.

  3. De novo structure prediction of globular proteins aided by sequence variation-derived contacts.

    PubMed

    Kosciolek, Tomasz; Jones, David T

    2014-01-01

    The advent of high accuracy residue-residue intra-protein contact prediction methods enabled a significant boost in the quality of de novo structure predictions. Here, we investigate the potential benefits of combining a well-established fragment-based folding algorithm--FRAGFOLD, with PSICOV, a contact prediction method which uses sparse inverse covariance estimation to identify co-varying sites in multiple sequence alignments. Using a comprehensive set of 150 diverse globular target proteins, up to 266 amino acids in length, we are able to address the effectiveness and some limitations of such approaches to globular proteins in practice. Overall we find that using fragment assembly with both statistical potentials and predicted contacts is significantly better than either statistical potentials or contacts alone. Results show up to nearly 80% of correct predictions (TM-score ≥0.5) within analysed dataset and a mean TM-score of 0.54. Unsuccessful modelling cases emerged either from conformational sampling problems, or insufficient contact prediction accuracy. Nevertheless, a strong dependency of the quality of final models on the fraction of satisfied predicted long-range contacts was observed. This not only highlights the importance of these contacts on determining the protein fold, but also (combined with other ensemble-derived qualities) provides a powerful guide as to the choice of correct models and the global quality of the selected model. A proposed quality assessment scoring function achieves 0.93 precision and 0.77 recall for the discrimination of correct folds on our dataset of decoys. These findings suggest the approach is well-suited for blind predictions on a variety of globular proteins of unknown 3D structure, provided that enough homologous sequences are available to construct a large and accurate multiple sequence alignment for the initial contact prediction step.

  4. De Novo Structure Prediction of Globular Proteins Aided by Sequence Variation-Derived Contacts

    PubMed Central

    Kosciolek, Tomasz; Jones, David T.

    2014-01-01

    The advent of high accuracy residue-residue intra-protein contact prediction methods enabled a significant boost in the quality of de novo structure predictions. Here, we investigate the potential benefits of combining a well-established fragment-based folding algorithm – FRAGFOLD, with PSICOV, a contact prediction method which uses sparse inverse covariance estimation to identify co-varying sites in multiple sequence alignments. Using a comprehensive set of 150 diverse globular target proteins, up to 266 amino acids in length, we are able to address the effectiveness and some limitations of such approaches to globular proteins in practice. Overall we find that using fragment assembly with both statistical potentials and predicted contacts is significantly better than either statistical potentials or contacts alone. Results show up to nearly 80% of correct predictions (TM-score ≥0.5) within analysed dataset and a mean TM-score of 0.54. Unsuccessful modelling cases emerged either from conformational sampling problems, or insufficient contact prediction accuracy. Nevertheless, a strong dependency of the quality of final models on the fraction of satisfied predicted long-range contacts was observed. This not only highlights the importance of these contacts on determining the protein fold, but also (combined with other ensemble-derived qualities) provides a powerful guide as to the choice of correct models and the global quality of the selected model. A proposed quality assessment scoring function achieves 0.93 precision and 0.77 recall for the discrimination of correct folds on our dataset of decoys. These findings suggest the approach is well-suited for blind predictions on a variety of globular proteins of unknown 3D structure, provided that enough homologous sequences are available to construct a large and accurate multiple sequence alignment for the initial contact prediction step. PMID:24637808

  5. Sequence-based prediction of protein-binding sites in DNA: comparative study of two SVM models.

    PubMed

    Park, Byungkyu; Im, Jinyong; Tuvshinjargal, Narankhuu; Lee, Wook; Han, Kyungsook

    2014-11-01

    As many structures of protein-DNA complexes have become known in recent years, several computational methods have been developed to predict DNA-binding sites in proteins. However, the inverse problem (i.e., predicting protein-binding sites in DNA) has received much less attention. One reason is that the differences between the interaction propensities of nucleotides are much smaller than those between amino acids. Another reason is that DNA exhibits less diverse sequence patterns than protein. Therefore, predicting protein-binding DNA nucleotides is much harder than predicting DNA-binding amino acids. We computed the interaction propensity (IP) of nucleotide triplets with amino acids using an extensive dataset of protein-DNA complexes, and developed two support vector machine (SVM) models that predict protein-binding nucleotides from sequence data alone. One SVM model predicts protein-binding nucleotides using DNA sequence data alone, and the other SVM model predicts protein-binding nucleotides using both DNA and protein sequences. In a 10-fold cross-validation with 1519 DNA sequences, the SVM model that uses DNA sequence data only predicted protein-binding nucleotides with an accuracy of 67.0%, an F-measure of 67.1%, and a Matthews correlation coefficient (MCC) of 0.340. With an independent dataset of 181 DNAs that were not used in training, it achieved an accuracy of 66.2%, an F-measure of 66.3% and an MCC of 0.324. The other SVM model, which uses both DNA and protein sequences, achieved an accuracy of 69.6%, an F-measure of 69.6%, and an MCC of 0.383 in a 10-fold cross-validation with 1519 DNA sequences and 859 protein sequences. With an independent dataset of 181 DNAs and 143 proteins, it showed an accuracy of 67.3%, an F-measure of 66.5% and an MCC of 0.329. Both in cross-validation and in independent testing, the second SVM model, which used both DNA and protein sequence data, showed better performance than the first model, which used DNA sequence data alone.
To the best of
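
    The three metrics reported in this record (accuracy, F-measure, MCC) all derive from a binary confusion matrix. The sketch below uses the standard definitions. The function name and the confusion-matrix counts are hypothetical and are not data from the study.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, F-measure, MCC) for a binary confusion matrix."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN));
    # by convention it is set to 0 when the denominator vanishes.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, f_measure, mcc


# Hypothetical balanced test set: 100 binding and 100 non-binding nucleotides.
accuracy, f_measure, mcc = classification_metrics(tp=67, tn=67, fp=33, fn=33)
```

    For these illustrative counts the accuracy and F-measure are both 0.67 while the MCC is 0.34, which shows why an MCC in this range is compatible with an accuracy in the high 60s on balanced data.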

  6. Controls on sequence development and preservation offshore Namibia: Implications for sequence stratigraphic models and hydrocarbon prediction

    SciTech Connect

    Bagguley, J.G.; Prosser, S.

    1996-01-01

    Regional seismic interpretation of the passive margin offshore Namibia has enabled a sequence stratigraphic framework to be established for this previously under-studied region. Within this framework, potential hydrocarbon plays (for example, the location of source, seal and reservoir rocks) can be pinpointed. The history of sequence stratigraphic models suggests that the passive margin offshore Namibia should provide an ideal setting for applying and testing sequence stratigraphic concepts. Results from this study, however, suggest that alongside the documented controls in sequence stratigraphy (i.e. tectonics, eustasy and sediment flux), additional factors act to influence sequence development and preservation along this margin. Detailed seismic interpretation of the post-rift section of the Namibian margin has led to the identification of a number of erosional and depositional events; for example, channels, canyons and slumps. Seismic facies analysis allows causative mechanisms to be inferred for the different geometries observed. In addition, the recognition of characteristic seismic facies enables reservoir and non-reservoir targets to be identified, thus aiding the prediction of potential hydrocarbon plays. Backstripping studies provide further information as to the evolution of the Namibian margin. For example, estimates can be made regarding changes in the rates of tectonics and sedimentation, and the relative importance of these factors on the development of the margin can be assessed.

  7. Controls on sequence development and preservation offshore Namibia: Implications for sequence stratigraphic models and hydrocarbon prediction

    SciTech Connect

    Bagguley, J.G.; Prosser, S.

    1996-12-31

    Regional seismic interpretation of the passive margin offshore Namibia has enabled a sequence stratigraphic framework to be established for this previously under-studied region. Within this framework, potential hydrocarbon plays (for example, the location of source, seal and reservoir rocks) can be pinpointed. The history of sequence stratigraphic models suggests that the passive margin offshore Namibia should provide an ideal setting for applying and testing sequence stratigraphic concepts. Results from this study, however, suggest that alongside the documented controls in sequence stratigraphy (i.e. tectonics, eustasy and sediment flux), additional factors act to influence sequence development and preservation along this margin. Detailed seismic interpretation of the post-rift section of the Namibian margin has led to the identification of a number of erosional and depositional events; for example, channels, canyons and slumps. Seismic facies analysis allows causative mechanisms to be inferred for the different geometries observed. In addition, the recognition of characteristic seismic facies enables reservoir and non-reservoir targets to be identified, thus aiding the prediction of potential hydrocarbon plays. Backstripping studies provide further information as to the evolution of the Namibian margin. For example, estimates can be made regarding changes in the rates of tectonics and sedimentation, and the relative importance of these factors on the development of the margin can be assessed.

  8. Nanopores and nucleic acids: prospects for ultrarapid sequencing

    NASA Technical Reports Server (NTRS)

    Deamer, D. W.; Akeson, M.

    2000-01-01

    DNA and RNA molecules can be detected as they are driven through a nanopore by an applied electric field at rates ranging from several hundred microseconds to a few milliseconds per molecule. The nanopore can rapidly discriminate between pyrimidine and purine segments along a single-stranded nucleic acid molecule. Nanopore detection and characterization of single molecules represents a new method for directly reading information encoded in linear polymers. If single-nucleotide resolution can be achieved, it is possible that nucleic acid sequences can be determined at rates exceeding a thousand bases per second.

  10. Nanopore-based sequencing and detection of nucleic acids.

    PubMed

    Ying, Yi-Lun; Zhang, Junji; Gao, Rui; Long, Yi-Tao

    2013-12-09

    Nanopore-based techniques, which mimic the functions of natural ion channels, have attracted increasing attention as unique methods for single-molecule detection. The technology allows the real-time, selective, high-throughput analysis of nucleic acids through both biological and solid-state nanopores. In this Minireview, the background and latest progress in nanopore-based sequencing and detection of nucleic acids are summarized, and light is shed on a novel platform for nanopore-based detection. Copyright © 2013 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  11. Synthetic oligonucleotide probes deduced from amino acid sequence data. Theoretical and practical considerations.

    PubMed

    Lathe, R

    1985-05-05

    Synthetic probes deduced from amino acid sequence data are widely used to detect cognate coding sequences in libraries of cloned DNA segments. The redundancy of the genetic code dictates that a choice must be made between (1) a mixture of probes reflecting all codon combinations, and (2) a single longer "optimal" probe. The second strategy is examined in detail. The frequency of sequences matching a given probe by chance alone can be determined and also the frequency of sequences closely resembling the probe and contributing to the hybridization background. Gene banks cannot be treated as random associations of the four nucleotides, and probe sequences deduced from amino acid sequence data occur more often than predicted by chance alone. Probe lengths must be increased to confer the necessary specificity. Examination of hybrids formed between unique homologous probes and their cognate targets reveals that short stretches of perfect homology occurring by chance make a significant contribution to the hybridization background. Statistical methods for improving homology are examined, taking human coding sequences as an example, and considerations of codon utilization and dinucleotide frequencies yield an overall homology of greater than 82%. Recommendations for probe design and hybridization are presented, and the choice between using multiple probes reflecting all codon possibilities and a unique optimal probe is discussed.
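
    The size of the probe mixture in strategy (1) follows directly from the degeneracy of the standard genetic code: the number of distinct coding sequences for a peptide is the product of the codon counts of its residues. The sketch below illustrates that arithmetic only. The function name and example peptides are hypothetical, and it does not reproduce the statistical analysis of the paper.

```python
from math import prod

# Number of codons per amino acid in the standard genetic code
# (one-letter amino acid codes; the 61 sense codons in total).
CODONS_PER_RESIDUE = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2,
    "Q": 2, "E": 2, "G": 4, "H": 2, "I": 3,
    "L": 6, "K": 2, "M": 1, "F": 2, "P": 4,
    "S": 6, "T": 4, "W": 1, "Y": 2, "V": 4,
}


def probe_pool_size(peptide):
    """Count the distinct nucleotide sequences that can encode a peptide."""
    return prod(CODONS_PER_RESIDUE[residue] for residue in peptide.upper())


# Hypothetical peptides: one low-degeneracy stretch, one high-degeneracy one.
low = probe_pool_size("MWHQK")   # 1 * 1 * 2 * 2 * 2
high = probe_pool_size("LSRLS")  # 6 * 6 * 6 * 6 * 6
```

    The contrast between the two examples (8 versus 7776 sequences for five residues) shows why mixed probes are practical only for peptide stretches rich in Met, Trp and two-codon residues, and why a single longer probe becomes attractive elsewhere.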

  12. Amino acid sequence of tyrosinase from Neurospora crassa.

    PubMed Central

    Lerch, K

    1978-01-01

    The amino-acid sequence of tyrosinase from Neurospora crassa (monophenol,dihydroxyphenylalanine:oxygen oxidoreductase, EC 1.14.18.1) is reported. This copper-containing oxidase consists of a single polypeptide chain of 407 amino acids. The primary structure was determined by automated and manual sequence analysis on fragments produced by cleavage with cyanogen bromide and on peptides obtained by digestion with trypsin, pepsin, thermolysin, or chymotrypsin. The amino terminus of the protein is acetylated and the single cysteinyl residue 96 is covalently linked via a thioether bridge to histidyl residue 94. The formation and the possible role of this unusual structure in Neurospora tyrosinase is discussed. Dye-sensitized photooxidation of apotyrosinase and active-site-directed inactivation of the native enzyme indicate the possible involvement of histidyl residues 188, 192, 289, and 305 or 306 as ligands to the active-site copper as well as in the catalytic mechanism of this monooxygenase. PMID:151279

  13. Unique sequences and predicted functions of myosins in Tetrahymena thermophila.

    PubMed

    Sugita, Maki; Iwataki, Yoshinori; Nakano, Kentaro; Numata, Osamu

    2011-07-01

    Myosins are eukaryotic actin-dependent molecular motors that play important roles in many cellular events. The function of each myosin is determined by a variety of functional domains in its tail region. In some major model organisms, the functions and properties of myosins have been investigated based on their amino acid sequences. However, in protists, myosins have been little studied beyond the level of genome sequences. We therefore investigated the mRNA expression levels and amino acid sequences of 13 myosin genes in the ciliate Tetrahymena thermophila. This study is an overview of myosins in T. thermophila, which has no typical myosins, such as class I, II, or V myosins. We showed that all 13 myosins were expressed in vegetative cells. Furthermore, these myosins could be divided into 3 subclasses based on four functional domains in their tail regions. Subclass 1 comprised of 8 myosins has both MyTH4 and FERM domains, and has a potential to function in vesicle transport or anchoring between membrane and actin filaments. Subclass 2 comprised of 4 myosins has RCC1 (regulator of chromosome condensation 1) domains, which are found only in some protists, and may have unconventional features. Subclass 3 is comprised of one myosin, which has a long coiled-coil domain like class II myosin. In addition, phylogenetic analysis on the basis of motor domains showed that T. thermophila myosins are separated into two clusters: one consists of subclasses 1 and 2, and the other consists of subclass 3. Copyright © 2011 Elsevier B.V. All rights reserved.

  14. Development of a sugar-binding residue prediction system from protein sequences using support vector machine.

    PubMed

    Banno, Masaki; Komiyama, Yusuke; Cao, Wei; Oku, Yuya; Ueki, Kokoro; Sumikoshi, Kazuya; Nakamura, Shugo; Terada, Tohru; Shimizu, Kentaro

    2017-02-01

    Several methods have been proposed for protein-sugar binding site prediction using machine learning algorithms. However, they are not effective at learning the various properties of binding site residues that arise from the various interactions between proteins and sugars. In this study, we classified sugars into acidic and nonacidic sugars and showed that their binding sites have different amino acid occurrence frequencies. Using this result, we developed sugar-binding residue predictors dedicated to the two classes of sugars: an acidic sugar binding predictor and a nonacidic sugar binding predictor. We also developed a combination predictor, which combines the results of the two predictors. We showed that when a sugar is known to be an acidic sugar, the acidic sugar binding predictor achieves the best performance, and that when a sugar is known to be a nonacidic sugar or is not known to belong to either class, the combination predictor achieves the best performance. Our method uses only amino acid sequences for prediction. A support vector machine was used as the machine learning algorithm, and the position-specific scoring matrix created by the position-specific iterative basic local alignment search tool was used as the feature vector. We evaluated the performance of the predictors using five-fold cross-validation. We have launched our system as an open source freeware tool on the GitHub repository (https://doi.org/10.5281/zenodo.61513).

  15. Complete amino acid sequence of three reptile lysozymes.

    PubMed

    Ponkham, Pornpimol; Daduang, Sakda; Kitimasak, Wachira; Krittanai, Chartchai; Chokchaichamnankit, Daranee; Srisomsap, Chantragan; Svasti, Jisnuson; Kawamura, Shunsuke; Araki, Tomohiro; Thammasirirak, Sompong

    2010-01-01

    To study the structure and function of reptile lysozymes, we have reported their purification, and in this study we established the amino acid sequences of three egg white lysozymes from soft-shelled turtle eggs (SSTL A and SSTL B from Trionyx sinensis, ASTL from Amyda cartilaginea) by using the rapid peptide mapping method. The established amino acid sequences of SSTL A, SSTL B, and ASTL showed substitutions of 43, 42, and 44 residues, respectively, when compared with the HEWL (hen egg white lysozyme) sequence. Among these reptile lysozymes, SSTL A had one substitution compared with SSTL B (Gly126Asp) and had an extra N-terminal Gly and 11 substitutions compared with ASTL. SSTL B had an extra N-terminal Gly and 10 residues different from ASTL. The sequence of SSTL B was identical to that of STL, the soft-shelled turtle lysozyme from Trionyx sinensis japonicus. The Ile residue at position 93 of ASTL is the first reported among C-type lysozymes. Furthermore, amino acid substitutions (Phe34His, Arg45Tyr, Thr47Arg, and Arg114Tyr) were also found at subsites E and F when compared with HEWL. The time course using N-acetylglucosamine pentamer as a substrate exhibited a reduction of the rate constant of glycosidic cleavage and an increase of binding free energy for subsites E and F, which demonstrated the contribution of the amino acids mentioned above to substrate binding at subsites E and F. Interestingly, the variable binding free energy values observed for ASTL may result from substitutions outside subsites E and F.

  16. Prediction of Spontaneous Protein Deamidation from Sequence-Derived Secondary Structure and Intrinsic Disorder.

    PubMed

    Lorenzo, J Ramiro; Alonso, Leonardo G; Sánchez, Ignacio E

    2015-01-01

    Asparagine residues in proteins undergo spontaneous deamidation, a post-translational modification that may act as a molecular clock for the regulation of protein function and turnover. Asparagine deamidation is modulated by protein local sequence, secondary structure and hydrogen bonding. We present NGOME, an algorithm able to predict non-enzymatic deamidation of internal asparagine residues in proteins in the absence of structural data, using sequence-based predictions of secondary structure and intrinsic disorder. Compared to previous algorithms, NGOME does not require three-dimensional structures yet yields better predictions than available sequence-only methods. Four case studies of specific proteins show how NGOME may help the user identify deamidation-prone asparagine residues, often related to protein gain of function, protein degradation or protein misfolding in pathological processes. A fifth case study applies NGOME at a proteomic scale and unveils a correlation between asparagine deamidation and protein degradation in yeast. NGOME is freely available as a webserver at the National EMBnet node Argentina, URL: http://www.embnet.qb.fcen.uba.ar/ in the subpage "Protein and nucleic acid structure and sequence analysis".

  17. Prediction of Spontaneous Protein Deamidation from Sequence-Derived Secondary Structure and Intrinsic Disorder

    PubMed Central

    Lorenzo, J. Ramiro; Alonso, Leonardo G.; Sánchez, Ignacio E.

    2015-01-01

    Asparagine residues in proteins undergo spontaneous deamidation, a post-translational modification that may act as a molecular clock for the regulation of protein function and turnover. Asparagine deamidation is modulated by protein local sequence, secondary structure and hydrogen bonding. We present NGOME, an algorithm able to predict non-enzymatic deamidation of internal asparagine residues in proteins in the absence of structural data, using sequence-based predictions of secondary structure and intrinsic disorder. Compared to previous algorithms, NGOME does not require three-dimensional structures yet yields better predictions than available sequence-only methods. Four case studies of specific proteins show how NGOME may help the user identify deamidation-prone asparagine residues, often related to protein gain of function, protein degradation or protein misfolding in pathological processes. A fifth case study applies NGOME at a proteomic scale and unveils a correlation between asparagine deamidation and protein degradation in yeast. NGOME is freely available as a webserver at the National EMBnet node Argentina, URL: http://www.embnet.qb.fcen.uba.ar/ in the subpage “Protein and nucleic acid structure and sequence analysis”. PMID:26674530

  18. Amino-acid sequence of toxin I from Anemonia sulcata.

    PubMed

    Wunderer, G; Eulitz, M

    1978-08-15

    Toxin I from Anemonia sulcata, a major component of the sea anemone venom, consists of 46 amino acid residues which are linked by three disulfide bridges. The [14C]carboxymethylated polypeptide was sequenced to position 29 by automated Edman degradation. The remaining sequence was determined from cyanogen bromide peptides and from tryptic peptides of the citraconylated [14C]carboxymethylated toxin. Toxin I is homologous to toxin II from Anemonia sulcata and to anthopleurin A, a toxin from the sea anemone Anthopleura xanthogrammica. These toxins constitute a new class of polypeptide toxins. No significant homologies exist with toxin III from Anemonia sulcata nor with known sequences of neurotoxins or cardiotoxins of various origin.

  19. Prediction of protein function improving sequence remote alignment search by a fuzzy logic algorithm.

    PubMed

    Gómez, Antonio; Cedano, Juan; Espadaler, Jordi; Hermoso, Antonio; Piñol, Jaume; Querol, Enrique

    2008-02-01

    The functional annotation of new protein sequences represents a major bottleneck for genomic science. The best way to suggest the function of a protein from its sequence is by finding a related one for which biological information is available. Current alignment algorithms display a list of protein sequence stretches presenting significant similarity to different protein targets, ordered by their respective mathematical scores. However, statistical and biological significance do not always coincide; therefore, rearranging the program output according to characteristics that are more biological than the mathematical scoring would help functional annotation. A new method that predicts the putative function of a protein by integrating the results from the PSI-BLAST program with a fuzzy logic algorithm is described. Several protein sequence characteristics have been checked for their ability to rearrange a PSI-BLAST profile in closer accordance with biological function. Four of them (amino acid content, matched segment length, and hydropathic and flexibility profiles) contributed positively, upon being integrated by a fuzzy logic algorithm into a program, BYPASS, to the accurate prediction of the function of a protein from its sequence.

  20. Sequence-specific thermodynamic properties of nucleic acids influence both transcriptional pausing and backtracking in yeast

    PubMed Central

    2017-01-01

    RNA Polymerase II pauses and backtracks during transcription, with many consequences for gene expression and cellular physiology. Here, we show that the energy required to melt double-stranded nucleic acids in the transcription bubble predicts pausing in Saccharomyces cerevisiae far more accurately than nucleosome roadblocks do. In addition, the same energy difference also determines when the RNA polymerase backtracks instead of continuing to move forward. This data-driven model corroborates—in a genome wide and quantitative manner—previous evidence that sequence-dependent thermodynamic features of nucleic acids influence both transcriptional pausing and backtracking. PMID:28301878

  1. Using next generation transcriptome sequencing to predict an ectomycorrhizal metabolome.

    SciTech Connect

    Larsen, P. E.; Sreedasyam, A.; Trivedi, G; Podila, G. K.; Cseke, L. J.; Collart, F. R.

    2011-05-13

    Mycorrhizae, symbiotic interactions between soil fungi and tree roots, are ubiquitous in terrestrial ecosystems. The fungi contribute phosphorus, nitrogen and mobilized nutrients from organic matter in the soil and in return the fungus receives photosynthetically-derived carbohydrates. This union of plant and fungal metabolisms is the mycorrhizal metabolome. Understanding this symbiotic relationship at a molecular level provides important contributions to the understanding of forest ecosystems and global carbon cycling. We generated next generation short-read transcriptomic sequencing data from fully-formed ectomycorrhizae between Laccaria bicolor and aspen (Populus tremuloides) roots. The transcriptomic data was used to identify statistically significantly expressed gene models using a bootstrap-style approach, and these expressed genes were mapped to specific metabolic pathways. Integration of expressed genes that code for metabolic enzymes and the set of expressed membrane transporters generates a predictive model of the ectomycorrhizal metabolome. The generated model of mycorrhizal metabolome predicts that the specific compounds glycine, glutamate, and allantoin are synthesized by L. bicolor and that these compounds or their metabolites may be used for the benefit of aspen in exchange for the photosynthetically-derived sugars fructose and glucose. The analysis illustrates an approach to generate testable biological hypotheses to investigate the complex molecular interactions that drive ectomycorrhizal symbiosis. These models are consistent with experimental environmental data and provide insight into the molecular exchange processes for organisms in this complex ecosystem. The method used here for predicting metabolomic models of mycorrhizal systems from deep RNA sequencing data can be generalized and is broadly applicable to transcriptomic data derived from complex systems.

  2. Using next generation transcriptome sequencing to predict an ectomycorrhizal metabolome

    PubMed Central

    2011-01-01

    Background: Mycorrhizae, symbiotic interactions between soil fungi and tree roots, are ubiquitous in terrestrial ecosystems. The fungi contribute phosphorus, nitrogen and mobilized nutrients from organic matter in the soil and in return the fungus receives photosynthetically-derived carbohydrates. This union of plant and fungal metabolisms is the mycorrhizal metabolome. Understanding this symbiotic relationship at a molecular level provides important contributions to the understanding of forest ecosystems and global carbon cycling. Results: We generated next generation short-read transcriptomic sequencing data from fully-formed ectomycorrhizae between Laccaria bicolor and aspen (Populus tremuloides) roots. The transcriptomic data was used to identify statistically significantly expressed gene models using a bootstrap-style approach, and these expressed genes were mapped to specific metabolic pathways. Integration of expressed genes that code for metabolic enzymes and the set of expressed membrane transporters generates a predictive model of the ectomycorrhizal metabolome. The generated model of mycorrhizal metabolome predicts that the specific compounds glycine, glutamate, and allantoin are synthesized by L. bicolor and that these compounds or their metabolites may be used for the benefit of aspen in exchange for the photosynthetically-derived sugars fructose and glucose. Conclusions: The analysis illustrates an approach to generate testable biological hypotheses to investigate the complex molecular interactions that drive ectomycorrhizal symbiosis. These models are consistent with experimental environmental data and provide insight into the molecular exchange processes for organisms in this complex ecosystem. The method used here for predicting metabolomic models of mycorrhizal systems from deep RNA sequencing data can be generalized and is broadly applicable to transcriptomic data derived from complex systems. PMID:21569493

  3. Quantum-Sequencing: Biophysics of quantum tunneling through nucleic acids

    NASA Astrophysics Data System (ADS)

    Casamada Ribot, Josep; Chatterjee, Anushree; Nagpal, Prashant

    2014-03-01

    Tunneling microscopy and spectroscopy have been used extensively in the physical surface sciences to study quantum tunneling, to measure the electronic local density of states of nanomaterials, and to characterize adsorbed species. Quantum-Sequencing (Q-Seq) is a new method based on tunneling microscopy for electronic sequencing of single molecules of nucleic acids. A major goal of third-generation sequencing technologies is to develop a fast, reliable, enzyme-free single-molecule sequencing method. Here, we present the unique "electronic fingerprints" for all nucleotides on DNA and RNA using Q-Seq, along with their intrinsic biophysical parameters. We have analyzed tunneling spectra for the nucleotides at different pH conditions and analyzed the HOMO, LUMO and energy gap for all of them. In addition, we show a number of biophysical parameters to further characterize all nucleobases (electron and hole transition voltage and energy barriers). These results highlight the robustness of Q-Seq as a technique for next-generation sequencing.

  4. The complementary deoxyribonucleic acid sequence of guinea pig endometrial prorelaxin.

    PubMed

    Lee, Y A; Bryant-Greenwood, G D; Mandel, M; Greenwood, F C

    1992-03-01

    The nucleotide sequence of the relaxin gene transcript in the endometrium of the late pregnant guinea pig has been determined. The strategy used was a combination of polymerase chain reaction (PCR) with primers designed from the mRNA sequence of porcine preprorelaxin, rapid amplification of cDNA ends-PCR, and blunt end cloning in M13 mp18. With heterologous primers, a 226-basepair (bp) segment of the guinea pig relaxin gene sequence was obtained and was used to design a guinea pig-specific primer for use with the rapid amplification of cDNA ends-PCR method. The latter allowed completion of the sequence of 336 bp, with a 96-bp overlap. The sequence obtained shows greater homology at both the nucleotide and amino acid levels with porcine and human relaxins H1 and H2 than with rat relaxin, supporting the thesis that the guinea pig is not a rodent. The transcription of the guinea pig endometrial relaxin gene during pregnancy was confirmed by Northern analysis of guinea pig endometrial tissues with a species-specific cDNA probe. The endometrial relaxin gene is transcribed during pregnancy, but not in lactation, consistent with the observed immunostaining for relaxin.

  5. Improving HIV coreceptor usage prediction in the clinic using hints from next-generation sequencing data

    PubMed Central

    Pfeifer, Nico; Lengauer, Thomas

    2012-01-01

    Motivation: Due to the high mutation rate of human immunodeficiency virus (HIV), drug-resistant variants emerge frequently. Therefore, researchers are constantly searching for new ways to attack the virus. One new class of anti-HIV drugs is the class of coreceptor antagonists that block cell entry by occupying a coreceptor on CD4 cells. This type of drug affects only the subset of HIVs that use the inhibited coreceptor. A good prediction of whether the viral population inside a patient is susceptible to the treatment is hence very important for therapy decisions and a prerequisite to administering the respective drug. The first prediction models were based on data from Sanger sequencing of the V3 loop of HIV. Recently, a method based on next-generation sequencing (NGS) data was introduced that predicts labels for each read separately and decides on the patient label through a percentage threshold for the resistant viral minority. Results: We model the prediction problem on the patient level taking the information of all reads from NGS data jointly into account. This enables us to improve prediction performance for NGS data, but we can also use the trained model to improve predictions based on Sanger sequencing data. Therefore, laboratories without NGS capabilities can also benefit from the improvements. Furthermore, we show which amino acids at which position are important for prediction success, giving clues on how the interaction mechanism between the V3 loop and the particular coreceptors might be influenced. Availability: A webserver is available at http://coreceptor.bioinf.mpi-inf.mpg.de. Contact: nico.pfeifer@mpi-inf.mpg.de PMID:22962486

  6. TANGLE: Two-Level Support Vector Regression Approach for Protein Backbone Torsion Angle Prediction from Primary Sequences

    PubMed Central

    Song, Jiangning; Tan, Hao; Wang, Mingjun; Webb, Geoffrey I.; Akutsu, Tatsuya

    2012-01-01

    Protein backbone torsion angles (Phi) and (Psi) involve two rotation angles rotating around the Cα-N bond (Phi) and the Cα-C bond (Psi). Due to the planarity of the linked rigid peptide bonds, these two angles can essentially determine the backbone geometry of proteins. Accordingly, the accurate prediction of protein backbone torsion angle from sequence information can assist the prediction of protein structures. In this study, we develop a new approach called TANGLE (Torsion ANGLE predictor) to predict the protein backbone torsion angles from amino acid sequences. TANGLE uses a two-level support vector regression approach to perform real-value torsion angle prediction using a variety of features derived from amino acid sequences, including the evolutionary profiles in the form of position-specific scoring matrices, predicted secondary structure, solvent accessibility and natively disordered region as well as other global sequence features. When evaluated based on a large benchmark dataset of 1,526 non-homologous proteins, the mean absolute errors (MAEs) of the Phi and Psi angle prediction are 27.8° and 44.6°, respectively, which are 1% and 3% respectively lower than that using one of the state-of-the-art prediction tools ANGLOR. Moreover, the prediction of TANGLE is significantly better than a random predictor that was built on the amino acid-specific basis, with the p-value<1.46e-147 and 7.97e-150, respectively by the Wilcoxon signed rank test. As a complementary approach to the current torsion angle prediction algorithms, TANGLE should prove useful in predicting protein structural properties and assisting protein fold recognition by applying the predicted torsion angles as useful restraints. TANGLE is freely accessible at http://sunflower.kuicr.kyoto-u.ac.jp/~sjn/TANGLE/. PMID:22319565
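
    The mean absolute error (MAE) quoted above is an average of per-residue angle differences. Because torsion angles are periodic, a natural way to compute such an error is to take the shorter way around the circle; whether TANGLE's evaluation wraps angles this way is an assumption here, and the function names below are hypothetical. A minimal sketch:

```python
def angular_error(pred_deg, true_deg):
    """Smallest absolute difference between two angles, in degrees (0 to 180)."""
    diff = abs(pred_deg - true_deg) % 360.0
    # Go the short way around the circle (assumed convention, not from the abstract).
    return min(diff, 360.0 - diff)


def mean_absolute_error(predicted, observed):
    """Average angular error over paired lists of predicted and observed angles."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    total = sum(angular_error(p, t) for p, t in zip(predicted, observed))
    return total / len(predicted)


if __name__ == "__main__":
    # Toy values for illustration only; these are not data from the study.
    print(mean_absolute_error([170.0, -60.0], [-170.0, -45.0]))
```

    Without the wrap-around step, a prediction of 170° against an observed -170° would count as a 340° error rather than 20°.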

  7. Interrogating noise in protein sequences from the perspective of protein-protein interactions prediction.

    PubMed

    Wang, Yongcui; Ren, Xianwen; Zhang, Chunhua; Deng, Naiyang; Zhang, Xiangsun

    2012-12-21

    The past decades witnessed extensive efforts to study the relationship among proteins. Particularly, sequence-based protein-protein interactions (PPIs) prediction is fundamentally important in speeding up the process of mapping interactomes of organisms. High-throughput experimental methodologies have made the PPIs of many model organisms known, which allows us to apply machine learning methods to learn understandable rules from the available PPIs. Under the machine learning framework, composition vectors are usually applied to encode proteins as real-value vectors. However, the composition vector value might be highly correlated with the distribution of amino acids, i.e., amino acids which are frequently observed in nature tend to have large composition vector values. Thus, a formulation that estimates the noise induced by the background distribution of amino acids may be needed when building these representations. Here, we introduce two kinds of denoising composition vectors, which were successfully used in the construction of phylogenetic trees, to eliminate the noise. When validating these two denoising composition vectors on Escherichia coli (E. coli), Saccharomyces cerevisiae (S. cerevisiae) and human PPIs datasets, surprisingly, the predictive performance is not improved, and is even worse than the non-denoised prediction. These results suggest that the noise in phylogenetic tree construction may be valuable information in PPIs prediction.
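
    To make the idea of a composition vector concrete, the sketch below encodes a protein as the relative frequency of each of the 20 standard amino acids and then expresses each frequency as a relative deviation from a background frequency. The background-correction formula and all names are illustrative assumptions; the abstract does not give the two denoising formulations the authors actually used.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def composition_vector(sequence):
    """Relative frequency of each standard amino acid in a protein sequence."""
    counts = Counter(sequence)
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]


def background_corrected(vector, background):
    """Relative deviation of each frequency from a background frequency.

    Illustrative correction only; not the denoising formula from the paper.
    """
    return [(f - b) / b for f, b in zip(vector, background)]


if __name__ == "__main__":
    vec = composition_vector("AACD")       # toy peptide, not real data
    uniform = [1.0 / 20] * 20              # hypothetical uniform background
    print(background_corrected(vec, uniform)[:3])
```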

  8. Molecular cloning and amino acid sequence of human 5-lipoxygenase

    SciTech Connect

    Matsumoto, T.; Funk, C.D.; Radmark, O.; Hoeoeg, J.O.; Joernvall, H.; Samuelsson, B.

    1988-01-01

    5-Lipoxygenase (EC 1.13.11.34), a Ca2+- and ATP-requiring enzyme, catalyzes the first two steps in the biosynthesis of the peptidoleukotrienes and the chemotactic factor leukotriene B4. A cDNA clone corresponding to 5-lipoxygenase was isolated from a human lung lambda gt11 expression library by immunoscreening with a polyclonal antibody. Additional clones from a human placenta lambda gt11 cDNA library were obtained by plaque hybridization with the 32P-labeled lung cDNA clone. Sequence data obtained from several overlapping clones indicate that the composite DNAs contain the complete coding region for the enzyme. From the deduced primary structure, 5-lipoxygenase encodes a 673 amino acid protein with a calculated molecular weight of 77,839. Direct analysis of the native protein and its proteolytic fragments confirmed the deduced composition, the amino-terminal amino acid sequence, and the structure of many internal segments. 5-Lipoxygenase has no apparent sequence homology with leukotriene A4 hydrolase or Ca2+-binding proteins. RNA blot analysis indicated substantial amounts of an mRNA species of approximately 2700 nucleotides in leukocytes, lung, and placenta.

  9. Nucleic acid sequence detection using multiplexed oligonucleotide PCR

    DOEpatents

    Nolan, John P.; White, P. Scott

    2006-12-26

    Methods for rapidly detecting single or multiple sequence alleles in a sample nucleic acid are described. Provided are all of the oligonucleotide pairs capable of annealing specifically to a target allele and discriminating among possible sequences thereof, and ligating to each other to form an oligonucleotide complex when a particular sequence feature is present (or, alternatively, absent) in the sample nucleic acid. The design of each oligonucleotide pair permits the subsequent high-level PCR amplification of a specific amplicon when the oligonucleotide complex is formed, but not when the oligonucleotide complex is not formed. The presence or absence of the specific amplicon is used to detect the allele. Detection of the specific amplicon may be achieved using a variety of methods well known in the art, including without limitation, oligonucleotide capture onto DNA chips or microarrays, oligonucleotide capture onto beads or microspheres, electrophoresis, and mass spectrometry. Various labels and address-capture tags may be employed in the amplicon detection step of multiplexed assays, as further described herein.

  10. Predicting folic acid intake among college students.

    PubMed

    Lane, Susan H; Hines, Annette; Krowchuk, Heidi

    2015-01-01

    Annually in the United States, approximately 3,000 babies are born with neural tube defects (NTDs). Folic acid supplementation can reduce NTDs by 50% to 70%. Despite recommendations for folic acid intake, only 30% of women ages 18 to 24 report folic acid supplementation and 6% have knowledge of when to take folic acid. There is little information regarding lifestyle factors that correlate with consuming folic acid. The purpose was to describe folic acid consumption among college students; and explore the relationship between folic acid intake and the variables of: age, gender, year in college, alcohol and tobacco use, and vitamin supplement intake. This was a descriptive study with secondary analysis of data from 1,921 college-aged student participants in North Carolina who took part in a pretest/posttest-designed intervention to increase folic acid consumption and knowledge. Surveys included demographic, lifestyle, folic acid knowledge, and consumption questions adapted from the Centers for Disease Control and Prevention questionnaire. Quantitative analyses included descriptive statistics and logistic regression. Of the 1,921 college students, 83.3% reported taking a vitamin supplement, but only 47.6% stated that the vitamin contained folic acid. A relationship was found between age, year in school, gender, and vitamin intake. Lifestyle variables were not significant predictors of folic acid consumption. Identification of variables associated with folic acid intake, marketing, and education can be focused to increase supplementation levels, and ultimately reduce the number of NTDs.

  11. The amino acid sequence of chymopapain from Carica papaya.

    PubMed Central

    Watson, D C; Yaguchi, M; Lynn, K R

    1990-01-01

    Chymopapain is a polypeptide of 218 amino acid residues. It has considerable structural similarity with papain and papaya proteinase omega, including conservation of the catalytic site and of the disulphide bonding. Chymopapain is like papaya proteinase omega in carrying four extra residues between papain positions 168 and 169, but differs from both papaya proteinases in the composition of its S2 subsite, as well as in having a second thiol group, Cys-117. Some evidence for the amino acid sequence of chymopapain has been deposited as Supplementary Publication SUP 50153 (12 pages) at the British Library Document Supply Centre, Boston Spa., Wetherby, West Yorkshire LS23 7BQ, U.K., from whom copies may be obtained on the terms indicated in Biochem. J. (1990) 265, 5. The information comprises Supplement Tables 1-4, which contain, in order, amino acid compositions of peptides from tryptic, peptic, CNBr and mild acid cleavages, Supplement Fig. 1, showing re-fractionation of selected peaks from Fig. 2 of the main paper, Supplement Fig. 2, showing cation-exchange chromatography of the earliest-eluted peak of Fig. 3 of the main paper, Supplement Fig. 3, showing reverse-phase h.p.l.c. of the later-eluted peak from Fig. 3 of the main paper, and Supplement Fig. 4, showing the separation of peptides after mild acid hydrolysis of CNBr-cleavage fragment CB3. PMID:2106878

  12. Prediction of substrate specificity in NS3/4A serine protease by biased sequence search threading.

    PubMed

    Ozdemir Isik, Gonca; Ozer, A Nevra

    2017-04-01

    Proteases recognize specific substrate sequences and catalyze the hydrolysis of targeted peptide bonds to activate or degrade them. It is particularly important to identify the recognition and binding mechanisms of protease-substrate complex structures in studies of drug development. Cleavage specificity in protease systems is generally determined by the amino acid profile, structural features, and distinct molecular interactions. In this work, substrate variability and substrate specificity of the NS3/4A serine protease encoded by the hepatitis C virus (HCV) was investigated by the biased sequence search threading (BSST) methodology. The available crystal structures of peptide-bound protease were used as templates as well as new complex structures that were generated via docking calculations. Threading various binding and nonbinding sequences as starting sequences over multiple templates, the potential sequence space was efficiently explored by a low-resolution knowledge-based scoring potential. The low-energy substrate sequences generated by the biased search are correlated with the natural substrates with conserved amino acid preferences, although some positions exhibit variability. Specifically, the amino acids which play essential roles in cleavage are mostly preferred. Potential substrate sequences were predicted by statistical probability approaches that consider the pairwise and triplewise interdependencies among residue positions in the low-energy sequences. The predicted substrate sequences also reproduce most of the natural substrate sequences, implying the complex interdependence between the different substrate residues. Consequently, the BSST seems to provide a powerful methodology for predicting the substrate specificity for the NS3/4A protease, which is a target in drug discovery studies for HCV.

  13. A machine-learning approach for predicting palmitoylation sites from integrated sequence-based features.

    PubMed

    Li, Liqi; Luo, Qifa; Xiao, Weidong; Li, Jinhui; Zhou, Shiwen; Li, Yongsheng; Zheng, Xiaoqi; Yang, Hua

    2017-02-01

    Palmitoylation is the covalent attachment of lipids to amino acid residues in proteins. As an important form of protein posttranslational modification, it increases the hydrophobicity of proteins, which contributes to protein transportation, organelle localization, and function, and therefore plays an important role in a variety of cell biological processes. Identification of palmitoylation sites is necessary for understanding protein-protein interaction, protein stability, and activity. Since conventional experimental techniques to determine palmitoylation sites in proteins are both labor intensive and costly, a fast and accurate computational approach to predict palmitoylation sites from protein sequences is urgently needed. In this study, a support vector machine (SVM)-based method was proposed through integrating PSI-BLAST profile, physicochemical properties, [Formula: see text]-mer amino acid compositions (AACs), and [Formula: see text]-mer pseudo AACs into the principal feature vector. A recursive feature selection scheme was subsequently implemented to single out the most discriminative features. Finally, an SVM method was implemented to predict palmitoylation sites in proteins based on the optimal features. The proposed method achieved an accuracy of 99.41% and Matthews Correlation Coefficient of 0.9773 for a benchmark dataset. The result indicates the efficiency and accuracy of our method in prediction of palmitoylation sites based on protein sequences.

  14. MicroRNA target prediction using thermodynamic and sequence curves.

    PubMed

    Ghoshal, Asish; Shankar, Raghavendran; Bagchi, Saurabh; Grama, Ananth; Chaterji, Somali

    2015-11-25

    MicroRNAs (miRNAs) are small regulatory RNAs that mediate RNA interference by binding to various mRNA target regions. There have been several computational methods for the identification of target mRNAs for miRNAs. However, these have considered all contributory features as scalar representations, primarily, as thermodynamic or sequence-based features. Further, a majority of these methods solely target canonical sites, which are sites with "seed" complementarity. Here, we present a machine-learning classification scheme, titled Avishkar, which captures the spatial profile of miRNA-mRNA interactions via smooth B-spline curves, separately for various input features, such as thermodynamic and sequence features. Further, we use a principled approach to uniformly model canonical and non-canonical seed matches, using a novel seed enrichment metric. We demonstrate that a large number of seed-match patterns have high enrichment values, conserved across species, and that the majority of miRNA binding sites involve non-canonical matches, corroborating recent findings. Using spatial curves and popular categorical features, such as target site length and location, we train a linear SVM model, utilizing experimental CLIP-seq data. Our model significantly outperforms all established methods, for both canonical and non-canonical sites. We achieve this while using a much larger candidate miRNA-mRNA interaction set than prior work. We have developed an efficient SVM-based model for miRNA target prediction using recent CLIP-seq data, demonstrating superior performance, evaluated using ROC curves, specifically about 20% better than the state-of-the-art, for different species (human or mouse), or different target types (canonical or non-canonical). To the best of our knowledge we provide the first distributed framework for microRNA target prediction based on Apache Hadoop and Spark. All source code and data are publicly available at https://bitbucket.org/cellsandmachines/avishkar.

  15. Optimal coding of vectorcardiographic sequences using spatial prediction.

    PubMed

    Augustyniak, Piotr

    2007-05-01

    This paper discusses principles, implementation details, and advantages of a sequence coding algorithm applied to the compression of vectorcardiograms (VCG). The main novelty of the proposed method is the automatic management of distortion distribution controlled by the local signal contents in both technical and medical aspects. As in clinical practice, the VCG loops representing P, QRS, and T waves in the three-dimensional (3-D) space are considered here as three simultaneous sequences of objects. Because of the similarity of neighboring loops, encoding the values of prediction error significantly reduces the data set volume. The residual values are de-correlated with the discrete cosine transform (DCT) and truncated at a certain energy threshold. The presented method is based on the irregular temporal distribution of medical data in the signal and takes advantage of variable sampling frequency for automatically detected VCG loops. The features of the proposed algorithm are confirmed by the results of the numerical experiment carried out for a wide range of real records. The average data reduction ratio reaches a value of 8.15 while the percent root-mean-square difference (PRD) distortion ratio for the most important sections of signal does not exceed 1.1%.
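
    The two figures of merit quoted above can be illustrated with a short sketch. It assumes the commonly used PRD definition, 100 × sqrt(Σ(x − x̂)² / Σx²), without baseline subtraction; the abstract does not say which PRD variant the author used, and the function names are hypothetical.

```python
import math


def prd(original, reconstructed):
    """Percent root-mean-square difference between a signal and its reconstruction.

    Uses the plain (non-mean-subtracted) definition, which is an assumption here.
    """
    if len(original) != len(reconstructed) or not original:
        raise ValueError("need two non-empty signals of equal length")
    error_energy = sum((x - y) ** 2 for x, y in zip(original, reconstructed))
    signal_energy = sum(x ** 2 for x in original)
    return 100.0 * math.sqrt(error_energy / signal_energy)


def data_reduction_ratio(original_size, compressed_size):
    """Size of the original record divided by the size of the compressed record."""
    return original_size / compressed_size


if __name__ == "__main__":
    # Toy samples for illustration only; these are not values from the study.
    print(prd([3.0, 4.0], [3.0, 3.0]))
    print(data_reduction_ratio(8150, 1000))
```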

  16. The amino acid sequence of rabbit cardiac troponin I.

    PubMed Central

    Grand, R J; Wilkinson, J M

    1976-01-01

    The complete amino acid sequence of troponin I from rabbit cardiac muscle was determined by the isolation of four unique CNBr fragments, together with overlapping tryptic peptides containing radioactive methionine residues. Overlap data for residues 35-36, 93-94 and 140-145 are incomplete, the sequence at these positions being based on homology with the sequence of the fast-skeletal-muscle protein. Cardiac troponin I is a single polypeptide chain of 206 residues with mol.wt. 23550 and an extinction coefficient, E(1%, 1 cm) at 280 nm, of 4.37. The protein has a net positive charge of 14 and is thus somewhat more basic than troponin I from fast-skeletal muscle. Comparison of the sequences of troponin I from cardiac and fast skeletal muscle shows that the cardiac protein has 26 extra residues at the N-terminus which account for the larger size of the protein. In the remainder of the sequence there is a considerable degree of homology, this being greater in the C-terminal two-thirds of the molecule. The region in the cardiac protein corresponding to the peptide with inhibitory activity from the fast-skeletal-muscle protein is very similar and it seems unlikely that this is the cause of the difference in inhibitory activity between the two proteins. The region responsible for binding troponin C, however, possesses a lower degree of homology. Detailed evidence on which the sequence is based has been deposited as Supplementary Publication SUP 50072 (20 pages), at the British Library Lending Division, Boston Spa, Wetherby, West Yorkshire LS23 7QB, U.K., from whom copies may be obtained on the terms given in Biochem. J. (1976) 153, 5. PMID:1008822

  17. Nucleotide and deduced amino acid sequences of rat myosin binding protein H (MyBP-H).

    PubMed

    Jung, J; Oh, J; Lee, K

    1998-12-01

    The complete nucleotide sequence of the cDNA clone encoding rat skeletal muscle myosin-binding protein H (MyBP-H) was determined and the amino acid sequence was deduced from the nucleotide sequence (GenBank accession number AF077338). The full-length cDNA of 1782 base pairs (bp) contains a single open reading frame of 1454 bp encoding a rat MyBP-H protein with a predicted molecular mass of 52.7 kDa and includes the common consensus 'CA__TG' protein binding motif. The cDNA sequence of rat MyBP-H shows 92%, 84% and 41% homology with those of mouse, human and chicken, respectively. The protein contains a tandem array of internal motifs (-FN III-Ig C2-FN III-Ig C2-) in the C-terminal region which resemble the immunoglobulin superfamily C2 and fibronectin type III motifs. The amino acid sequence of the C-terminal Ig C2 was highly conserved among the MyBP family and other thick filament binding proteins, suggesting that the C-terminal Ig C2 might play an important role in its function. All MyBP-H proteins contain the 'RKPS' sequence, which is assumed to be a cAMP- and cGMP-dependent protein kinase A phosphorylation site. Computer analysis of the primary sequence of rat MyBP-H predicted 11 protein kinase C (PKC) phosphorylation sites, 7 casein kinase II (CK2) phosphorylation sites and 4 N-myristoylation sites.

  18. Amino acid sequence of a mouse immunoglobulin mu chain.

    PubMed Central

    Kehry, M; Sibley, C; Fuhrman, J; Schilling, J; Hood, L E

    1979-01-01

    The complete amino acid sequence of the mouse mu chain from the BALB/c myeloma tumor MOPC 104E is reported. The C mu region contains four consecutive homology regions of approximately 110 residues and a COOH-terminal region of 19 residues. A comparison of this mu chain from mouse with a complete mu sequence from human (Ou) and a partial mu chain sequence from dog (Moo) reveals a striking gradient of increasing homology from the NH2-terminal to the COOH-terminal portion of these mu chains, with the former being the least and the latter the most highly conserved. Four of the five sites of carbohydrate attachment appear to be at identical residue positions when the constant regions of the mouse and human mu chains are compared. The mu chain of MOPC 104E has a carbohydrate moiety attached in the second hypervariable region. This is particularly interesting in view of the fact that MOPC 104E binds alpha-(1→3)-dextran, a simple carbohydrate. The structural and functional constraints imposed by these comparative sequence analyses are discussed. PMID:111247

  19. Bacteriorhodopsin: partial sequence of mRNA provides amino acid sequence in the precursor region.

    PubMed Central

    Chang, S H; Majumdar, A; Dunn, R; Makabe, O; RajBhandary, U L; Khorana, H G; Ohtsuka, E; Tanaka, T; Taniyama, Y O; Ikehara, M

    1981-01-01

    mRNA for bacteriorhodopsin from Halobacterium halobium has been partially purified. By using this mRNA as template in the presence of reverse transcriptase (RNA-dependent DNA nucleotidyltransferase) and a 5'-[32P] synthetic oligodeoxyribonucleotide corresponding to amino acids 9-12 of bacteriorhodopsin as primer, we have isolated the major 5'-[32P]cDNA product, approximately 80 nucleotides long, and determined its sequence. Based on the cDNA sequence, the 5'-proximal sequence of bacteriorhodopsin mRNA is G-C-A-U-G-U-U-G-G-A-G-U-U-A-U-U-G-C-C-A-A-C-A-G-C-A-G-U-G-G-A-G-G-G-G-G-U-A-U-C-G-C-A-G-G-C-C-C-A-G-A-U-C-A-C-C-G-G-A-C-G-U-C-C-G. This includes the expected sequence for amino acids 1-8 and shows that bacteriorhodopsin is synthesized as a precursor that is at least 13 amino acids longer (Met-Leu-Glu-Leu-Leu-Pro-Thr-Ala-Val-Glu-Gly-Val-Ser) at the NH2 terminus. Agarose/urea gel electrophoresis of the partially purified mRNA showed several bands; of these, a major one hybridized with 5'-[32P]cDNA. These results suggest that the bacteriorhodopsin mRNA in the partially purified preparation is homogeneous in size and that it constitutes a substantial portion of the RNA preparation subjected to electrophoresis. PMID:6943548
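
    The reading of the 5'-proximal mRNA sequence given above can be checked by translating it from the first AUG. The sketch below uses the standard genetic code and hypothetical names; it is an illustration of the reasoning, not the authors' procedure. Under these assumptions the first 13 codons give the precursor extension Met-Leu-Glu-Leu-Leu-Pro-Thr-Ala-Val-Glu-Gly-Val-Ser, and the next 8 codons give residues 1-8 of the mature protein.

```python
BASES = "UCAG"
# Standard genetic code in UCAG order; "*" marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# 5'-proximal mRNA sequence from the abstract, spaced by codon after the leading GC.
MRNA_5PRIME = (
    "GC"
    "AUG UUG GAG UUA UUG CCA ACA GCA GUG GAG GGG GUA UCG "
    "CAG GCC CAG AUC ACC GGA CGU CCG"
).replace(" ", "")


def translate_from_first_aug(mrna):
    """Translate an mRNA string in the frame set by its first AUG, up to a stop."""
    start = mrna.find("AUG")
    if start < 0:
        raise ValueError("no AUG start codon found")
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


if __name__ == "__main__":
    peptide = translate_from_first_aug(MRNA_5PRIME)
    print(peptide[:13], peptide[13:])  # precursor extension, then mature residues 1-8
```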

  20. Relationship between peptide amino acid sequence and membrane curvature generation

    NASA Astrophysics Data System (ADS)

    Schmidt, Nathan; Kuo, David; Hwee Lai, Ghee; Mishra, Abhijit; Wong, Gerard

    2012-02-01

    Amphipathic peptides and amphipathic domains in proteins can perturb and restructure biological membranes. For example, it is believed that the cationic, amphipathic motif found in membrane active antimicrobial peptides (AMPs) is responsible for their membrane disruption mechanisms of action. And ApoA-I, the main apolipoprotein in high density lipoprotein contains a series of amphipathic α-helical repeats which are responsible for its lipid associating properties. We use small angle x-ray scattering (SAXS) to investigate the interaction of model cell membranes with prototypical AMPs and consensus peptides derived from the helical structural motif of ApoA-I. The relationship between peptide sequence and the peptide-induced changes in membrane curvature and topology is examined. By comparing the membrane rearrangement and corresponding phase behavior induced by these two distinct classes of membrane restructuring peptides we will discuss the role of amino acid sequence on membrane curvature generation.

  1. Prediction of transcriptional regulatory sites in the complete genome sequence of Escherichia coli K-12.

    PubMed

    Thieffry, D; Salgado, H; Huerta, A M; Collado-Vides, J

    1998-06-01

    As one of the best-characterized free-living organisms, Escherichia coli and its recently completed genomic sequence offer a special opportunity to exploit systematically the variety of regulatory data available in the literature in order to make a comprehensive set of regulatory predictions in the whole genome. The complete genome sequence of E.coli was analyzed for the binding of transcriptional regulators upstream of coding sequences. The biological information contained in RegulonDB (Huerta, A.M. et al., Nucleic Acids Res.,26,55-60, 1998) for 56 different transcriptional proteins supported the implementation of a stringent strategy combining string search and weight matrices. We estimate that our search included representatives of 15-25% of the total number of regulatory binding proteins in E.coli. This search was performed on the set of 4288 putative regulatory regions, each 450 bp long. Within the regions with predicted sites, 89% are regulated by one protein and 81% involve only one site. These numbers are reasonably consistent with the distribution of experimental regulatory sites. Regulatory sites are found in 603 regions corresponding to 16% of operon regions and 10% of intra-operonic regions. Additional evidence gives stronger support to some of these predictions, including the position of the site, biological consistency with the function of the downstream gene, as well as genetic evidence for the regulatory interaction. The predictions described here were incorporated into the map presented in the paper describing the complete E.coli genome (Blattner,F.R. et al., Science, 277, 1453-1461, 1997). The complete set of predictions in GenBank format is available at the url: http://www.cifn.unam.mx/Computational_Biology/E.coli-predictions. Contact: ecoli-reg@cifn.unam.mx, collado@cifn.unam.mx
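
    The weight-matrix part of such a search can be sketched as follows: build a log-odds matrix from a set of aligned known sites, then slide it along an upstream region and report windows scoring above a threshold. The example sites, pseudocount, uniform background, and threshold are all illustrative assumptions; the abstract does not describe the matrices or cutoffs the authors used.

```python
import math


def build_log_odds(sites, pseudocount=1.0, background=0.25):
    """Position-specific log2-odds matrix from equal-length aligned DNA sites."""
    width = len(sites[0])
    total = len(sites) + 4 * pseudocount
    matrix = []
    for pos in range(width):
        column = [site[pos] for site in sites]
        matrix.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in "ACGT"
        })
    return matrix


def scan(sequence, matrix, threshold):
    """Return (start, window, score) for every window scoring at least threshold.

    Assumes the sequence contains only A, C, G and T.
    """
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


if __name__ == "__main__":
    # Hypothetical aligned sites and upstream region, for illustration only.
    toy_sites = ["TGTGA", "TGTGA", "TGTGA", "TTTGA"]
    pwm = build_log_odds(toy_sites)
    for start, window, score in scan("AAATGTGACCC", pwm, threshold=5.0):
        print(start, window, round(score, 2))
```

    A string search, the other half of the strategy named in the abstract, would correspond to an exact or near-exact match against known site sequences; the matrix scan generalizes this by tolerating substitutions at weakly conserved positions.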

  2. Nucleotide sequences of the Pseudomonas savastanoi indoleacetic acid genes show homology with Agrobacterium tumefaciens T-DNA

    PubMed Central

    Yamada, Tetsuji; Palm, Curtis J.; Brooks, Bob; Kosuge, Tsune

    1985-01-01

    We report the nucleotide sequences of iaaM and iaaH, the genetic determinants for, respectively, tryptophan 2-monooxygenase and indoleacetamide hydrolase, the enzymes that catalyze the conversion of L-tryptophan to indoleacetic acid in the tumor-forming bacterium Pseudomonas syringae pv. savastanoi. The sequence analysis indicates that the iaaM locus contains an open reading frame encoding 557 amino acids that would comprise a protein with a molecular weight of 61,783; the iaaH locus contains an open reading frame of 455 amino acids that would comprise a protein with a molecular weight of 48,515. Significant amino acid sequence homology was found between the predicted sequence of the tryptophan monooxygenase of P. savastanoi and the deduced product of the T-DNA tms-1 gene of the octopine-type plasmid pTiA6NC from Agrobacterium tumefaciens. Strong homology was found in the 25 amino acid sequence in the putative FAD-binding region of tryptophan monooxygenase. Homology was also found in the amino acid sequences representing the central regions of the putative products of iaaH and tms-2 T-DNA. The results suggest a strong similarity in the pathways for indoleacetic acid synthesis encoded by genes in P. savastanoi and in A. tumefaciens T-DNA. PMID:16593610

  3. Genetic Prediction in the Genetic Analysis Workshop 18 Sequencing Data

    PubMed Central

    Ziegler, Andreas; Bohossian, Nora; Diego, Vincent P.; Yao, Chen

    2015-01-01

    High-throughput sequencing data can be used to predict phenotypes from genotypes, and this corresponds to establishing a prognostic model. In extended pedigrees the relatedness of subjects provides additional information so that genetic values, fixed or random genetic components, and heritability can be estimated. At the Genetic Analysis Workshop 18 the working group on genetic prediction dealt with both establishing a prognostic model and, in one contribution, comparing standard logistic regression with robust logistic regression in a sample of unrelated affected or unaffected individuals. Results of both logistic regression approaches were similar. All other contributions to this group used extended family data, in general using the quantitative trait blood pressure. The individual contributions varied in several important aspects, such as the estimation of the kinship matrix and the estimation method. Contributors chose various approaches for model validation, including different versions of cross-validation or within-family validation. Within-family validation included model building in the upper generations and validation in later generations. The choice of the statistical model and the computational algorithm had substantial effects on computation time. If decorrelation approaches were applied, the computational burden was substantially reduced. Some software packages estimated negative eigenvalues, although eigenvalues of correlation matrices should be nonnegative. Most statistical models and software packages have been developed for experimental crosses and planned breeding programs. With their specialized pedigree structures, they are not sufficiently flexible to accommodate the variability of human pedigrees in general, and improved implementations are required. PMID:25112190

  4. A Monte Carlo sampling method of amino acid sequences adaptable to given main-chain atoms in the proteins.

    PubMed

    Ogata, Koji; Soejima, Kenji; Higo, Junichi

    2006-10-01

    We have developed a computational method of protein design to detect amino acid sequences that are adaptable to given main-chain coordinates of a protein. In this method, the selection of amino acid types employs a Metropolis Monte Carlo method with a scoring function in conjunction with the approximation of free energies computed from 3D structures. To compute the scoring function, a side-chain prediction using another Metropolis Monte Carlo method was performed to select structurally suitable side-chain conformations from a side-chain library. In total, two layers of Monte Carlo procedures were performed, first to select amino acid types (1st layer Monte Carlo) and then to predict side-chain conformations (2nd layer Monte Carlo). We applied this method to sequence design for the entire sequence of the SH3 domain, Protein G, and BPTI. The predicted sequences were similar to those of the wild-type proteins. We compared the results of the predictions with and without the 2nd layer Monte Carlo method. The results revealed that the two-layer Monte Carlo method produced better sequence similarity to the wild-type proteins than the one-layer method. Finally, we applied this method to neuraminidase of influenza virus. The results were consistent with the sequences identified from the isolated viruses.
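
    The outer (amino-acid-type) layer of a Metropolis Monte Carlo search can be sketched as below. This is only an illustration of the acceptance rule: the scoring function `toy_score`, the burial mask, the temperature, and the step count are hypothetical, and the paper's structure-based free-energy score and inner side-chain layer are not reproduced.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("AFILMVW")


def toy_score(sequence, buried):
    """Hypothetical energy (lower is better): one penalty unit for each polar
    residue at a buried position or hydrophobic residue at an exposed one."""
    return sum(
        1.0 for residue, is_buried in zip(sequence, buried)
        if (residue in HYDROPHOBIC) != is_buried
    )


def metropolis_design(buried, steps=2000, temperature=0.5, seed=0):
    rng = random.Random(seed)
    current = [rng.choice(AMINO_ACIDS) for _ in buried]
    current_energy = toy_score(current, buried)
    best, best_energy = list(current), current_energy
    for _ in range(steps):
        position = rng.randrange(len(current))
        old = current[position]
        current[position] = rng.choice(AMINO_ACIDS)  # propose a point mutation
        delta = toy_score(current, buried) - current_energy
        # Metropolis criterion: always accept downhill moves, and accept
        # uphill moves with probability exp(-delta / temperature).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current_energy += delta
            if current_energy < best_energy:
                best, best_energy = list(current), current_energy
        else:
            current[position] = old  # reject: restore the previous residue
    return "".join(best), best_energy


buried_mask = [True, True, False, False, True, False, True, False]
designed, energy = metropolis_design(buried_mask)
```

    In a two-layer scheme, the call to the scoring function would itself run a second Monte Carlo search over side-chain conformations before returning an energy.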

  5. Ultrasensitive nucleic acid sequence detection by single-molecule electrophoresis

    SciTech Connect

    Castro, A; Shera, E.B.

    1996-09-01

    This is the final report of a one-year laboratory-directed research and development project at Los Alamos National Laboratory. There has been considerable interest in the development of very sensitive clinical diagnostic techniques over the last few years. Many pathogenic agents are often present in extremely small concentrations in clinical samples, especially at the initial stages of infection, making their detection very difficult. This project sought to develop a new technique for the detection and accurate quantification of specific bacterial and viral nucleic acid sequences in clinical samples. The scheme involved the use of novel hybridization probes for the detection of nucleic acids combined with our recently developed technique of single-molecule electrophoresis. This project is directly relevant to the DOE's Defense Programs strategic directions in the area of biological warfare counter-proliferation.

  6. Deduced amino acid sequence of human pulmonary surfactant proteolipid: SPL(pVal)

    SciTech Connect

    Whitsett, J.A.; Glasser, S.W.; Korfhagen, T.R.; Weaver, T.E.; Clark, J.; Pilot-Matias, T.; Meuth, J.; Fox, J.L.

    1987-05-01

    A hydrophobic, proteolipid-like protein of Mr 6500 was isolated from ether/ethanol extracts of human, canine and bovine pulmonary surfactant. The amino acid composition of the protein demonstrated a remarkable abundance of hydrophobic residues, particularly valine and leucine. The N-terminal amino acid sequence of the human protein was determined: N-Leu-Ile-Pro-Cys-Cys-Pro-Val-Asn-Leu-Lys-Arg-Leu-Leu-Ile-Val4... An oligonucleotide probe was used to screen an adult human lung cDNA library and detected cDNA clones whose predicted amino acid sequence closely matched the N-terminal amino acid sequence of the human peptide. SPL(pVal) was found within the reading frame of a larger peptide. SPL(pVal) results from proteolytic processing of a larger preprotein. Northern blot analysis detected a single 1.0 kilobase SPL(pVal) RNA, which was less abundant in fetal than in adult lung. Mixtures of purified canine and bovine SPL(pVal) and synthetic phospholipids display properties of rapid adsorption and surface tension lowering activity characteristic of surfactant. Human SPL(pVal) is a pulmonary surfactant proteolipid which may therefore be useful in combination with phospholipids and/or other surfactant proteins for the treatment of surfactant deficiency such as hyaline membrane disease in newborn infants.

  7. Novel Numerical Characterization of Protein Sequences Based on Individual Amino Acid and Its Application

    PubMed Central

    Zhang, Yan-ping; Sheng, Ya-jun; He, Ping-an; Ruan, Ji-shuo

    2015-01-01

    The hydrophobicity and hydrophilicity of amino acids play a very important role in protein folding and its interaction with the environment and other molecules, as well as its catalytic mechanism. Based on the two physicochemical indexes, a 2D graphical representation of protein sequences is introduced; meanwhile, a new numerical characteristic has been proposed to compute the distance of different sequences for analysis of sequence similarity/dissimilarity on the basis of this graphical representation. Furthermore, we apply the new distance in the similarities/dissimilarities of ND5 proteins of nine species and predict the four major classes based on the dataset containing 639 domains. The results show that the method is simple and effective. PMID:25705698

  8. MitoFates: improved prediction of mitochondrial targeting sequences and their cleavage sites.

    PubMed

    Fukasawa, Yoshinori; Tsuji, Junko; Fu, Szu-Chin; Tomii, Kentaro; Horton, Paul; Imai, Kenichiro

    2015-04-01

    Mitochondria provide numerous essential functions for cells and their dysfunction leads to a variety of diseases. Thus, obtaining a complete mitochondrial proteome should be a crucial step toward understanding the roles of mitochondria. Many mitochondrial proteins have been identified experimentally but a complete list is not yet available. To fill this gap, methods to computationally predict mitochondrial proteins from amino acid sequence have been developed and are widely used, but unfortunately, their accuracy is far from perfect. Here we describe MitoFates, an improved prediction method for cleavable N-terminal mitochondrial targeting signals (presequences) and their cleavage sites. MitoFates introduces novel sequence features including positively charged amphiphilicity, presequence motifs, and position weight matrices modeling the presequence cleavage sites. These features are combined with classical ones such as amino acid composition and physico-chemical properties as input to a standard support vector machine classifier. On independent test data, MitoFates attains better performance than existing predictors in both detection of presequences and in predicting their cleavage sites. We used MitoFates to look for undiscovered mitochondrial proteins from 42,217 human proteins (including isoforms such as alternative splicing or translation initiation variants). MitoFates predicts 1167 genes to have at least one isoform with a presequence. Five-hundred and eighty of these genes were not annotated as mitochondrial in either UniProt or Gene Ontology. Interestingly, these include candidate regulators of parkin translocation to damaged mitochondria, and also many genes with known disease mutations, suggesting that careful investigation of MitoFates predictions may be helpful in elucidating the role of mitochondria in health and disease. MitoFates is open source with a convenient web server publicly available.

  9. Isolation, sequencing and expression of Bartonella henselae omp43 and predicted membrane topology of the deduced protein.

    PubMed

    Burgess, A W; Paquet, J Y; Letesson, J J; Anderson, B E

    2000-08-01

    The infection of human endothelial cells by, and their interaction with, Bartonella henselae is one of the most interesting aspects of Bartonella-associated disease. The gene encoding the 43 kDa B. henselae outer membrane protein (Omp43) that binds endothelial cells was cloned and sequenced. Sequence analysis revealed an open reading frame of 1206 nucleotides coding for a protein of 402 amino acids. Analysis of the deduced amino acid sequence shows 38% identity over the entire sequence to the Brucella spp. Omp2b porin and also shows a signal sequence and peptidase cleavage site. Cleavage of the signal peptide results in a mature 380 amino acid polypeptide with a predicted molecular weight of 42 kDa. Omp43 was expressed in Escherichia coli as a fusion protein. Purified recombinant Omp43 at concentrations of 11 and 2.75 microg/ml bound to intact human umbilical vein endothelial cells. Membrane topology analysis predicts that Omp43 exists as a 16-stranded beta barrel protein, similar to that predicted for the Omp2b Brucella abortus porin. Characterization and expression of the gene encoding Omp43 should provide a tool for further investigation of the role of adherence to endothelial cells in the pathogenesis of B. henselae. Copyright 2000 Academic Press.

  10. Nine-amino-acid transactivation domain: establishment and prediction utilities.

    PubMed

    Piskacek, Simona; Gregor, Martin; Nemethova, Maria; Grabner, Martin; Kovarik, Pavel; Piskacek, Martin

    2007-06-01

    Here we describe the establishment and prediction utilities for a novel nine-amino-acid transactivation domain, 9aa TAD, that is common to the transactivation domains of a large number of yeast and animal transcription factors. We show that the 9aa TAD motif is required for the function of the transactivation domain of Gal4 and the related transcription factors Oaf1 and Pip2. The 9aa TAD possesses an autonomous transactivation activity in yeast and mammalian cells. Using sequence alignment and experimental data we derived a pattern that can be used for 9aa TAD prediction. The pattern allows the identification of 9aa TAD in other Gal4 family members or unrelated yeast, animal, and viral transcription factors. Thus, the 9aa TAD represents the smallest known denominator for a broad range of transcription factors. The wide occurrence of the 9aa TAD suggests that this domain mediates conserved interactions with general transcriptional cofactors. A computational search for the 9aa TAD is available online from National EMBnet-Node Austria at http://www.at.embnet.org/toolbox/9aatad/.

  11. Prediction of Staphylococcus aureus Antimicrobial Resistance by Whole-Genome Sequencing

    PubMed Central

    Price, J. R.; Cole, K.; Everitt, R.; Morgan, M.; Finney, J.; Kearns, A. M.; Pichon, B.; Young, B.; Wilson, D. J.; Llewelyn, M. J.; Paul, J.; Peto, T. E. A.; Crook, D. W.; Walker, A. S.; Golubchik, T.

    2014-01-01

    Whole-genome sequencing (WGS) could potentially provide a single platform for extracting all the information required to predict an organism's phenotype. However, its ability to provide accurate predictions has not yet been demonstrated in large independent studies of specific organisms. In this study, we aimed to develop a genotypic prediction method for antimicrobial susceptibilities. The whole genomes of 501 unrelated Staphylococcus aureus isolates were sequenced, and the assembled genomes were interrogated using BLASTn for a panel of known resistance determinants (chromosomal mutations and genes carried on plasmids). Results were compared with phenotypic susceptibility testing for 12 commonly used antimicrobial agents (penicillin, methicillin, erythromycin, clindamycin, tetracycline, ciprofloxacin, vancomycin, trimethoprim, gentamicin, fusidic acid, rifampin, and mupirocin) performed by the routine clinical laboratory. We investigated discrepancies by repeat susceptibility testing and manual inspection of the sequences and used this information to optimize the resistance determinant panel and BLASTn algorithm. We then tested performance of the optimized tool in an independent validation set of 491 unrelated isolates, with phenotypic results obtained in duplicate by automated broth dilution (BD Phoenix) and disc diffusion. In the validation set, the overall sensitivity and specificity of the genomic prediction method were 0.97 (95% confidence interval [95% CI], 0.95 to 0.98) and 0.99 (95% CI, 0.99 to 1), respectively, compared to standard susceptibility testing methods. The very major error rate was 0.5%, and the major error rate was 0.7%. WGS was as sensitive and specific as routine antimicrobial susceptibility testing methods. WGS is a promising alternative to culture methods for resistance prediction in S. aureus and ultimately other major bacterial pathogens. PMID:24501024
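
    For readers unfamiliar with the reported metrics, sensitivity and specificity of a genotypic prediction against phenotypic testing can be computed as in the sketch below. The six isolates are made-up toy data, not results from the study, and the function name is an assumption for illustration.

```python
def sensitivity_specificity(predicted_resistant, phenotype_resistant):
    """Compare genotypic predictions with phenotypic results (lists of bool)."""
    pairs = list(zip(predicted_resistant, phenotype_resistant))
    tp = sum(p and t for p, t in pairs)              # resistant, called resistant
    tn = sum((not p) and (not t) for p, t in pairs)  # susceptible, called susceptible
    fn = sum((not p) and t for p, t in pairs)        # resistant, called susceptible
    fp = sum(p and (not t) for p, t in pairs)        # susceptible, called resistant
    return tp / (tp + fn), tn / (tn + fp)


# Toy example (hypothetical isolates): True means resistant.
predicted = [True, True, False, False, True, False]
phenotype = [True, True, True, False, False, False]
sensitivity, specificity = sensitivity_specificity(predicted, phenotype)
```

    Here the phenotypic result is treated as the reference standard, which is the convention the abstract describes.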

  12. Prediction of Staphylococcus aureus antimicrobial resistance by whole-genome sequencing.

    PubMed

    Gordon, N C; Price, J R; Cole, K; Everitt, R; Morgan, M; Finney, J; Kearns, A M; Pichon, B; Young, B; Wilson, D J; Llewelyn, M J; Paul, J; Peto, T E A; Crook, D W; Walker, A S; Golubchik, T

    2014-04-01

    Whole-genome sequencing (WGS) could potentially provide a single platform for extracting all the information required to predict an organism's phenotype. However, its ability to provide accurate predictions has not yet been demonstrated in large independent studies of specific organisms. In this study, we aimed to develop a genotypic prediction method for antimicrobial susceptibilities. The whole genomes of 501 unrelated Staphylococcus aureus isolates were sequenced, and the assembled genomes were interrogated using BLASTn for a panel of known resistance determinants (chromosomal mutations and genes carried on plasmids). Results were compared with phenotypic susceptibility testing for 12 commonly used antimicrobial agents (penicillin, methicillin, erythromycin, clindamycin, tetracycline, ciprofloxacin, vancomycin, trimethoprim, gentamicin, fusidic acid, rifampin, and mupirocin) performed by the routine clinical laboratory. We investigated discrepancies by repeat susceptibility testing and manual inspection of the sequences and used this information to optimize the resistance determinant panel and BLASTn algorithm. We then tested performance of the optimized tool in an independent validation set of 491 unrelated isolates, with phenotypic results obtained in duplicate by automated broth dilution (BD Phoenix) and disc diffusion. In the validation set, the overall sensitivity and specificity of the genomic prediction method were 0.97 (95% confidence interval [95% CI], 0.95 to 0.98) and 0.99 (95% CI, 0.99 to 1), respectively, compared to standard susceptibility testing methods. The very major error rate was 0.5%, and the major error rate was 0.7%. WGS was as sensitive and specific as routine antimicrobial susceptibility testing methods. WGS is a promising alternative to culture methods for resistance prediction in S. aureus and ultimately other major bacterial pathogens.

  13. Nucleic acid (cDNA) and amino acid sequences of alpha-type gliadins from wheat (Triticum aestivum).

    PubMed Central

    Kasarda, D D; Okita, T W; Bernardin, J E; Baecker, P A; Nimmo, C C; Lew, E J; Dietler, M D; Greene, F C

    1984-01-01

    The complete amino acid sequence for an alpha-type gliadin protein of wheat (Triticum aestivum Linnaeus) endosperm has been derived from a cloned cDNA sequence. An additional cDNA clone that corresponds to about 75% of a similar alpha-type gliadin has been sequenced and shows some important differences. About 97% of the composite sequence of A-gliadin (an alpha-type gliadin fraction) has also been obtained by direct amino acid sequencing. This sequence shows a high degree of similarity with amino acid sequences derived from both cDNA clones and is virtually identical to one of them. On the basis of sequence information, after loss of the signal sequence, the mature alpha-type gliadins may be divided into five different domains, two of which may have evolved from an ancestral gliadin gene, whereas the remaining three contain repeating sequences that may have developed independently. PMID:6589619

  14. Innovations in host and microbial sialic acid biosynthesis revealed by phylogenomic prediction of nonulosonic acid structure

    PubMed Central

    Lewis, Amanda L.; Desa, Nolan; Hansen, Elizabeth E.; Knirel, Yuriy A.; Gordon, Jeffrey I.; Gagneux, Pascal; Nizet, Victor; Varki, Ajit

    2009-01-01

    Sialic acids (Sias) are nonulosonic acid (NulO) sugars prominently displayed on vertebrate cells and occasionally mimicked by bacterial pathogens using homologous biosynthetic pathways. It has been suggested that Sias were an animal innovation and later emerged in pathogens by convergent evolution or horizontal gene transfer. To better illuminate the evolutionary processes underlying the phenomenon of Sia molecular mimicry, we performed phylogenomic analyses of biosynthetic pathways for Sias and related higher sugars derived from 5,7-diamino-3,5,7,9-tetradeoxynon-2-ulosonic acids. Examination of ≈1,000 sequenced microbial genomes indicated that such biosynthetic pathways are far more widely distributed than previously realized. Phylogenetic analysis, validated by targeted biochemistry, was used to predict NulO types (i.e., neuraminic, legionaminic, or pseudaminic acids) expressed by various organisms. This approach uncovered previously unreported occurrences of Sia pathways in pathogenic and symbiotic bacteria and identified at least one instance in which a human archaeal symbiont tentatively reported to express Sias in fact expressed the related pseudaminic acid structure. Evaluation of targeted phylogenies and protein domain organization revealed that the “unique” Sia biosynthetic pathway of animals was instead a much more ancient innovation. Pathway phylogenies suggest that bacterial pathogens may have acquired Sia expression via adaptation of pathways for legionaminic acid biosynthesis, one of at least 3 evolutionary paths for de novo Sia synthesis. Together, these data indicate that some of the long-standing paradigms in Sia biology should be reconsidered in a wider evolutionary context of the extended family of NulO sugars. PMID:19666579

  15. Prediction of antimicrobial peptides based on sequence alignment and feature selection methods.

    PubMed

    Wang, Ping; Hu, Lele; Liu, Guiyou; Jiang, Nan; Chen, Xiaoyun; Xu, Jianyong; Zheng, Wen; Li, Li; Tan, Ming; Chen, Zugen; Song, Hui; Cai, Yu-Dong; Chou, Kuo-Chen

    2011-04-13

    Antimicrobial peptides (AMPs) represent a class of natural peptides that form a part of the innate immune system, and this kind of 'nature's antibiotics' is quite promising for solving the problem of increasing antibiotic resistance. In view of this, it is highly desirable to develop an effective computational method for accurately predicting novel AMPs because it can provide us with more candidates and useful insights for drug design. In this study, a new method for predicting AMPs was implemented by integrating the sequence alignment method and the feature selection method. The overall jackknife success rate of the new predictor on a newly constructed benchmark dataset exceeded 80.23%, and the Matthews correlation coefficient was 0.73, indicating a good prediction. Moreover, an in-depth feature analysis indicates that the results are quite consistent with previous knowledge that some amino acids are preferential in AMPs and that these amino acids do play an important role in antimicrobial activity. For the convenience of experimental scientists who want to use the prediction method without following the mathematical details, a user-friendly web-server is provided at http://amp.biosino.org/.
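
    The Matthews correlation coefficient quoted above summarizes a binary confusion matrix in a single value between -1 and 1. A minimal sketch of the standard formula follows; the counts in the example call are invented for illustration and are not taken from the study.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal total is zero.
    return numerator / denominator if denominator else 0.0


# Hypothetical counts: 45 true positives, 40 true negatives,
# 10 false positives, 5 false negatives.
mcc = matthews_corrcoef(tp=45, tn=40, fp=10, fn=5)
```

    Unlike raw accuracy, this coefficient stays informative when positive and negative classes are of very different sizes.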

  16. Energy minimization method using automata network for sequence and side-chain conformation prediction from given backbone geometry.

    PubMed

    Kono, H; Doi, J

    1994-07-01

    Globular proteins have high packing densities as a result of residue side chains in the core achieving a tight, complementary packing. The internal packing is considered the main determinant of native protein structure. From that point of view, we present here a method of energy minimization using an automata network to predict a set of amino acid sequences and their side-chain conformations from a desired backbone geometry for de novo design of proteins. Using discrete side-chain conformations, that is, rotamers, the problem of generating sequences from a given backbone geometry becomes a combinatorial problem. We focused on the residues composing the interior core region and predicted a set of amino acid sequences and their side-chain conformations only from a given backbone geometry. The kinds of residues were restricted to six hydrophobic amino acids (Ala, Ile, Met, Leu, Phe, and Val) because the core regions are almost always composed of hydrophobic residues. The obtained sequences were well packed, as was the native sequence. The method can be used for automated sequence generation in the de novo design of proteins.

  17. Sequence-Based Prediction of RNA-Binding Proteins Using Random Forest with Minimum Redundancy Maximum Relevance Feature Selection

    PubMed Central

    Ma, Xin; Guo, Jing; Sun, Xiao

    2015-01-01

    The prediction of RNA-binding proteins is one of the most challenging problems in computational biology. Although some studies have investigated this problem, the accuracy of prediction is still not sufficient. In this study, a highly accurate method was developed to predict RNA-binding proteins from amino acid sequences using random forests with the minimum redundancy maximum relevance (mRMR) method, followed by incremental feature selection (IFS). We incorporated conjoint triad features and three novel features: binding propensity (BP), nonbinding propensity (NBP), and evolutionary information combined with physicochemical properties (EIPP). The results showed that these novel features have important roles in improving the performance of the predictor. Using the mRMR-IFS method, our predictor achieved the best performance (86.62% accuracy and 0.737 Matthews correlation coefficient). The high prediction accuracy suggests that our method can be a useful approach to identify RNA-binding proteins from sequence information. PMID:26543860

  18. Coevolutionary modeling of protein sequences: Predicting structure, function, and mutational landscapes

    NASA Astrophysics Data System (ADS)

    Weigt, Martin

    Over recent years, biological research has been revolutionized by experimental high-throughput techniques, in particular by next-generation sequencing technology. Unprecedented amounts of data are accumulating, and there is a growing demand for computational methods unveiling the information hidden in raw data, thereby increasing our understanding of complex biological systems. Statistical-physics models based on the maximum-entropy principle have, in the last few years, played an important role in this context. To give a specific example, proteins and many non-coding RNA show a remarkable degree of structural and functional conservation in the course of evolution, despite a large variability in amino acid sequences. We have developed a statistical-mechanics inspired inference approach, called Direct-Coupling Analysis, to link this sequence variability (easy to observe in sequence alignments, which are available in public sequence databases) to bio-molecular structure and function. In my presentation I will show how this methodology can be used (i) to infer contacts between residues and thus to guide tertiary and quaternary protein structure prediction and RNA structure prediction, (ii) to discriminate interacting from non-interacting protein families, and thus to infer conserved protein-protein interaction networks, and (iii) to reconstruct mutational landscapes and thus to predict the phenotypic effect of mutations. References [1] M. Figliuzzi, H. Jacquier, A. Schug, O. Tenaillon and M. Weigt "Coevolutionary landscape inference and the context-dependence of mutations in beta-lactamase TEM-1", Mol. Biol. Evol. (2015), doi: 10.1093/molbev/msv211 [2] E. De Leonardis, B. Lutz, S. Ratz, S. Cocco, R. Monasson, A. Schug, M. Weigt "Direct-Coupling Analysis of nucleotide coevolution facilitates RNA secondary and tertiary structure prediction", Nucleic Acids Research (2015), doi: 10.1093/nar/gkv932 [3] F. Morcos, A. Pagnani, B. Lunt, A. Bertolino, D. Marks, C

  19. Improved peptide elution time prediction for reversed-phase liquid chromatography-MS by incorporating peptide sequence information

    SciTech Connect

    Petritis, Konstantinos; Kangas, Lars J.; Yan, Bo; Monroe, Matthew E.; Strittmatter, Eric F.; Qian, Weijun; Adkins, Joshua N.; Moore, Ronald J.; Xu, Ying; Lipton, Mary S.; Camp, David G.; Smith, Richard D.

    2006-07-15

    We describe an improved artificial neural network (ANN)-based method for predicting peptide retention times in reversed phase liquid chromatography. In addition to the peptide amino acid composition, this study investigated several other peptide descriptors to improve the predictive capability, such as peptide length, sequence, hydrophobicity and hydrophobic moment, and nearest neighbor amino acid, as well as peptide predicted structural configurations (i.e., helix, sheet, coil). An ANN architecture that consisted of 1052 input nodes, 24 hidden nodes, and 1 output node was used to fully consider the amino acid residue sequence in each peptide. The network was trained using approximately 345,000 non-redundant peptides identified from a total of 12,059 LC-MS/MS analyses of more than 20 different organisms, and the predictive capability of the model was tested using 1303 confidently identified peptides that were not included in the training set. The model demonstrated an average elution time precision of approximately 1.5% and was able to distinguish among isomeric peptides based upon the inclusion of peptide sequence information. The prediction power represents a significant improvement over our earlier report (Petritis et al., Anal. Chem. 2003, 75, 1039-1048) and other previously reported models.
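
    To make the quoted 1052-24-1 architecture concrete, the sketch below runs one forward pass through a network of that shape. Only the layer sizes come from the abstract. The sigmoid activations, the random untrained weights, and the random input vector standing in for an encoded peptide are all assumptions, since the abstract does not describe the encoding or training procedure.

```python
import math
import random

N_INPUT, N_HIDDEN, N_OUTPUT = 1052, 24, 1  # layer sizes quoted in the abstract


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_layer(n_in, n_out, rng):
    """Small random weights plus a zero bias per unit (untrained placeholder)."""
    return [([rng.uniform(-0.05, 0.05) for _ in range(n_in)], 0.0)
            for _ in range(n_out)]


def forward(inputs, layers):
    """Propagate an input vector through fully connected sigmoid layers."""
    activations = inputs
    for layer in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(weights, activations)) + bias)
            for weights, bias in layer
        ]
    return activations


rng = random.Random(0)
network = [init_layer(N_INPUT, N_HIDDEN, rng), init_layer(N_HIDDEN, N_OUTPUT, rng)]
features = [rng.random() for _ in range(N_INPUT)]  # stand-in for an encoded peptide
normalized_elution_time = forward(features, network)[0]
```

    With a sigmoid output unit the prediction falls between 0 and 1, so it would be read as a normalized elution time under this assumed design.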

  20. Yeast prions and human prion-like proteins: sequence features and prediction methods.

    PubMed

    Cascarina, Sean M; Ross, Eric D

    2014-06-01

    Prions are self-propagating infectious protein isoforms. A growing number of prions have been identified in yeast, each resulting from the conversion of soluble proteins into an insoluble amyloid form. These yeast prions have served as a powerful model system for studying the causes and consequences of prion aggregation. Remarkably, a number of human proteins containing prion-like domains, defined as domains with compositional similarity to yeast prion domains, have recently been linked to various human degenerative diseases, including amyotrophic lateral sclerosis. This suggests that the lessons learned from yeast prions may help in understanding these human diseases. In this review, we examine what has been learned about the amino acid sequence basis for prion aggregation in yeast, and how this information has been used to develop methods to predict aggregation propensity. We then discuss how this information is being applied to understand human disease, and the challenges involved in applying yeast prediction methods to higher organisms.

  1. Stereochemical Sequence Ion Selectivity: Proline versus Pipecolic-acid-containing Protonated Peptides

    NASA Astrophysics Data System (ADS)

    Abutokaikah, Maha T.; Guan, Shanshan; Bythell, Benjamin J.

    2017-01-01

    Substitution of proline by pipecolic acid, the six-membered ring congener of proline, results in vastly different tandem mass spectra. The well-known proline effect is eliminated and amide bond cleavage C-terminal to pipecolic acid dominates instead. Why do these two ostensibly similar residues produce dramatically differing spectra? Recent evidence indicates that the proton affinities of these residues are similar, so are unlikely to explain the result [Raulfs et al., J. Am. Soc. Mass Spectrom. 25, 1705-1715 (2014)]. An additional hypothesis based on increased flexibility was also advocated. Here, we provide a computational investigation of the "pipecolic acid effect," to test this and other hypotheses to determine if theory can shed additional light on this fascinating result. Our calculations provide evidence for both the increased flexibility of pipecolic-acid-containing peptides, and structural changes in the transition structures necessary to produce the sequence ions. The most striking computational finding is inversion of the stereochemistry of the transition structures leading to "proline effect"-type amide bond fragmentation between the proline/pipecolic acid-congeners: R (proline) to S (pipecolic acid). Additionally, our calculations predict substantial stabilization of the amide bond cleavage barriers for the pipecolic acid congeners by reduction in deleterious steric interactions and provide evidence for the importance of experimental energy regime in rationalizing the spectra.

  2. Sequence-Based Predictions of Lipooligosaccharide Diversity in the Neisseriaceae and Their Implication in Pathogenicity

    PubMed Central

    Stein, Daniel C.; Miller, Clinton J.; Bhoopalan, Senthil V.; Sommer, Daniel D.

    2011-01-01

    Endotoxin [Lipopolysaccharide (LPS)/Lipooligosaccharide (LOS)] is an important virulence determinant in Gram-negative bacteria. While the genetic basis of endotoxin production and its role in disease in the pathogenic Neisseria have been extensively studied, little research has focused on the genetic basis of LOS biosynthesis in commensal Neisseria. We determined the genomic sequences of a variety of commensal Neisseria strains, and compared these sequences, along with other genomic sequences available from various sequencing centers from commensal and pathogenic strains, to identify genes involved in LOS biosynthesis. This allowed us to make structural predictions as to differences in LOS seen between commensal and pathogenic strains. We determined that all neisserial strains possess a conserved set of genes needed to make a common 3-Deoxy-D-manno-octulosonic acid-heptose core structure. However, significant genomic differences in glycosyl transferase genes support the published literature indicating compositional differences in the terminal oligosaccharides. This was most pronounced in commensal strains that were distantly related to the gonococcus and meningococcus. These strains possessed a homolog of heptosyltransferase III, suggesting that they differ from the pathogenic strains by the presence of a third heptose. Furthermore, most commensal strains possess homologs of genes needed to synthesize lipopolysaccharide (LPS). N. cinerea, a commensal species that is highly related to the gonococcus, has lost the ability to make sialyltransferase. Overall genomic comparisons of various neisserial strains indicate that significant recombination/genetic acquisition/loss has occurred within the genus, and this muddles proper speciation. PMID:21533118

  3. Random Amino Acid Mutations and Protein Misfolding Lead to Shannon Limit in Sequence-Structure Communication

    PubMed Central

    Lisewski, Andreas Martin

    2008-01-01

    The transmission of genomic information from coding sequence to protein structure during protein synthesis is subject to stochastic errors. To analyze transmission limits in the presence of spurious errors, Shannon's noisy channel theorem is applied to a communication channel between amino acid sequences and their structures established from a large-scale statistical analysis of protein atomic coordinates. While Shannon's theorem confirms that in close to native conformations information is transmitted with limited error probability, additional random errors in sequence (amino acid substitutions) and in structure (structural defects) trigger a decrease in communication capacity toward a Shannon limit at 0.010 bits per amino acid symbol at which communication breaks down. In several controls, simulated error rates above a critical threshold and models of unfolded structures always produce capacities below this limiting value. Thus an essential biological system can be realistically modeled as a digital communication channel that is (a) sensitive to random errors and (b) restricted by a Shannon error limit. This forms a novel basis for predictions consistent with observed rates of defective ribosomal products during protein synthesis, and with the estimated excess of mutual information in protein contact potentials. PMID:18769673
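
    To make the idea of a capacity that shrinks as errors accumulate more concrete, here is a minimal Python sketch. It does not reproduce the sequence-structure channel of the abstract; it assumes the textbook binary symmetric channel, whose capacity is C = 1 - H(p) bits per symbol for a symbol error rate p. The function names are illustrative only.

```python
import math


def binary_entropy(p: float) -> float:
    """Shannon entropy H(p), in bits, of a binary variable with P(1) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def bsc_capacity(error_rate: float) -> float:
    """Capacity, in bits per symbol, of a binary symmetric channel.

    This is an illustrative stand-in, not the channel model of the paper.
    """
    return 1.0 - binary_entropy(error_rate)


if __name__ == "__main__":
    # Capacity falls from 1 bit (no errors) to 0 bits (errors at chance level).
    for eps in (0.0, 0.01, 0.1, 0.3, 0.5):
        print(f"error rate {eps:.2f} -> capacity {bsc_capacity(eps):.3f} bits/symbol")
```

    In this toy model the capacity reaches zero at an error rate of 0.5, which plays the role of the point at which communication breaks down.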

  4. Prediction of enzymes and non-enzymes from protein sequences based on sequence derived features and PSSM matrix using artificial neural network.

    PubMed

    Naik, Pradeep Kumar; Mishra, Viplav Shankar; Gupta, Mukul; Jaiswal, Kunal

    2007-12-05

    Predicting enzymes and non-enzymes from protein sequence information is still an open problem in bioinformatics. It is becoming more important as the amount of sequence information grows exponentially over time. We describe a novel approach for predicting enzymes and non-enzymes from their amino acid sequences using an artificial neural network (ANN). Using 61 sequence derived features alone we have been able to achieve 79 percent correct prediction of enzymes/non-enzymes (in the set of 660 proteins). For the complete set of 61 parameters using 5-fold cross-validated classification, the ANN gives a superior model (accuracy = 78.79 plus or minus 6.86 percent, Q(pred) = 74.734 plus or minus 17.08 percent, sensitivity = 84.48 plus or minus 6.73 percent, specificity = 77.13 plus or minus 13.39 percent). The second ANN module is based on the PSSM matrix. Using the same 5-fold cross-validation set, this ANN model predicts enzymes/non-enzymes with higher accuracy (accuracy = 80.37 plus or minus 6.59 percent, Q(pred) = 67.466 plus or minus 12.41 percent, sensitivity = 0.9070 plus or minus 3.37 percent, specificity = 74.66 plus or minus 7.17 percent).

  5. Predictive genomics: a cancer hallmark network framework for predicting tumor clinical phenotypes using genome sequencing data.

    PubMed

    Wang, Edwin; Zaman, Naif; Mcgee, Shauna; Milanese, Jean-Sébastien; Masoudi-Nejad, Ali; O'Connor-McCourt, Maureen

    2015-02-01

    Tumor genome sequencing leads to documenting thousands of DNA mutations and other genomic alterations. At present, these data cannot be analyzed adequately to aid in the understanding of tumorigenesis and its evolution. Moreover, we have little insight into how to use these data to predict clinical phenotypes and tumor progression to better design patient treatment. To meet these challenges, we discuss a cancer hallmark network framework for modeling genome sequencing data to predict cancer clonal evolution and associated clinical phenotypes. The framework includes: (1) cancer hallmarks that can be represented by a few molecular/signaling networks. 'Network operational signatures' which represent gene regulatory logics/strengths enable to quantify state transitions and measures of hallmark traits. Thus, sets of genomic alterations which are associated with network operational signatures could be linked to the state/measure of hallmark traits. The network operational signature transforms genotypic data (i.e., genomic alterations) to regulatory phenotypic profiles (i.e., regulatory logics/strengths), to cellular phenotypic profiles (i.e., hallmark traits) which lead to clinical phenotypic profiles (i.e., a collection of hallmark traits). Furthermore, the framework considers regulatory logics of the hallmark networks under tumor evolutionary dynamics and therefore also includes: (2) a self-promoting positive feedback loop that is dominated by a genomic instability network and a cell survival/proliferation network is the main driver of tumor clonal evolution. Surrounding tumor stroma and its host immune systems shape the evolutionary paths; (3) cell motility initiating metastasis is a byproduct of the above self-promoting loop activity during tumorigenesis; (4) an emerging hallmark network which triggers genome duplication dominates a feed-forward loop which in turn could act as a rate-limiting step for tumor formation; (5) mutations and other genomic alterations have

  6. Acid mine drainage prediction and remediation

    SciTech Connect

    Robb, G.; Robinson, J.

    1996-12-31

    The use of constructed wetlands for treatment of acid mine drainage is discussed in the article. Drainage characteristics and mine water flow rate are identified as important predictors of remediation success. Aerobic and anaerobic chemical reaction processes are described. Problems and potential uses of wetlands are briefly described.

  7. Genome-Wide Prediction and Analysis of 3D-Domain Swapped Proteins in the Human Genome from Sequence Information

    PubMed Central

    Upadhyay, Atul Kumar; Sowdhamini, Ramanathan

    2016-01-01

    3D-domain swapping is one of the mechanisms of protein oligomerization and the proteins exhibiting this phenomenon have many biological functions. These proteins, which undergo domain swapping, have acquired much attention owing to their involvement in human diseases, such as conformational diseases, amyloidosis, serpinopathies, proteinopathies etc. Early realisation of proteins in the whole human genome that retain a tendency to domain swap will enable many aspects of disease control management. Predictive models were developed by using machine learning approaches with an average accuracy of 78% (85.6% of sensitivity, 87.5% of specificity and an MCC value of 0.72) to predict putative domain swapping in protein sequences. These models were applied to many complete genomes with special emphasis on the human genome. Nearly 44% of the protein sequences in the human genome were predicted positive for domain swapping. Enrichment analysis was performed on the positively predicted sequences from human genome for their domain distribution, disease association and functional importance based on Gene Ontology (GO). Enrichment analysis was also performed to infer a better understanding of the functional importance of these sequences. Finally, we developed hinge region prediction, in the given putative domain swapped sequence, by using important physicochemical properties of amino acids. PMID:27467780
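
    The performance figures quoted in abstracts like this one (accuracy, sensitivity, specificity, MCC) all derive from the four cells of a binary confusion matrix. The following Python sketch shows the standard definitions; the function name and the example counts are hypothetical and are not taken from the paper.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts.

    tp/fp/tn/fn are true positives, false positives, true negatives and
    false negatives. Ratios with an empty denominator are reported as 0.0.
    """
    total = tp + fp + tn + fn
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    accuracy = (tp + tn) / total if total else 0.0
    # Matthews correlation coefficient, ranging from -1 to +1.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    for name, value in binary_metrics(tp=40, fp=10, tn=45, fn=5).items():
        print(f"{name}: {value:.3f}")
```

    Because accuracy is a weighted average of sensitivity and specificity, it always lies between the two when all three are computed on the same data set.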

  9. Reticuloendotheliosis Virus Nucleic Acid Sequences in Cellular DNA

    PubMed Central

    Kang, Chil-Yong; Temin, Howard M.

    1974-01-01

    Reticuloendotheliosis virus 60S RNA labeled with 125I, or reticuloendotheliosis virus complementary DNA labeled with 3H, were hybridized to DNAs from infected chicken and pheasant cells. Most of the sequences of the viral RNA were found in the infected cell DNAs. The reticuloendotheliosis viruses, therefore, replicate through a DNA intermediate. The same labeled nucleic acids were hybridized to DNA of uninfected chicken, pheasant, quail, turkey, and duck. About 10% of the sequences of reticuloendotheliosis virus RNA were present in the DNA of uninfected chicken, pheasant, quail, and turkey. None were detected in DNA of duck. The specificity of the hybridization was shown by competition between unlabeled and 125I-labeled viral RNAs and by determination of melting temperatures. In contrast, 125I-labeled RNA of Rous-associated virus-O, an avian leukosis-sarcoma virus, hybridized 55% to DNA of uninfected chicken, 20% to DNA of uninfected pheasant, 15% to DNA of uninfected quail, 10% to DNA of uninfected turkey, and less than 1% to DNA of uninfected duck. PMID:4372393

  10. Nucleic acid (cDNA) and amino acid sequences of the maize endosperm protein glutelin-2.

    PubMed Central

    Prat, S; Cortadas, J; Puigdomènech, P; Palau, J

    1985-01-01

    The cDNA coding for a glutelin-2 protein from maize endosperm has been cloned and the complete amino acid sequence of the protein derived for the first time. An immature maize endosperm cDNA bank was screened for the expression of a beta-lactamase:glutelin-2 (G2) fusion polypeptide by using antibodies against the purified 28 kd G2 protein. A clone corresponding to the 28 kd G2 protein was sequenced and the primary structure of this protein was derived. Five regions can be defined in the protein sequence: an 11 residue N-terminal part, a repeated region formed by eight units of the sequence Pro-Pro-Pro-Val-His-Leu, an alternating Pro-X stretch 21 residues long, a Cys rich domain and a C-terminal part rich in Gln. The protein sequence is preceded by 19 residues which have the characteristics of the signal peptide found in secreted proteins. Unlike zeins, the main maize storage proteins, 28 kd glutelin-2 has several homologous sequences in common with other cereal storage proteins. PMID:3839076

  11. GeneMachine: gene prediction and sequence annotation.

    PubMed

    Makalowska, I; Ryan, J F; Baxevanis, A D

    2001-09-01

    A number of free-standing programs have been developed in order to help researchers find potential coding regions and deduce gene structure for long stretches of what is essentially 'anonymous DNA'. As these programs apply inherently different criteria to the question of what is and is not a coding region, multiple algorithms should be used in the course of positional cloning and positional candidate projects to assure that all potential coding regions within a previously-identified critical region are identified. We have developed a gene identification tool called GeneMachine which allows users to query multiple exon and gene prediction programs in an automated fashion. BLAST searches are also performed in order to see whether a previously-characterized coding region corresponds to a region in the query sequence. A suite of Perl programs and modules are used to run MZEF, GENSCAN, GRAIL 2, FGENES, RepeatMasker, Sputnik, and BLAST. The results of these runs are then parsed and written into ASN.1 format. Output files can be opened using NCBI Sequin, in essence using Sequin as both a workbench and as a graphical viewer. The main feature of GeneMachine is that the process is fully automated; the user is only required to launch GeneMachine and then open the resulting file with Sequin. Annotations can then be made to these results prior to submission to GenBank, thereby increasing the intrinsic value of these data. GeneMachine is freely-available for download at http://genome.nhgri.nih.gov/genemachine. A public Web interface to the GeneMachine server for academic and not-for-profit users is available at http://genemachine.nhgri.nih.gov. The Web supplement to this paper may be found at http://genome.nhgri.nih.gov/genemachine/supplement/.

  12. Complete amino acid sequence of a histidine-rich proteolytic fragment of human ceruloplasmin.

    PubMed

    Kingston, I B; Kingston, B L; Putnam, F W

    1979-04-01

    The complete amino acid sequence has been determined for a fragment of human ceruloplasmin [ferroxidase; iron(II):oxygen oxidoreductase, EC 1.16.3.1]. The fragment (designated Cp F5) contains 159 amino acid residues and has a molecular weight of 18,650; it lacks carbohydrate, is rich in histidine, and contains one free cysteine that may be part of a copper-binding site. This fragment is present in most commercial preparations of ceruloplasmin, probably owing to proteolytic degradation, but can also be obtained by limited cleavage of single-chain ceruloplasmin with plasmin. Cp F5 probably is an intact domain attached to the COOH-terminal end of single-chain ceruloplasmin via a labile interdomain peptide bond. A model of the secondary structure predicted by empirical methods suggests that almost one-third of the amino acid residues are distributed in alpha helices, about a third in beta-sheet structure, and the remainder in beta turns and unidentified structures. Computer analysis of the amino acid sequence has not demonstrated a statistically significant relationship between this ceruloplasmin fragment and any other protein, but there is some evidence for an internal duplication.

  13. The evolution of proteins from random amino acid sequences: II. Evidence from the statistical distributions of the lengths of modern protein sequences.

    PubMed

    White, S H

    1994-04-01

    This paper continues an examination of the hypothesis that modern proteins evolved from random heteropeptide sequences. In support of the hypothesis, White and Jacobs (1993, J Mol Evol 36:79-95) have shown that any sequence chosen randomly from a large collection of nonhomologous proteins has a 90% or better chance of having a lengthwise distribution of amino acids that is indistinguishable from the random expectation regardless of amino acid type. The goal of the present study was to investigate the possibility that the random-origin hypothesis could explain the lengths of modern protein sequences without invoking specific mechanisms such as gene duplication or exon splicing. The sets of sequences examined were taken from the 1989 PIR database and consisted of 1,792 "super-family" proteins selected to have little sequence identity, 623 E. coli sequences, and 398 human sequences. The length distributions of the proteins could be described with high significance by either of two closely related probability density functions: The gamma distribution with parameter 2 or the distribution for the sum of two exponential random independent variables. A simple theory for the distributions was developed which assumes that (1) protoprotein sequences had exponentially distributed random independent lengths, (2) the length dependence of protein stability determined which of these protoproteins could fold into compact primitive proteins and thereby attain the potential for biochemical activity, (3) the useful protein sequences were preserved by the primitive genome, and (4) the resulting distribution of sequence lengths is reflected by modern proteins. The theory successfully predicts the two observed distributions which can be distinguished by the functional form of the dependence of protein stability on length. The theory leads to three interesting conclusions. First, it predicts that a tetra-nucleotide was the signal for primitive translation termination. This prediction is
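
    The two "closely related" densities named in this abstract can be written down explicitly. The Python sketch below assumes the usual textbook forms: the gamma density with shape 2 and scale s, f(x) = x / s^2 * exp(-x / s), and the density of the sum of two independent exponential variables with distinct rates a and b, f(x) = a*b / (b - a) * (exp(-a*x) - exp(-b*x)). The function names and parameter values are illustrative, not fitted values from the paper.

```python
import math


def gamma2_pdf(x: float, scale: float) -> float:
    """Gamma density with shape parameter 2 and the given scale."""
    if x < 0:
        return 0.0
    return x / scale**2 * math.exp(-x / scale)


def sum_two_exp_pdf(x: float, rate_a: float, rate_b: float) -> float:
    """Density of X + Y for independent exponentials with rates rate_a, rate_b.

    When the two rates coincide, the sum is exactly gamma with shape 2.
    """
    if x < 0:
        return 0.0
    if math.isclose(rate_a, rate_b):
        return gamma2_pdf(x, 1.0 / rate_a)
    return rate_a * rate_b / (rate_b - rate_a) * (
        math.exp(-rate_a * x) - math.exp(-rate_b * x)
    )


def integrate(f, lo: float, hi: float, steps: int = 20000) -> float:
    """Trapezoidal-rule integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, steps):
        total += f(lo + i * h)
    return total * h


if __name__ == "__main__":
    # Hypothetical parameters: lengths in residues, mean length 2 * 150 = 300.
    print(integrate(lambda x: gamma2_pdf(x, 150.0), 0.0, 6000.0))
    print(integrate(lambda x: sum_two_exp_pdf(x, 1 / 100, 1 / 200), 0.0, 8000.0))
```

    Both integrals should come out close to 1, and the second density approaches the first as the two rates approach each other, which is the sense in which the two functions are closely related.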

  14. A systematic prediction of drug-target interactions using molecular fingerprints and protein sequences.

    PubMed

    Huang, Yu-An; You, Zhu-Hong; Chen, Xing

    2016-11-21

    Drug-Target Interactions (DTI) play a crucial role in discovering new drug candidates and finding new proteins to target for drug development. Although the number of detected DTI obtained by high-throughput techniques has been increasing, the number of known DTI is still limited. On the other hand, the experimental methods for detecting the interactions among drugs and proteins are costly and inefficient. Therefore, computational approaches for predicting DTI have been drawing increasing attention in recent years. In this paper, we report a novel computational model for predicting the DTI using an extremely randomized trees model and protein amino acid information. More specifically, the protein sequence is represented as a Pseudo Substitution Matrix Representation (Pseudo-SMR) descriptor in which the influence of biological evolutionary information is retained. For the representation of drug molecules, a novel fingerprint feature vector is utilized to describe its substructure information. Then the DTI pair is characterized by concatenating the two vector spaces of protein sequence and drug substructure. Finally, the proposed method is explored for predicting the DTI on four benchmark datasets: Enzyme, Ion Channel, GPCRs and Nuclear Receptor. The experimental results demonstrate that this method achieves promising prediction accuracies of 89.85%, 87.87%, 82.99% and 81.67%, respectively. For further evaluation, we compared the performance of the Extremely Randomized Trees model with that of the state-of-the-art Support Vector Machine classifier. We also compared the proposed model with existing computational models, and confirmed 15 potential drug-target interactions by searching existing databases. The experimental results show that the proposed method is feasible and promising for predicting drug-target interactions for new drug candidate screening based on sizeable features.

  15. RNAblueprint: flexible multiple target nucleic acid sequence design.

    PubMed

    Hammer, Stefan; Tschiatschek, Birgit; Flamm, Christoph; Hofacker, Ivo L; Findeiß, Sven

    2017-09-15

    Realizing the value of synthetic biology in biotechnology and medicine requires the design of molecules with specialized functions. Due to its close structure to function relationship, and the availability of good structure prediction methods and energy models, RNA is perfectly suited to be synthetically engineered with predefined properties. However, currently available RNA design tools cannot be easily adapted to accommodate new design specifications. Furthermore, complicated sampling and optimization methods are often developed to suit a specific RNA design goal, adding to their inflexibility. We developed a C++ library implementing a graph coloring approach to stochastically sample sequences compatible with structural and sequence constraints from the typically very large solution space. The approach allows one to specify and explore the solution space in a well defined way. Our library also guarantees uniform sampling, which makes optimization runs performant by not only avoiding re-evaluation of already found solutions, but also by raising the probability of finding better solutions for long optimization runs. We show that our software can be combined with any other software package to allow diverse RNA design applications. Scripting interfaces allow the easy adaptation of existing code to accommodate new scenarios, making the whole design process very flexible. We implemented example design approaches written in Python to demonstrate these advantages. RNAblueprint, Python implementations and benchmark datasets are available at github: https://github.com/ViennaRNA. Contact: s.hammer@univie.ac.at, ivo@tbi.univie.ac.at or sven@tbi.univie.ac.at. Supplementary data are available at Bioinformatics online.

  16. Complete amino acid sequence of chicken liver acyl carrier protein derived from the fatty acid synthase.

    PubMed

    Huang, W Y; Stoops, J K; Wakil, S J

    1989-04-01

    The acyl carrier protein domain of the chicken liver fatty acid synthase has been isolated after tryptic treatment of the synthase. The isolated domain functions as an acceptor of acetyl and malonyl moieties in the synthase-catalyzed transfer of these groups from their coenzyme A esters and therefore indicates that the acyl carrier protein domain exists in the complex as a discrete entity. The amino acid sequence of the acyl carrier protein was derived from analyses of peptide fragments produced by cyanogen bromide cleavage and trypsin and Staphylococcus aureus V8 protease digestions of the molecule. The isolated acyl carrier protein domain consists of 89 amino acid residues and has a calculated molecular weight of 10,127. The protein contains the phosphopantetheine group attached to the serine residue at position 38. The isolated acyl carrier protein peptide shows some sequence homology with the acyl carrier protein of Escherichia coli, particularly in the vicinity of the site of phosphopantetheine attachment, and shows extensive sequence homology with the acyl carrier protein from the uropygial gland of goose.

  17. 37 CFR 1.821 - Nucleotide and/or amino acid sequence disclosures in patent applications.

    Code of Federal Regulations, 2011 CFR

    2011-07-01

    ... 37 Patents, Trademarks, and Copyrights 1 2011-07-01 2011-07-01 false Nucleotide and/or amino acid... Biotechnology Invention Disclosures Application Disclosures Containing Nucleotide And/or Amino Acid Sequences § 1.821 Nucleotide and/or amino acid sequence disclosures in patent applications. (a) Nucleotide and...

  18. 37 CFR 1.821 - Nucleotide and/or amino acid sequence disclosures in patent applications.

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... 37 Patents, Trademarks, and Copyrights 1 2010-07-01 2010-07-01 false Nucleotide and/or amino acid... Biotechnology Invention Disclosures Application Disclosures Containing Nucleotide And/or Amino Acid Sequences § 1.821 Nucleotide and/or amino acid sequence disclosures in patent applications. (a) Nucleotide and...

  19. 37 CFR 1.821 - Nucleotide and/or amino acid sequence disclosures in patent applications.

    Code of Federal Regulations, 2013 CFR

    2013-07-01

    ... 37 Patents, Trademarks, and Copyrights 1 2013-07-01 2013-07-01 false Nucleotide and/or amino acid... Biotechnology Invention Disclosures Application Disclosures Containing Nucleotide And/or Amino Acid Sequences § 1.821 Nucleotide and/or amino acid sequence disclosures in patent applications. (a) Nucleotide and...

  20. 37 CFR 1.821 - Nucleotide and/or amino acid sequence disclosures in patent applications.

    Code of Federal Regulations, 2012 CFR

    2012-07-01

    ... 37 Patents, Trademarks, and Copyrights 1 2012-07-01 2012-07-01 false Nucleotide and/or amino acid... Biotechnology Invention Disclosures Application Disclosures Containing Nucleotide And/or Amino Acid Sequences § 1.821 Nucleotide and/or amino acid sequence disclosures in patent applications. (a) Nucleotide and...

  1. 37 CFR 1.821 - Nucleotide and/or amino acid sequence disclosures in patent applications.

    Code of Federal Regulations, 2014 CFR

    2014-07-01

    ... 37 Patents, Trademarks, and Copyrights 1 2014-07-01 2014-07-01 false Nucleotide and/or amino acid... Biotechnology Invention Disclosures Application Disclosures Containing Nucleotide And/or Amino Acid Sequences § 1.821 Nucleotide and/or amino acid sequence disclosures in patent applications. (a) Nucleotide and...

  2. Prediction of Protein Structural Classes for Low-Similarity Sequences Based on Consensus Sequence and Segmented PSSM.

    PubMed

    Liang, Yunyun; Liu, Sanyang; Zhang, Shengli

    2015-01-01

    Prediction of protein structural classes for low-similarity sequences is useful for understanding fold patterns, regulation, functions, and interactions of proteins. It is well known that feature extraction is significant to prediction of protein structural class and it mainly uses protein primary sequence, predicted secondary structure sequence, and position-specific scoring matrix (PSSM). Currently, prediction solely based on the PSSM has played a key role in improving the prediction accuracy. In this paper, we propose a novel method called CSP-SegPseP-SegACP by fusing consensus sequence (CS), segmented PsePSSM, and segmented autocovariance transformation (ACT) based on PSSM. Three widely used low-similarity datasets (1189, 25PDB, and 640) are adopted in this paper. Then a 700-dimensional (700D) feature vector is constructed and the dimension is decreased to 224D by using principal component analysis (PCA). To verify the performance of our method, rigorous jackknife cross-validation tests are performed on the 1189, 25PDB, and 640 datasets. Comparison of our results with the existing PSSM-based methods demonstrates that our method achieves favorable and competitive performance. This will offer an important complement to other PSSM-based methods for prediction of protein structural classes for low-similarity sequences.

  3. [Prediction of lipases types by different scale pseudo-amino acid composition].

    PubMed

    Zhang, Guangya; Li, Hongchun; Gao, Jiaqiang; Fang, Baishan

    2008-11-01

    Lipases are widely used enzymes in biotechnology. Although they catalyze the same reaction, their sequences vary. Therefore, it is highly desirable to develop a fast and reliable method to identify the types of lipases according to their sequences, or even just to confirm whether they are lipases or not. By proposing two scales based pseudo amino acid composition approaches to extract the features of the sequences, a powerful predictor based on k-nearest neighbor was introduced to address the problems. The overall success rates thus obtained by the 10-fold cross-validation test were shown as below: for predicting lipases and nonlipase, the success rates were 92.8%, 91.4% and 91.3%, respectively. For lipase types, the success rates were 92.3%, 90.3% and 89.7%, respectively. Among them, the Z scales based pseudo amino acid composition was the best, T scales was the second. They significantly outperformed 6 other frequently used sequence feature extraction methods. The high success rates yielded for such a stringent dataset indicate predicting the types of lipases is feasible and the different scales pseudo amino acid composition might be a useful tool for extracting the features of protein sequences, or at least can play a complementary role to many of the other existing approaches.

  4. Sequence and structural features of carbohydrate binding in proteins and assessment of predictability using a neural network

    PubMed Central

    Malik, Adeel; Ahmad, Shandar

    2007-01-01

    Background: Protein-Carbohydrate interactions are crucial in many biological processes with implications to drug targeting and gene expression. Nature of protein-carbohydrate interactions may be studied at individual residue level by analyzing local sequence and structure environments in binding regions in comparison to non-binding regions, which provide an inherent control for such analyses. With an ultimate aim of predicting binding sites from sequence and structure, overall statistics of binding regions needs to be compiled. Sequence-based predictions of binding sites have been successfully applied to DNA-binding proteins in our earlier works. We aim to apply similar analysis to carbohydrate binding proteins. However, due to a relatively much smaller region of proteins taking part in such interactions, the methodology and results are significantly different. A comparison of protein-carbohydrate complexes has also been made with other protein-ligand complexes. Results: We have compiled statistics of amino acid compositions in binding versus non-binding regions- general as well as in each different secondary structure conformation. Binding propensities of each of the 20 residue types and their structure features such as solvent accessibility, packing density and secondary structure have been calculated to assess their predisposition to carbohydrate interactions. Finally, evolutionary profiles of amino acid sequences have been used to predict binding sites using a neural network. Another set of neural networks was trained using information from single sequences and the prediction performance from the evolutionary profiles and single sequences were compared. Best of the neural network based prediction could achieve an 87% sensitivity of prediction at 23% specificity for all carbohydrate-binding sites, using evolutionary information. Single sequences gave 68% sensitivity and 55% specificity for the same data set. Sensitivity and specificity for a limited galactose

  5. Human liver apolipoprotein B-100 cDNA: complete nucleic acid and derived amino acid sequence.

    PubMed Central

    Law, S W; Grant, S M; Higuchi, K; Hospattankar, A; Lackner, K; Lee, N; Brewer, H B

    1986-01-01

    Human apolipoprotein B-100 (apoB-100), the ligand on low density lipoproteins that interacts with the low density lipoprotein receptor and initiates receptor-mediated endocytosis and low density lipoprotein catabolism, has been cloned, and the complete nucleic acid and derived amino acid sequences have been determined. ApoB-100 cDNAs were isolated from normal human liver cDNA libraries utilizing immunoscreening as well as filter hybridization with radiolabeled apoB-100 oligodeoxynucleotides. The apoB-100 mRNA is 14.1 kilobases long encoding a mature apoB-100 protein of 4536 amino acids with a calculated amino acid molecular weight of 512,723. ApoB-100 contains 20 potential glycosylation sites, and 12 of a total of 25 cysteine residues are located in the amino-terminal region of the apolipoprotein providing a potential globular structure of the amino terminus of the protein. ApoB-100 contains relatively few regions of amphipathic helices, but compared to other human apolipoproteins it is enriched in beta-structure. The delineation of the entire human apoB-100 sequence will now permit a detailed analysis of the conformation of the protein, the low density lipoprotein receptor binding domain(s), and the structural relationship between apoB-100 and apoB-48 and will provide the basis for the study of genetic defects in apoB-100 in patients with dyslipoproteinemias. PMID:3464946

  6. Computer selection of oligonucleotide probes from amino acid sequences for use in gene library screening.

    PubMed

    Yang, J H; Ye, J H; Wallace, D C

    1984-01-11

    We present a computer program, FINPROBE, which utilizes known amino acid sequence data to deduce minimum redundancy oligonucleotide probes for use in screening cDNA or genomic libraries or in primer extension. The user enters the amino acid sequence of interest, the desired probe length, the number of probes sought, and the constraints on oligonucleotide synthesis. The computer generates a table of possible probes listed in increasing order of redundancy and provides the location of each probe in the protein and mRNA coding sequence. Activation of a next function provides the amino acid and mRNA sequences of each probe of interest as well as the complementary sequence and the minimum dissociation temperature of the probe. A final routine prints out the amino acid sequence of the protein in parallel with the mRNA sequence listing all possible codons for each amino acid.
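
    The core idea of ranking candidate probes by redundancy can be sketched in a few lines of Python. This is not FINPROBE itself: it assumes that "redundancy" means the number of distinct oligonucleotides needed to cover every codon choice for a peptide window (the product of per-residue codon counts in the standard genetic code), and it ignores the synthesis constraints and melting-temperature estimates that the abstract mentions. The function name and the example peptide are hypothetical.

```python
from math import prod

# Number of codons per amino acid in the standard genetic code.
CODON_COUNT = {
    "L": 6, "S": 6, "R": 6,
    "A": 4, "G": 4, "P": 4, "T": 4, "V": 4,
    "I": 3,
    "C": 2, "D": 2, "E": 2, "F": 2, "H": 2, "K": 2, "N": 2, "Q": 2, "Y": 2,
    "M": 1, "W": 1,
}


def rank_probe_windows(protein: str, n_residues: int, n_probes: int):
    """Return up to n_probes (redundancy, start, peptide) tuples.

    Each tuple describes one window of n_residues amino acids (a probe of
    3 * n_residues nucleotides). Windows are sorted by increasing redundancy,
    with ties broken by position; start is 1-based.
    """
    windows = []
    for start in range(len(protein) - n_residues + 1):
        peptide = protein[start:start + n_residues]
        redundancy = prod(CODON_COUNT[aa] for aa in peptide)
        windows.append((redundancy, start + 1, peptide))
    windows.sort()
    return windows[:n_probes]


if __name__ == "__main__":
    # Hypothetical peptide, for illustration only.
    for redundancy, start, peptide in rank_probe_windows("MKWLSRMWHF", 3, 3):
        print(f"residues {start}-{start + 2}  {peptide}  redundancy {redundancy}")
```

    Windows rich in Met and Trp (one codon each) rank first, while windows containing Leu, Ser or Arg (six codons each) are pushed down the list.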

  7. Predicted Molecular Effects of Sequence Variants Link to System Level of Disease

    PubMed Central

    Bromberg, Yana; Rost, Burkhard

    2016-01-01

    Developments in experimental and computational biology are advancing our understanding of how protein sequence variation impacts molecular protein function. However, the leap from the micro level of molecular function to the macro level of the whole organism, e.g. disease, remains barred. Here, we present new results emphasizing earlier work that suggested some links from molecular function to disease. We focused on non-synonymous single nucleotide variants, also referred to as single amino acid variants (SAVs). Building upon OMIA (Online Mendelian Inheritance in Animals), we introduced a curated set of 117 disease-causing SAVs in animals. Methods optimized to capture effects upon molecular function often correctly predict human (OMIM) and animal (OMIA) Mendelian disease-causing variants. We also predicted effects of human disease-causing variants in the mouse model, i.e. we put OMIM SAVs into mouse orthologs. Overall, fewer variants were predicted with effect in the model organism than in the original organism. Our results, along with other recent studies, demonstrate that predictions of molecular effects capture some important aspects of disease. Thus, in silico methods focusing on the micro level of molecular function can help to understand the macro system level of disease. PMID:27536940

  8. AMS 4.0: consensus prediction of post-translational modifications in protein sequences.

    PubMed

    Plewczynski, Dariusz; Basu, Subhadip; Saha, Indrajit

    2012-08-01

    We present here the 2011 update of the AutoMotif Service (AMS 4.0), which predicts a wide selection of 88 different types of single amino acid post-translational modifications (PTMs) in protein sequences. The selection of experimentally confirmed modifications is acquired from the latest UniProt and Phospho.ELM databases for training. The sequence vicinity of each modified residue is represented using amino acid physico-chemical features encoded using high quality indices (HQI) obtained by automatic clustering of known indices extracted from the AAindex database. For each type of numerical representation, the method builds an ensemble of Multi-Layer Perceptron (MLP) pattern classifiers, each optimising a different objective during training (for example recall, precision or area under the ROC curve (AUC)). The consensus is built using brainstorming technology, which combines multi-objective instances of the machine learning algorithm with data fusion of different representations of the training objects, in order to boost the overall prediction accuracy of conserved short sequence motifs. The performance of AMS 4.0 is compared with the accuracy of previous versions, which were constructed using single machine learning methods (artificial neural networks, support vector machine). Our software improves the average AUC score of the earlier version by close to 7 % as calculated on the test datasets of all 88 PTM types. Moreover, for the selected most difficult sequence motif types it is able to improve the prediction performance by almost 32 %, when compared with previously used single machine learning methods. In summary, the brainstorming consensus meta-learning methodology on average boosts the AUC score up to around 89 %, averaged over all 88 PTM types. Detailed results for single machine learning methods and the consensus methodology are also provided, together with the comparison to previously published methods and state-of-the-art software tools.

  9. 37 CFR 1.822 - Symbols and format to be used for nucleotide and/or amino acid sequence data.

    Code of Federal Regulations, 2013 CFR

    2013-07-01

    ... for nucleotide and/or amino acid sequence data. 1.822 Section 1.822 Patents, Trademarks, and... Amino Acid Sequences § 1.822 Symbols and format to be used for nucleotide and/or amino acid sequence data. (a) The symbols and format to be used for nucleotide and/or amino acid sequence data shall...

  10. 37 CFR 1.822 - Symbols and format to be used for nucleotide and/or amino acid sequence data.

    Code of Federal Regulations, 2012 CFR

    2012-07-01

    ... for nucleotide and/or amino acid sequence data. 1.822 Section 1.822 Patents, Trademarks, and... Amino Acid Sequences § 1.822 Symbols and format to be used for nucleotide and/or amino acid sequence data. (a) The symbols and format to be used for nucleotide and/or amino acid sequence data shall...

  11. 37 CFR 1.822 - Symbols and format to be used for nucleotide and/or amino acid sequence data.

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... for nucleotide and/or amino acid sequence data. 1.822 Section 1.822 Patents, Trademarks, and... Amino Acid Sequences § 1.822 Symbols and format to be used for nucleotide and/or amino acid sequence data. (a) The symbols and format to be used for nucleotide and/or amino acid sequence data shall...

  12. 37 CFR 1.822 - Symbols and format to be used for nucleotide and/or amino acid sequence data.

    Code of Federal Regulations, 2014 CFR

    2014-07-01

    ... for nucleotide and/or amino acid sequence data. 1.822 Section 1.822 Patents, Trademarks, and... Amino Acid Sequences § 1.822 Symbols and format to be used for nucleotide and/or amino acid sequence data. (a) The symbols and format to be used for nucleotide and/or amino acid sequence data shall...

  13. 37 CFR 1.822 - Symbols and format to be used for nucleotide and/or amino acid sequence data.

    Code of Federal Regulations, 2011 CFR

    2011-07-01

    ... for nucleotide and/or amino acid sequence data. 1.822 Section 1.822 Patents, Trademarks, and... Amino Acid Sequences § 1.822 Symbols and format to be used for nucleotide and/or amino acid sequence data. (a) The symbols and format to be used for nucleotide and/or amino acid sequence data shall...

  14. NASP: a parallel program for identifying evolutionarily conserved nucleic acid secondary structures from nucleotide sequence alignments.

    PubMed

    Semegni, J Y; Wamalwa, M; Gaujoux, R; Harkins, G W; Gray, A; Martin, D P

    2011-09-01

    Many natural nucleic acid sequences have evolutionarily conserved secondary structures with diverse biological functions. A reliable computational tool for identifying such structures would be very useful in guiding experimental analyses of their biological functions. NASP (Nucleic Acid Structure Predictor) is a program that takes into account thermodynamic stability, Boltzmann base pair probabilities, alignment uncertainty, covarying sites and evolutionary conservation to identify biologically relevant secondary structures within multiple sequence alignments. Unique to NASP is the consideration of all this information together with a recursive permutation-based approach to progressively identify and list the most conserved probable secondary structures that are likely to have the greatest biological relevance. By focusing on identifying only evolutionarily conserved structures, NASP forgoes the prediction of complete nucleotide folds but outperforms various other secondary structure prediction methods in its ability to selectively identify actual base pairings. Downloadable and web-based versions of NASP are freely available at http://web.cbio.uct.ac.za/~yves/nasp_portal.php. Contact: yves@cbio.uct.ac.za. Supplementary data are available at Bioinformatics online.

  15. Can Computationally Designed Protein Sequences Improve Secondary Structure Prediction?

    DTIC Science & Technology

    2011-01-01

    SSP. We use the RosettaDesign program to generate sequences that are compatible with the structural classification of proteins (SCOP) database of...1997) using a significantly larger database of known structures than previously reported in the literature. Methods In this work, the Astral SCOP 1.75...6511 SCOP 1.75 domains were used after some domains were discarded due to large missing segments (Nres . 10), non-contiguities in the domain sequence

  16. Environment-specific amino acid substitution tables: tertiary templates and prediction of protein folds.

    PubMed Central

    Overington, J.; Donnelly, D.; Johnson, M. S.; Sali, A.; Blundell, T. L.

    1992-01-01

    The local environment of an amino acid in a folded protein determines the acceptability of mutations at that position. In order to characterize and quantify these structural constraints, we have made a comparative analysis of families of homologous proteins. Residues in each structure are classified according to amino acid type, secondary structure, accessibility of the side chain, and existence of hydrogen bonds from the side chains. Analysis of the pattern of observed substitutions as a function of local environment shows that there are distinct patterns, especially for buried polar residues. The substitution data tables are available on diskette with Protein Science. Given the fold of a protein, one is able to predict sequences compatible with the fold (profiles or templates) and potentially to discriminate between a correctly folded and misfolded protein. Conversely, analysis of residue variation across a family of aligned sequences in terms of substitution profiles can allow prediction of secondary structure or tertiary environment. PMID:1304904

  17. Human retroviruses and AIDS 1996. A compilation and analysis of nucleic acid and amino acid sequences

    SciTech Connect

    Myers, G.; Foley, B.; Korber, B.; Mellors, J.W.; Jeang, K.T.; Wain-Hobson, S.

    1997-04-01

    This compendium and the accompanying floppy diskettes are the result of an effort to compile and rapidly publish all relevant molecular data concerning the human immunodeficiency viruses (HIV) and related retroviruses. The scope of the compendium and database is best summarized by the five parts that it comprises: (1) Nucleic Acid Alignments and Sequences; (2) Amino Acid Alignments; (3) Analysis; (4) Related Sequences; and (5) Database Communications. Information within all the parts is updated throughout the year on the Web site, http://hiv-web.lanl.gov. While this publication could take the form of a review or sequence monograph, it is not so conceived. Instead, the literature from which the database is derived has simply been summarized and some elementary computational analyses have been performed upon the data. Interpretation and commentary have been avoided insofar as possible so that the reader can form his or her own judgments concerning the complex information. In addition to the general descriptions of the parts of the compendium, the user should read the individual introductions for each part.

  18. EASE-MM: Sequence-Based Prediction of Mutation-Induced Stability Changes with Feature-Based Multiple Models.

    PubMed

    Folkman, Lukas; Stantic, Bela; Sattar, Abdul; Zhou, Yaoqi

    2016-03-27

    Protein engineering and characterisation of non-synonymous single nucleotide variants (SNVs) require accurate prediction of protein stability changes (ΔΔGu) induced by single amino acid substitutions. Here, we have developed a new prediction method called Evolutionary, Amino acid, and Structural Encodings with Multiple Models (EASE-MM), which comprises five specialised support vector machine (SVM) models and makes the final prediction from a consensus of two models selected based on the predicted secondary structure and accessible surface area of the mutated residue. The new method is applicable to single-domain monomeric proteins and can predict ΔΔGu with a protein sequence and mutation as the only inputs. EASE-MM yielded a Pearson correlation coefficient of 0.53-0.59 in 10-fold cross-validation and independent testing and was able to outperform other sequence-based methods. When compared to structure-based energy functions, EASE-MM achieved a comparable or better performance. The application to a large dataset of human germline non-synonymous SNVs showed that the disease-causing variants tend to be associated with larger magnitudes of ΔΔGu predicted with EASE-MM. The EASE-MM web-server is available at http://sparks-lab.org/server/ease.

  19. Transcriptome Sequencing in Response to Salicylic Acid in Salvia miltiorrhiza

    PubMed Central

    Zhang, Xiaoru; Dong, Juane; Liu, Hailong; Wang, Jiao; Qi, Yuexin; Liang, Zongsuo

    2016-01-01

    Salvia miltiorrhiza is a traditional Chinese herbal medicine, whose quality and yield are often affected by diseases and environmental stresses during its growing season. Salicylic acid (SA) plays a significant role in plants responding to biotic and abiotic stresses, but the involved regulatory factors and their signaling mechanisms are largely unknown. In order to identify the genes involved in SA signaling, the RNA sequencing (RNA-seq) strategy was employed to evaluate the transcriptional profiles in S. miltiorrhiza cell cultures. A total of 50,778 unigenes were assembled, in which 5,316 unigenes were differentially expressed among 0-, 2-, and 8-h SA induction. The up-regulated genes were mainly involved in stimulus response and multi-organism process. A core set of candidate novel genes coding SA signaling component proteins was identified. Many transcription factors (e.g., WRKY, bHLH and GRAS) and genes involved in hormone signal transduction were differentially expressed in response to SA induction. Detailed analysis revealed that genes associated with defense signaling, such as antioxidant system genes, cytochrome P450s and ATP-binding cassette transporters, were significantly overexpressed, which can be used as genetic tools to investigate disease resistance. Our transcriptome analysis will help understand SA signaling and its mechanism of defense systems in S. miltiorrhiza. PMID:26808150

  20. RDNAnalyzer: A tool for DNA secondary structure prediction and sequence analysis.

    PubMed

    Afzal, Muhammad; Shahid, Ahmad Ali; Shehzadi, Abida; Nadeem, Shahid; Husnain, Tayyab

    2012-01-01

    RDNAnalyzer is an innovative computer-based tool designed for DNA secondary structure prediction and sequence analysis. It can randomly generate a DNA sequence, or users can upload sequences of their own interest in RAW format. It uses and extends the Nussinov dynamic programming algorithm and has various applications in sequence analysis. It predicts the DNA secondary structure and base pairings. It also provides tools for sequence analyses routinely performed by biological scientists, such as DNA replication, reverse complement generation, transcription, translation, sequence-specific information such as the total number of nucleotide bases and ATGC base contents along with their respective percentages, and a sequence cleaner. RDNAnalyzer is a unique tool developed in Microsoft Visual Studio 2008 using Microsoft Visual C# and Windows Presentation Foundation and provides a user-friendly environment for sequence analysis. It is freely available at http://www.cemb.edu.pk/sw.html. Abbreviations: RDNAnalyzer - Random DNA Analyser, GUI - Graphical user interface, XAML - Extensible Application Markup Language.
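
    For orientation, the textbook Nussinov recursion that the abstract refers to can be sketched in a few lines. This is only the basic base-pair-maximisation recursion, not RDNAnalyzer's extended version; the function name and the `min_loop` parameter (minimum number of unpaired bases enclosed by a pair) are assumptions made for illustration.

```python
# Watson-Crick pairs for DNA; an assumption of this sketch.
DNA_PAIRS = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}

def nussinov_max_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in seq (basic Nussinov recursion).

    dp[i][j] holds the best pair count for the subsequence seq[i..j].
    """
    n = len(seq)
    if n == 0:
        return 0
    dp = [[0] * n for _ in range(n)]
    # Fill the table by increasing subsequence span.
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: base j stays unpaired
            # case 2: base j pairs with some k, leaving at least min_loop bases between
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in DNA_PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]

if __name__ == "__main__":
    print(nussinov_max_pairs("GGGAAACCC"))
```

    The recursion runs in cubic time and counts pairs only; recovering the actual pairings needs a traceback through the table, and thermodynamic scoring is outside the scope of this sketch.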

  1. Predicting Shine–Dalgarno Sequence Locations Exposes Genome Annotation Errors

    PubMed Central

    Starmer, J; Stomp, A; Vouk, M; Bitzer, D

    2006-01-01

    In prokaryotes, Shine–Dalgarno (SD) sequences, nucleotides upstream from start codons on messenger RNAs (mRNAs) that are complementary to ribosomal RNA (rRNA), facilitate the initiation of protein synthesis. The location of SD sequences relative to start codons and the stability of the hybridization between the mRNA and the rRNA correlate with the rate of synthesis. Thus, accurate characterization of SD sequences enhances our understanding of how an organism's transcriptome relates to its cellular proteome. We implemented the Individual Nearest Neighbor Hydrogen Bond model for oligo–oligo hybridization and created a new metric, relative spacing (RS), to identify both the location and the hybridization potential of SD sequences by simulating the binding between mRNAs and single-stranded 16S rRNA 3′ tails. In 18 prokaryote genomes, we identified 2,420 genes out of 58,550 where the strongest binding in the translation initiation region included the start codon, deviating from the expected location for the SD sequence of five to ten bases upstream. We designated these as RS+1 genes. Additional analysis uncovered an unusual bias of the start codon in that the majority of the RS+1 genes used GUG, not AUG. Furthermore, of the 624 RS+1 genes whose SD sequence was associated with a free energy release of less than −8.4 kcal/mol (strong RS+1 genes), 384 were within 12 nucleotides upstream of in-frame initiation codons. The most likely explanation for the unexpected location of the SD sequence for these 384 genes is mis-annotation of the start codon. In this way, the new RS metric provides an improved method for gene sequence annotation. The remaining strong RS+1 genes appear to have their SD sequences in an unexpected location that includes the start codon. Thus, our RS metric provides a new way to explore the role of rRNA–mRNA nucleotide hybridization in translation initiation. PMID:16710451

  2. Human retroviruses and aids, 1992. A compilation and analysis of nucleic acid and amino acid sequences

    SciTech Connect

    Myers, G.; Korber, B.; Berzofsky, J.A.; Pavlakis, G.N.; Smith, R.F.

    1992-10-01

    This compendium and the accompanying floppy diskettes are the result of an effort to compile and rapidly publish all relevant molecular data concerning the human immunodeficiency viruses (HIV) and related retroviruses. The scope of the compendium and database is best summarized by the five parts that it comprises: (I) HIV and SIV Nucleotide Sequences; (II) Amino Acid Sequences; (III) Analyses; (IV) Related Sequences; and (V) Database Communications. Information within all the parts is updated at least twice in each year, which accounts for the modes of binding and pagination in the compendium. While this publication could take the form of a review or sequence monograph, it is not so conceived. Instead, the literature from which the database is derived has simply been summarized and some elementary computational analyses have been performed upon the data. Interpretation and commentary have been avoided insofar as possible so that the reader can form his or her own judgments concerning the complex information. In addition to the general descriptions below of the parts of the compendium, the user should read the individual introductions for each part.

  3. Uric acid excretion predicts increased aggression in urban adolescents.

    PubMed

    Mrug, Sylvie; Mrug, Michal

    2016-09-01

    Elevated levels of uric acid have been linked with impulsive and disinhibited behavior in clinical and community populations of adults, but no studies have examined uric acid in relation to adolescent aggression. This study examined the prospective role of uric acid in aggressive behavior among urban, low income adolescents, and whether this relationship varies by gender. A total of 84 adolescents (M age 13.36 years; 50% male; 95% African American) self-reported on their physical aggression at baseline and 1.5 years later. At baseline, the youth also completed a 12-h (overnight) urine collection at home which was used to measure uric acid excretion. After adjusting for baseline aggression and age, greater uric acid excretion predicted more frequent aggressive behavior at follow up, with no significant gender differences. The results suggest that lowering uric acid levels may help reduce youth aggression. Copyright © 2016 Elsevier Inc. All rights reserved.

  4. NetTurnP – Neural Network Prediction of Beta-turns by Use of Evolutionary Information and Predicted Protein Sequence Features

    PubMed Central

    Petersen, Bent; Lundegaard, Claus; Petersen, Thomas Nordahl

    2010-01-01

    β-turns are the most common type of non-repetitive structures, and constitute on average 25% of the amino acids in proteins. The formation of β-turns plays an important role in protein folding, protein stability and molecular recognition processes. In this work we present the neural network method NetTurnP, for prediction of two-class β-turns and prediction of the individual β-turn types, by use of evolutionary information and predicted protein sequence features. It has been evaluated against a commonly used dataset BT426, and achieves a Matthews correlation coefficient of 0.50, which is the highest reported performance on a two-class prediction of β-turn and not-β-turn. Furthermore NetTurnP shows improved performance on some of the specific β-turn types. In the present work, neural network methods have been trained to predict β-turn or not and individual β-turn types from the primary amino acid sequence. The individual β-turn types I, I', II, II', VIII, VIa1, VIa2, VIba and IV have been predicted based on classifications by PROMOTIF, and the two-class prediction of β-turn or not is a superset comprised of all β-turn types. The performance is evaluated using a golden set of non-homologous sequences known as BT426. Our two-class prediction method achieves a performance of: MCC = 0.50, Qtotal = 82.1%, sensitivity = 75.6%, PPV = 68.8% and AUC = 0.864. We have compared our performance to eleven other prediction methods that obtain Matthews correlation coefficients in the range of 0.17 – 0.47. For the type specific β-turn predictions, only type I and II can be predicted with reasonable Matthews correlation coefficients, where we obtain performance values of 0.36 and 0.31, respectively. Conclusion: The NetTurnP method has been implemented as a webserver, which is freely available at http://www.cbs.dtu.dk/services/NetTurnP/. NetTurnP is the only available webserver that allows submission of multiple sequences. PMID:21152409
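
    The two-class measures quoted above (MCC, Qtotal, sensitivity, PPV) all derive from a single confusion matrix. The sketch below uses their standard definitions; the function name and the example counts are hypothetical and are not taken from the NetTurnP evaluation. AUC is left out because it needs per-residue scores rather than counts.

```python
from math import sqrt

def _ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def two_class_metrics(tp, fp, tn, fn):
    """Standard two-class performance measures from confusion-matrix counts.

    tp/fn: turn residues predicted as turn / not-turn;
    tn/fp: non-turn residues predicted as not-turn / turn.
    """
    mcc_denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
        "q_total": _ratio(tp + tn, tp + fp + tn + fn),  # overall accuracy
        "sensitivity": _ratio(tp, tp + fn),             # recall on the positive class
        "ppv": _ratio(tp, tp + fp),                     # positive predictive value
    }

if __name__ == "__main__":
    # Illustrative counts only.
    for name, value in two_class_metrics(tp=50, fp=10, tn=30, fn=10).items():
        print(f"{name}: {value:.3f}")
```

    Because β-turn residues are the minority class, MCC is a more informative summary than Qtotal alone, which can stay high even when few turns are found.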

  5. A simple and fast approach to prediction of protein secondary structure from multiply aligned sequences with accuracy above 70%.

    PubMed Central

    Mehta, P. K.; Heringa, J.; Argos, P.

    1995-01-01

    To improve secondary structure predictions in protein sequences, the information residing in multiple sequence alignments of substituted but structurally related proteins is exploited. A database comprised of 70 protein families and a total of 2,500 sequences, some of which were aligned by tertiary structural superpositions, was used to calculate residue exchange weight matrices within alpha-helical, beta-strand, and coil substructures, respectively. Secondary structure predictions were made based on the observed residue substitutions in local regions of the multiple alignments and the largest possible associated exchange weights in each of the three matrix types. Comparison of the observed and predicted secondary structure on a per-residue basis yielded a mean accuracy of 72.2%. Individual alpha-helix, beta-strand, and coil states were respectively predicted at 66.7, and 75.8% correctness, representing a well-balanced three-state prediction. The accuracy level, verified by cross-validation through jack-knife tests on all protein families, dropped, on average, to only 70.9%, indicating the rigor of the prediction procedure. On the basis of robustness, conceptual clarity, accuracy, and executable efficiency, the method has considerable advantage, especially with its sole reliance on amino acid substitutions within structurally related proteins. PMID:8580842
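
    A per-residue comparison of observed and predicted states, as described above, can be computed as in this sketch. The function name and the one-letter state codes (H for helix, E for strand, C for coil) are assumptions for illustration; the strings in the example are made up and do not reproduce the paper's data.

```python
def per_residue_accuracy(observed, predicted):
    """Overall and per-state fraction of residues whose state is predicted correctly.

    observed and predicted are equal-length strings of state codes.
    Per-state accuracy is taken over residues observed in that state.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    matches = sum(o == p for o, p in zip(observed, predicted))
    overall = matches / len(observed)
    per_state = {}
    for state in sorted(set(observed)):
        positions = [i for i, o in enumerate(observed) if o == state]
        correct = sum(predicted[i] == state for i in positions)
        per_state[state] = correct / len(positions)
    return overall, per_state

if __name__ == "__main__":
    overall, per_state = per_residue_accuracy("HHHEEECCCC", "HHHEECCCCH")
    print(overall, per_state)
```

    In a jack-knife test of the kind the abstract mentions, each protein family would be left out in turn when deriving the exchange matrices and this score would be computed on the held-out family.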

  6. PreSSAPro: a software for the prediction of secondary structure by amino acid properties.

    PubMed

    Costantini, Susan; Colonna, Giovanni; Facchiano, Angelo M

    2007-10-01

    PreSSAPro is software, available to the scientific community as a free web service, designed to provide predictions of secondary structures starting from the amino acid sequence of a given protein. Predictions are based on our recently published work on the amino acid propensities for secondary structures both in large but non-homogeneous protein data sets and in smaller but homogeneous data sets corresponding to protein structural classes, i.e. all-alpha, all-beta, or alpha-beta proteins. Predictions are improved by the use of propensities evaluated for the right protein class. PreSSAPro predicts the secondary structure according to the right protein class, if known, or gives a multiple prediction with reference to the different structural classes. The comparison of these predictions represents a novel tool to evaluate which sequence regions can assume different secondary structures depending on the structural class assignment, with a view to identifying proteins able to fold in different conformations. The service is available at the URL http://bioinformatica.isa.cnr.it/PRESSAPRO/.

  7. Genome Sequence Analysis of the Naphthenic Acid Degrading and Metal Resistant Bacterium Cupriavidus gilardii CR3

    PubMed Central

    Xiao, Jingfa; Hao, Lirui; Crowley, David E.; Zhang, Zhewen; Yu, Jun; Huang, Ning; Huo, Mingxin; Wu, Jiayan

    2015-01-01

    Cupriavidus sp. are generally heavy metal tolerant bacteria with the ability to degrade a variety of aromatic hydrocarbon compounds, although the degradation pathways and substrate versatilities remain largely unknown. Here we studied the bacterium Cupriavidus gilardii strain CR3, which was isolated from a natural asphalt deposit, and which was shown to utilize naphthenic acids as a sole carbon source. Genome sequencing of C. gilardii CR3 was carried out to elucidate possible mechanisms for the naphthenic acid biodegradation. The genome of C. gilardii CR3 was composed of two circular chromosomes chr1 and chr2 of respectively 3,539,530 bp and 2,039,213 bp in size. The genome for strain CR3 encoded 4,502 putative protein-coding genes, 59 tRNA genes, and many other non-coding genes. Many genes were associated with xenobiotic biodegradation and metal resistance functions. Pathway prediction for degradation of cyclohexanecarboxylic acid, a representative naphthenic acid, suggested that naphthenic acid undergoes initial ring-cleavage, after which the ring fission products can be degraded via several plausible degradation pathways including a mechanism similar to that used for fatty acid oxidation. The final metabolic products of these pathways are unstable or volatile compounds that were not toxic to CR3. Strain CR3 was also shown to have tolerance to at least 10 heavy metals, which was mainly achieved by self-detoxification through ion efflux, metal-complexation and metal-reduction, and a powerful DNA self-repair mechanism. Our genomic analysis suggests that CR3 is well adapted to survive the harsh environment in natural asphalts containing naphthenic acids and high concentrations of heavy metals. PMID:26301592

  8. Genome Sequence Analysis of the Naphthenic Acid Degrading and Metal Resistant Bacterium Cupriavidus gilardii CR3.

    PubMed

    Wang, Xiaoyu; Chen, Meili; Xiao, Jingfa; Hao, Lirui; Crowley, David E; Zhang, Zhewen; Yu, Jun; Huang, Ning; Huo, Mingxin; Wu, Jiayan

    2015-01-01

    Cupriavidus sp. are generally heavy metal tolerant bacteria with the ability to degrade a variety of aromatic hydrocarbon compounds, although the degradation pathways and substrate versatilities remain largely unknown. Here we studied the bacterium Cupriavidus gilardii strain CR3, which was isolated from a natural asphalt deposit, and which was shown to utilize naphthenic acids as a sole carbon source. Genome sequencing of C. gilardii CR3 was carried out to elucidate possible mechanisms for the naphthenic acid biodegradation. The genome of C. gilardii CR3 was composed of two circular chromosomes chr1 and chr2 of respectively 3,539,530 bp and 2,039,213 bp in size. The genome for strain CR3 encoded 4,502 putative protein-coding genes, 59 tRNA genes, and many other non-coding genes. Many genes were associated with xenobiotic biodegradation and metal resistance functions. Pathway prediction for degradation of cyclohexanecarboxylic acid, a representative naphthenic acid, suggested that naphthenic acid undergoes initial ring-cleavage, after which the ring fission products can be degraded via several plausible degradation pathways including a mechanism similar to that used for fatty acid oxidation. The final metabolic products of these pathways are unstable or volatile compounds that were not toxic to CR3. Strain CR3 was also shown to have tolerance to at least 10 heavy metals, which was mainly achieved by self-detoxification through ion efflux, metal-complexation and metal-reduction, and a powerful DNA self-repair mechanism. Our genomic analysis suggests that CR3 is well adapted to survive the harsh environment in natural asphalts containing naphthenic acids and high concentrations of heavy metals.

  9. SuSPect: Enhanced Prediction of Single Amino Acid Variant (SAV) Phenotype Using Network Features

    PubMed Central

    Yates, Christopher M.; Filippis, Ioannis; Kelley, Lawrence A.; Sternberg, Michael J.E.

    2014-01-01

    Whole-genome and exome sequencing studies reveal many genetic variants between individuals, some of which are linked to disease. Many of these variants lead to single amino acid variants (SAVs), and accurate prediction of their phenotypic impact is important. Incorporating sequence conservation and network-level features, we have developed a method, SuSPect (Disease-Susceptibility-based SAV Phenotype Prediction), for predicting how likely SAVs are to be associated with disease. SuSPect performs significantly better than other available batch methods on the VariBench benchmarking dataset, with a balanced accuracy of 82%. SuSPect is available at www.sbg.bio.ic.ac.uk/suspect. The Web site has been implemented in Perl and SQLite and is compatible with modern browsers. An SQLite database of possible missense variants in the human proteome is available to download at www.sbg.bio.ic.ac.uk/suspect/download.html. PMID:24810707

  10. SuSPect: enhanced prediction of single amino acid variant (SAV) phenotype using network features.

    PubMed

    Yates, Christopher M; Filippis, Ioannis; Kelley, Lawrence A; Sternberg, Michael J E

    2014-07-15

    Whole-genome and exome sequencing studies reveal many genetic variants between individuals, some of which are linked to disease. Many of these variants lead to single amino acid variants (SAVs), and accurate prediction of their phenotypic impact is important. Incorporating sequence conservation and network-level features, we have developed a method, SuSPect (Disease-Susceptibility-based SAV Phenotype Prediction), for predicting how likely SAVs are to be associated with disease. SuSPect performs significantly better than other available batch methods on the VariBench benchmarking dataset, with a balanced accuracy of 82%. SuSPect is available at www.sbg.bio.ic.ac.uk/suspect. The Web site has been implemented in Perl and SQLite and is compatible with modern browsers. An SQLite database of possible missense variants in the human proteome is available to download at www.sbg.bio.ic.ac.uk/suspect/download.html. Copyright © 2014. Published by Elsevier Ltd.

  11. BepiPred-2.0: improving sequence-based B-cell epitope prediction using conformational epitopes.

    PubMed

    Jespersen, Martin Closter; Peters, Bjoern; Nielsen, Morten; Marcatili, Paolo

    2017-05-02

    Antibodies have become an indispensable tool for many biotechnological and clinical applications. They bind their molecular target (antigen) by recognizing a portion of its structure (epitope) in a highly specific manner. The ability to predict epitopes from antigen sequences alone is a complex task. Despite substantial effort, limited advancement has been achieved over the last decade in the accuracy of epitope prediction methods, especially for those that rely on the sequence of the antigen only. Here, we present BepiPred-2.0 (http://www.cbs.dtu.dk/services/BepiPred/), a web server for predicting B-cell epitopes from antigen sequences. BepiPred-2.0 is based on a random forest algorithm trained on epitopes annotated from antibody-antigen protein structures. This new method was found to outperform other available tools for sequence-based epitope prediction both on epitope data derived from solved 3D structures, and on a large collection of linear epitopes downloaded from the IEDB database. The method displays results in a user-friendly and informative way, both for computer-savvy and non-expert users. We believe that BepiPred-2.0 will be a valuable tool for the bioinformatics and immunology community. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  12. Completion of the amino acid sequence of the alpha 1 chain from type I calf skin collagen. Amino acid sequence of alpha 1(I)B8.

    PubMed Central

    Glanville, R W; Breitkreutz, D; Meitinger, M; Fietzek, P P

    1983-01-01

    The complete amino acid sequence of the 279-residue CNBr peptide CB8 from the alpha 1 chain of type I calf skin collagen is presented. It was determined by sequencing overlapping fragments of CB8 produced by Staphylococcus aureus V8 proteinase, trypsin, Endoproteinase Arg-C and hydroxylamine. Tryptic cleavages were also made specific for lysine by blocking arginine residues with cyclohexane-1,2-dione. This completes the amino acid sequence analysis of the 1054-residue-long alpha 1(I) chain of calf skin collagen. PMID:6354180

  13. OrfPredictor: predicting protein-coding regions in EST-derived sequences

    PubMed Central

    Min, Xiang Jia; Butler, Gregory; Storms, Reginald; Tsang, Adrian

    2005-01-01

    OrfPredictor is a web server designed for identifying protein-coding regions in expressed sequence tag (EST)-derived sequences. For query sequences with a hit in BLASTX, the program predicts the coding regions based on the translation reading frames identified in BLASTX alignments; otherwise, it predicts the most probable coding region based on the intrinsic signals of the query sequences. The output is the predicted peptide sequences in FASTA format, with a definition line that includes the query ID, the translation reading frame and the nucleotide positions where the coding region begins and ends. OrfPredictor facilitates the annotation of EST-derived sequences, particularly for large-scale EST projects. OrfPredictor is available at . PMID:15980561
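
    As an illustration of the kind of output described above (reading frame plus begin and end positions), the following is a minimal sketch that reports the longest stop-free stretch in the three forward reading frames. It is an assumption made for illustration only: the function name longest_open_frame is hypothetical, and the abstract does not say which intrinsic signals OrfPredictor actually uses.

```python
# Hypothetical sketch, not OrfPredictor's algorithm: pick the longest
# stretch of codons without a stop codon in the three forward frames.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_open_frame(seq):
    """Return (frame, start, end) of the longest stop-free codon run.

    frame is 1, 2 or 3; start and end are 1-based, inclusive nucleotide
    positions. (0, 0, 0) is returned when no codon-length run exists.
    """
    seq = seq.upper()
    best = (0, 0, 0)
    best_len = 0
    for offset in range(3):
        run_start = offset
        pos = offset
        while pos + 3 <= len(seq):
            if seq[pos:pos + 3] in STOP_CODONS:
                # A stop codon closes the current run (stop excluded).
                if pos - run_start > best_len:
                    best_len = pos - run_start
                    best = (offset + 1, run_start + 1, pos)
                run_start = pos + 3
            pos += 3
        # A run may also end at the end of the sequence.
        if pos - run_start > best_len:
            best_len = pos - run_start
            best = (offset + 1, run_start + 1, pos)
    return best
```

    Ties are resolved in favour of the earliest frame, which is an arbitrary choice of this sketch.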

  14. A novel web server predicts amino acid residue protection against hydrogen-deuterium exchange.

    PubMed

    Lobanov, Mikhail Yu; Suvorina, Masha Yu; Dovidchenko, Nikita V; Sokolovskiy, Igor V; Surin, Alexey K; Galzitskaya, Oxana V

    2013-06-01

    To clarify the relationship between structural elements and polypeptide chain mobility, a set of statistical analyses of structures is necessary. Because at present proteins with determined spatial structures are much less numerous than those with known amino acid sequences, it is important to be able to predict the extent of proton protection from hydrogen-deuterium (HD) exchange based solely on the protein primary structure. Here we present a novel web server that predicts the degree of amino acid residue protection against HD exchange solely from the primary structure of the protein chain under study. On the basis of the amino acid sequence, the server offers three predictors for the user to choose from. First, prediction of the number of contacts occurring in the protein, which is shown to be helpful in estimating the number of protons protected against HD exchange (sensitivity 0.71). Second, the probability of H-bonding in the protein, which is useful for finding the number of unprotected protons (specificity 0.71). The last is the use of an artificial predictor. We also report on mass spectrometry analysis of HD exchange, applied for the first time to free amino acids. Its results showed good agreement with theoretical data (number of protons) for 10 globular proteins (correlation coefficient 0.73). We are the first to compile two datasets of experimental HD exchange data for 35 proteins. The H-Protection server is available for users at http://bioinfo.protres.ru/ogp/. Supplementary data are available at Bioinformatics online.

  15. Multilign: an algorithm to predict secondary structures conserved in multiple RNA sequences

    PubMed Central

    Xu, Zhenjiang; Mathews, David H.

    2011-01-01

    Motivation: With recent advances in sequencing, structural and functional studies of RNA lag behind the discovery of sequences. Computational analysis of RNA is increasingly important to reveal structure–function relationships with low cost and speed. The purpose of this study is to use multiple homologous sequences to infer a conserved RNA structure. Results: A new algorithm, called Multilign, is presented to find the lowest free energy RNA secondary structure common to multiple sequences. Multilign is based on Dynalign, which is a program that simultaneously aligns and folds two sequences to find the lowest free energy conserved structure. For Multilign, Dynalign is used to progressively construct a conserved structure from multiple pairwise calculations, with one sequence used in all pairwise calculations. A base pair is predicted only if it is contained in the set of low free energy structures predicted by all Dynalign calculations. In this way, Multilign improves prediction accuracy by keeping the genuine base pairs and excluding competing false base pairs. Multilign has computational complexity that scales linearly in the number of sequences. Multilign was tested on extensive datasets of sequences with known structure and its prediction accuracy is among the best of available algorithms. Multilign can run on long sequences (> 1500 nt) and an arbitrarily large number of sequences. Availability: The algorithm is implemented in ANSI C++ and can be downloaded as part of the RNAstructure package at: http://rna.urmc.rochester.edu Contact: david_mathews@urmc.rochester.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:21193521
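
    The filtering rule stated above (a base pair is kept only if every pairwise calculation supports it) can be sketched as a set intersection. This is a simplified illustration under stated assumptions: the Dynalign calculations themselves are not shown, the input sets of (i, j) base pairs for the shared sequence are hypothetical, and conserved_pairs is not a function of the RNAstructure package.

```python
def conserved_pairs(pair_sets):
    """Keep only base pairs present in every pairwise calculation.

    pair_sets is a list of collections of (i, j) index pairs, one per
    pairwise calculation, all numbered on the shared sequence.
    """
    if not pair_sets:
        return set()
    result = set(pair_sets[0])
    for pairs in pair_sets[1:]:
        # Drop any pair that the next calculation did not also predict.
        result &= set(pairs)
    return result


# Hypothetical inputs: three pairwise calculations against one sequence.
calc_a = {(1, 20), (2, 19), (5, 12)}
calc_b = {(1, 20), (2, 19), (6, 11)}
calc_c = {(1, 20), (2, 19), (5, 12)}
print(sorted(conserved_pairs([calc_a, calc_b, calc_c])))
```

    Because each additional sequence contributes one more pairwise calculation against the same shared sequence, the amount of work in this sketch grows linearly with the number of sequences, in line with the scaling reported in the abstract.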

  16. MSACompro: protein multiple sequence alignment using predicted secondary structure, solvent accessibility, and residue-residue contacts.

    PubMed

    Deng, Xin; Cheng, Jianlin

    2011-12-14

    Multiple Sequence Alignment (MSA) is a basic tool for bioinformatics research and analysis. It has been used essentially in almost all bioinformatics tasks such as protein structure modeling, gene and protein function prediction, DNA motif recognition, and phylogenetic analysis. Therefore, improving the accuracy of multiple sequence alignment is important for advancing many bioinformatics fields. We designed and developed a new method, MSACompro, to synergistically incorporate predicted secondary structure, relative solvent accessibility, and residue-residue contact information into the currently most accurate posterior probability-based MSA methods to improve the accuracy of multiple sequence alignments. The method is different from the multiple sequence alignment methods (e.g. 3D-Coffee) that use the tertiary structure information of some sequences since the structural information of our method is fully predicted from sequences. To the best of our knowledge, applying predicted relative solvent accessibility and contact map to multiple sequence alignment is novel. The rigorous benchmarking of our method to the standard benchmarks (i.e. BAliBASE, SABmark and OXBENCH) clearly demonstrated that incorporating predicted protein structural information improves the multiple sequence alignment accuracy over the leading multiple protein sequence alignment tools without using this information, such as MSAProbs, ProbCons, Probalign, T-coffee, MAFFT and MUSCLE. And the performance of the method is comparable to the state-of-the-art method PROMALS of using structural features and additional homologous sequences by slightly lower scores. MSACompro is an efficient and reliable multiple protein sequence alignment tool that can effectively incorporate predicted protein structural information into multiple sequence alignment. The software is available at http://sysbio.rnet.missouri.edu/multicom_toolbox/.

  17. SeqAPASS (Sequence Alignment to Predict Across Species Susceptibility) software and documentation

    EPA Science Inventory

    SeqAPASS is a software application that facilitates rapid and streamlined, yet transparent, comparisons of the similarity of toxicologically-significant molecular targets across species. The present application facilitates analysis of primary amino acid sequence similarity (including ...

  19. Can Computationally Designed Protein Sequences Improve Secondary Structure Prediction?

    DTIC Science & Technology

    2011-01-01

    with the structural classification of proteins (SCOP) database of known structural domains (Kuhlman and Baker, 2000; Rohl et al., 2004). Secondary...reported in the literature. Methods In this work, the Astral SCOP 1.75 (Murzin et al., 1995; Hubbard et al., 1999) structural domain database filtered...entry matching the query test sequence can be left out. A total of 6511 SCOP 1.75 domains were used after some domains were discarded due to large

  20. Next Generation Sequencing in Predicting Gene Function in Podophyllotoxin Biosynthesis

    PubMed Central

    Marques, Joaquim V.; Kim, Kye-Won; Lee, Choonseok; Costa, Michael A.; May, Gregory D.; Crow, John A.; Davin, Laurence B.; Lewis, Norman G.

    2013-01-01

    Podophyllum species are sources of (−)-podophyllotoxin, an aryltetralin lignan used for semi-synthesis of various powerful and extensively employed cancer-treating drugs. Its biosynthetic pathway, however, remains largely unknown, with the last unequivocally demonstrated intermediate being (−)-matairesinol. Herein, massively parallel sequencing of Podophyllum hexandrum and Podophyllum peltatum transcriptomes and subsequent bioinformatics analyses of the corresponding assemblies were carried out. Validation of the assembly process was first achieved through confirmation of assembled sequences with those of various genes previously established as involved in podophyllotoxin biosynthesis as well as other candidate biosynthetic pathway genes. This contribution describes characterization of two of the latter, namely the cytochrome P450s, CYP719A23 from P. hexandrum and CYP719A24 from P. peltatum. Both enzymes were capable of converting (−)-matairesinol into (−)-pluviatolide by catalyzing methylenedioxy bridge formation and did not act on other possible substrates tested. Interestingly, the enzymes described herein were highly similar to methylenedioxy bridge-forming enzymes from alkaloid biosynthesis, whereas candidates more similar to lignan biosynthetic enzymes were catalytically inactive with the substrates employed. This overall strategy has thus enabled facile further identification of enzymes putatively involved in (−)-podophyllotoxin biosynthesis and underscores the deductive power of next generation sequencing and bioinformatics to probe and deduce medicinal plant biosynthetic pathways. PMID:23161544

  1. Complete amino acid sequence and structure characterization of the taste-modifying protein, miraculin.

    PubMed

    Theerasilp, S; Hitotsuya, H; Nakajo, S; Nakaya, K; Nakamura, Y; Kurihara, Y

    1989-04-25

    The taste-modifying protein, miraculin, has the unusual property of modifying sour taste into sweet taste. The complete amino acid sequence of miraculin purified from miracle fruits by a newly developed method (Theerasilp, S., and Kurihara, Y. (1988) J. Biol. Chem. 263, 11536-11539) was determined by an automatic Edman degradation method. Miraculin was a single polypeptide with 191 amino acid residues. The calculated molecular weight based on the amino acid sequence and the carbohydrate content (13.9%) was 24,600. Asn-42 and Asn-186 were linked N-glycosidically to carbohydrate chains. High homology was found between the amino acid sequences of miraculin and soybean trypsin inhibitor.

  2. Detection and isolation of nucleic acid sequences using a bifunctional hybridization probe

    DOEpatents

    Lucas, Joe N.; Straume, Tore; Bogen, Kenneth T.

    2000-01-01

    A method for detecting and isolating a target sequence in a sample of nucleic acids is provided using a bifunctional hybridization probe capable of hybridizing to the target sequence that includes a detectable marker and a first complexing agent capable of forming a binding pair with a second complexing agent. A kit is also provided for detecting a target sequence in a sample of nucleic acids using a bifunctional hybridization probe according to this method.

  3. A prediction of the amino acids and structures involved in DNA recognition by type I DNA restriction and modification enzymes.

    PubMed Central

    Sturrock, S S; Dryden, D T

    1997-01-01

    The S subunits of type I DNA restriction/modification enzymes are responsible for recognising the DNA target sequence for the enzyme. They contain two domains of approximately 150 amino acids, each of which is responsible for recognising one half of the bipartite asymmetric target. In the absence of any known tertiary structure for type I enzymes or recognisable DNA recognition motifs in the highly variable amino acid sequences of the S subunits, it has previously not been possible to predict which amino acids are responsible for sequence recognition. Using a combination of sequence alignment and secondary structure prediction methods to analyse the sequences of S subunits, we predict that all of the 51 known target recognition domains (TRDs) have the same tertiary structure. Furthermore, this structure is similar to the structure of the TRD of the C5-cytosine methyltransferase, Hha I, which recognises its DNA target via interactions with two short polypeptide loops and a beta strand. Our results predict the location of these sequence recognition structures within the TRDs of all type I S subunits. PMID:9254696

  4. Insight into Potential Probiotic Markers Predicted in Lactobacillus pentosus MP-10 Genome Sequence.

    PubMed

    Abriouel, Hikmate; Pérez Montoro, Beatriz; Casimiro-Soriguer, Carlos S; Pérez Pulido, Antonio J; Knapp, Charles W; Caballero Gómez, Natacha; Castillo-Gutiérrez, Sonia; Estudillo-Martínez, María D; Gálvez, Antonio; Benomar, Nabil

    2017-01-01

    Lactobacillus pentosus MP-10 is a potential probiotic lactic acid bacterium originally isolated from naturally fermented Aloreña green table olives. The entire genome sequence was annotated to analyze in silico the molecular mechanisms involved in the adaptation of L. pentosus MP-10 to the human gastrointestinal tract (GIT), such as carbohydrate metabolism (related to prebiotic utilization) and the proteins involved in bacteria-host interactions. We predicted an arsenal of genes coding for carbohydrate-modifying enzymes to modify oligo- and polysaccharides, such as glycoside hydrolases, glycoside transferases, and isomerases, and other enzymes involved in complex carbohydrate metabolism, especially starch, raffinose, and levan. These enzymes represent key indicators of the bacteria's adaptation to the GIT environment, since they involve the metabolism and assimilation of complex carbohydrates not digested by human enzymes. We also detected key probiotic ligands (surface proteins, excreted or secreted proteins) involved in the adhesion to host cells such as adhesion to mucus, epithelial cells or extracellular matrix, and plasma components; also, moonlighting proteins or multifunctional proteins were found that could be involved in adhesion to epithelial cells and/or extracellular matrix proteins and also affect host immunomodulation. In silico analysis of the genome sequence of L. pentosus MP-10 is an important initial step to screen for genes encoding proteins that may provide probiotic features, and thus provides a new route for screening and studying this potentially probiotic bacterium.

  5. Insight into Potential Probiotic Markers Predicted in Lactobacillus pentosus MP-10 Genome Sequence

    PubMed Central

    Abriouel, Hikmate; Pérez Montoro, Beatriz; Casimiro-Soriguer, Carlos S.; Pérez Pulido, Antonio J.; Knapp, Charles W.; Caballero Gómez, Natacha; Castillo-Gutiérrez, Sonia; Estudillo-Martínez, María D.; Gálvez, Antonio; Benomar, Nabil

    2017-01-01

    Lactobacillus pentosus MP-10 is a potential probiotic lactic acid bacterium originally isolated from naturally fermented Aloreña green table olives. The entire genome sequence was annotated to analyze in silico the molecular mechanisms involved in the adaptation of L. pentosus MP-10 to the human gastrointestinal tract (GIT), such as carbohydrate metabolism (related to prebiotic utilization) and the proteins involved in bacteria–host interactions. We predicted an arsenal of genes coding for carbohydrate-modifying enzymes to modify oligo- and polysaccharides, such as glycoside hydrolases, glycoside transferases, and isomerases, and other enzymes involved in complex carbohydrate metabolism, especially starch, raffinose, and levan. These enzymes represent key indicators of the bacteria’s adaptation to the GIT environment, since they involve the metabolism and assimilation of complex carbohydrates not digested by human enzymes. We also detected key probiotic ligands (surface proteins, excreted or secreted proteins) involved in the adhesion to host cells such as adhesion to mucus, epithelial cells or extracellular matrix, and plasma components; also, moonlighting proteins or multifunctional proteins were found that could be involved in adhesion to epithelial cells and/or extracellular matrix proteins and also affect host immunomodulation. In silico analysis of the genome sequence of L. pentosus MP-10 is an important initial step to screen for genes encoding proteins that may provide probiotic features, and thus provides a new route for screening and studying this potentially probiotic bacterium. PMID:28588563

  6. Toward a universal microarray: prediction of gene expression through nearest-neighbor probe sequence identification

    PubMed Central

    Royce, Thomas E.; Rozowsky, Joel S.; Gerstein, Mark B.

    2007-01-01

    A generic DNA microarray design applicable to any species would greatly benefit comparative genomics. We have addressed the feasibility of such a design by leveraging the great feature densities and relatively unbiased nature of genomic tiling microarrays. Specifically, we first divided each Homo sapiens Refseq-derived gene's spliced nucleotide sequence into all of its possible contiguous 25 nt subsequences. For each of these 25 nt subsequences, we searched a recent human transcript mapping experiment's probe design for the 25 nt probe sequence having the fewest mismatches with the subsequence, but that did not match the subsequence exactly. Signal intensities measured with each gene's nearest-neighbor features were subsequently averaged to predict their gene expression levels in each of the experiment's thirty-three hybridizations. We examined the fidelity of this approach in terms of both sensitivity and specificity for detecting actively transcribed genes, for transcriptional consistency between exons of the same gene, and for reproducibility between tiling array designs. Taken together, our results provide proof-of-principle for probing nucleic acid targets with off-target, nearest-neighbor features. PMID:17686789
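
    The probe-selection step described above can be sketched as follows: split a transcript into all contiguous subsequences of a fixed length, then, for each one, pick the probe with the fewest mismatches that is not an exact match. This is a minimal sketch under stated assumptions: the function names are hypothetical, the probe list is supplied by the caller, and the brute-force scan stands in for whatever search the authors actually used.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def subsequences(seq, k=25):
    """All contiguous length-k subsequences of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def nearest_neighbor_probe(subseq, probes):
    """Return (probe, mismatches) for the closest non-identical probe.

    Exact matches are skipped on purpose, mirroring the off-target
    design in the abstract. Returns None if no candidate remains.
    """
    best = None
    for probe in probes:
        distance = hamming(subseq, probe)
        if distance == 0:
            continue
        if best is None or distance < best[1]:
            best = (probe, distance)
    return best
```

    Averaging the measured intensities of the probes chosen this way for every subsequence of a gene would then give the per-gene expression estimate described in the abstract.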

  7. Robust prediction of B-factor profile from sequence using two-stage SVR based on random forest feature selection.

    PubMed

    Pan, Xiao-Yong; Shen, Hong-Bin

    2009-01-01

    B-factor, which measures the uncertainty in the position of an atom within a crystal structure, is highly correlated with protein internal motion. Although the rapid progress of structural biology in recent years makes more accurate protein structures available than ever, with the avalanche of new protein sequences emerging during the post-genomic era, the gap between the known protein sequences and the known protein structures becomes wider and wider. It is urgent to develop automated methods to predict the B-factor profile directly from amino acid sequences, so that these sequences can be used promptly in basic research. In this article, we propose a novel approach, called PredBF, to predict the real value of B-factor. We first extract both global and local features from the protein sequences as well as their evolutionary information; random forest feature selection is then applied to rank their importance, and the most important features are fed to a two-stage support vector regression (SVR) for prediction, where the initial predicted outputs from the first SVR are fed to the second-layer SVR for final refinement. Our results reveal that a systematic analysis of the importance of different features gives deep insight into their different contributions and is necessary for developing effective B-factor prediction tools. The two-layer SVR prediction model designed in this study further enhanced the robustness of predicting the B-factor profile. As a web server, PredBF is freely available at: http://www.csbio.sjtu.edu.cn/bioinf/PredBF for academic use.

  8. Complete amino acid sequence of a Lolium perenne (perennial rye grass) pollen allergen, Lol p II.

    PubMed

    Ansari, A A; Shenbagamurthi, P; Marsh, D G

    1989-07-05

    The complete amino acid sequence of a Lolium perenne (rye grass) pollen allergen, Lol p II was determined by automated Edman degradation of the protein and selected fragments. Cleavage of the protein by enzymatic and chemical techniques established an unambiguous sequence for the protein. Lol p II contains 97 amino acid residues, with a calculated molecular weight of 10,882. The protein lacks cysteine and glutamine and shows no evidence of glycosylation. Theoretical predictions by Fraga's (Fraga, S. (1982) Can. J. Chem. 60, 2606-2610) and Hopp and Woods' (Hopp, T. P., and Woods, K. R. (1981) Proc. Natl. Acad. Sci. U.S.A. 78, 3824-3828) methods indicate the presence of four hydrophilic regions, which may contribute to sequential or parts of conformational B-cell epitopes. Analysis of amphipathic regions by Berzofsky's method indicates the presence of a highly amphipathic region, which may contain, or contribute to, an Ia/T-cell epitope. This latter segment of Lol p II was found to be highly homologous with an antibody-binding segment of the major rye allergen Lol p I and may explain why immune responsiveness to both the allergens is associated with HLA-DR3.

  9. Prediction of Out-of-Sequence Development by BSID Scores.

    ERIC Educational Resources Information Center

    Richards, Ruth C.; And Others

    The primary purpose of this study was to examine uneven early development in premature infants. A multiple regression analysis was performed in which birth weight, length of gestation, length of assisted feeding, and length of ventilation were used to predict the discrepancy between a child's Psychomotor and Mental Scale scores on the Bayley…

  10. Structure Prediction and Analysis of Neuraminidase Sequence Variants

    ERIC Educational Resources Information Center

    Thayer, Kelly M.

    2016-01-01

    Analyzing protein structure has become an integral aspect of understanding systems of biochemical import. The laboratory experiment endeavors to introduce protein folding to ascertain structures of proteins for which the structure is unavailable, as well as to critically evaluate the quality of the prediction obtained. The model system used is the…

  13. Predicting nucleic acid binding interfaces from structural models of proteins

    PubMed Central

    Dror, Iris; Shazman, Shula; Mukherjee, Srayanta; Zhang, Yang; Glaser, Fabian; Mandel-Gutfreund, Yael

    2011-01-01

    The function of DNA- and RNA-binding proteins can be inferred from the characterization and accurate prediction of their binding interfaces. However, the main pitfall of various structure-based methods for predicting nucleic acid binding function is that they are all limited to a relatively small number of proteins for which high-resolution three-dimensional structures are available. In this study, we developed a pipeline for extracting functional electrostatic patches from surfaces of protein structural models, obtained using the I-TASSER protein structure predictor. The largest positive patches are extracted from the protein surface using the patchfinder algorithm. We show that functional electrostatic patches extracted from an ensemble of structural models highly overlap the patches extracted from high-resolution structures. Furthermore, by testing our pipeline on a set of 55 known nucleic acid binding proteins for which I-TASSER produces high-quality models, we show that the method accurately identifies the nucleic acid binding interface on structural models of proteins. Employing a combined patch approach, we show that patches extracted from an ensemble of models better predict the real nucleic acid binding interfaces compared to patches extracted from independent models. Overall, these results suggest that combining information from a collection of low-resolution structural models could be a valuable approach for functional annotation. We suggest that our method will be further applicable for predicting other functional surfaces of proteins with unknown structure. PMID:22086767

  14. Predicting nucleic acid binding interfaces from structural models of proteins.

    PubMed

    Dror, Iris; Shazman, Shula; Mukherjee, Srayanta; Zhang, Yang; Glaser, Fabian; Mandel-Gutfreund, Yael

    2012-02-01

    The function of DNA- and RNA-binding proteins can be inferred from the characterization and accurate prediction of their binding interfaces. However, the main pitfall of various structure-based methods for predicting nucleic acid binding function is that they are all limited to a relatively small number of proteins for which high-resolution three-dimensional structures are available. In this study, we developed a pipeline for extracting functional electrostatic patches from surfaces of protein structural models, obtained using the I-TASSER protein structure predictor. The largest positive patches are extracted from the protein surface using the patchfinder algorithm. We show that functional electrostatic patches extracted from an ensemble of structural models highly overlap the patches extracted from high-resolution structures. Furthermore, by testing our pipeline on a set of 55 known nucleic acid binding proteins for which I-TASSER produces high-quality models, we show that the method accurately identifies the nucleic acid binding interface on structural models of proteins. Employing a combined patch approach, we show that patches extracted from an ensemble of models better predict the real nucleic acid binding interfaces compared with patches extracted from independent models. Overall, these results suggest that combining information from a collection of low-resolution structural models could be a valuable approach for functional annotation. We suggest that our method will be further applicable for predicting other functional surfaces of proteins with unknown structure. Copyright © 2011 Wiley Periodicals, Inc.

  15. Urinary intestinal fatty acid binding protein predicts necrotizing enterocolitis.

    PubMed

    Gregory, Katherine E; Winston, Abigail B; Yamamoto, Hidemi S; Dawood, Hassan Y; Fashemi, Titilayo; Fichorova, Raina N; Van Marter, Linda J

    2014-06-01

    Necrotizing enterocolitis, characterized by sudden onset and rapid progression, remains the most significant gastrointestinal disorder among premature infants. In seeking a predictive biomarker, we found intestinal fatty acid binding protein, an indicator of enterocyte damage, was substantially increased within three and seven days before the diagnosis of necrotizing enterocolitis.

  16. Sequence conservation predicts T cell reactivity against ragweed allergens

    PubMed Central

    Pham, John; Oseroff, Carla; Hinz, Denise; Sidney, John; Paul, Sinu; Greenbaum, Jason; Vita, Randi; Phillips, Elizabeth; Mallal, Simon; Peters, Bjoern; Sette, Alessandro

    2016-01-01

    Background: Ragweed is a major cause of seasonal allergy, affecting millions of people worldwide. Several allergens have been defined based on IgE reactivity, but their relative immunogenicity in terms of T cell responses has not been studied. Objective: We comprehensively characterized T cell responses from atopic, ragweed-allergic subjects to Amb a 1, Amb a 3, Amb a 4, Amb a 5, Amb a 6, Amb a 8, Amb a 9, Amb a 10, Amb a 11, and Amb p 5, and examined their correlation with serological reactivity and sequence conservation in other allergens. Methods: Peripheral blood mononuclear cells (PBMCs) from donors positive for IgE toward ragweed extracts were tested, after in vitro expansion, for secretion of IL-5 (a representative Th2 cytokine) and IFNγ (Th1) in response to a panel of overlapping peptides spanning the above listed allergens. Results: Three previously identified dominant T cell epitopes (Amb a 1 176–191, 200–215, and 344–359) were confirmed and three novel dominant epitopes (Amb a 1 280–295, 304–319, and 320–335) were identified. Amb a 1, the dominant IgE allergen, was also the dominant T cell allergen, but dominance patterns for T cell and IgE responses for the other ragweed allergens did not correlate. Dominance for T cell responses correlated with conservation of ragweed epitopes with sequences of other well-known allergens. Conclusion and clinical relevance: These results provide the first assessment of the hierarchy of T cell reactivity in ragweed allergens, which is distinct from that observed for IgE reactivity and influenced by T cell epitope sequence conservation. The results suggest that ragweed allergens associated with lesser IgE reactivity and significant T cell reactivity may be targeted for T cell immunotherapy, and further support the development of immunotherapies against epitopes conserved across species to generate broad reactivity against many common allergens. PMID:27359111

  17. Amino acid sequences of alpha-helical segments from S-carboxymethylkerateine-A. Complete sequence of a type-I segment.

    PubMed Central

    Gough, K H; Inglis, A S; Crewther, W G

    1978-01-01

    The amino acid sequence of a type-I helical segment from the low-sulphur protein (S-carboxymethylkerateine-A) of wool was determined by combining automatic and manual-sequencing data. Whereas in the type-II helical segment most of the cationic groups occur in pairs, 11 of the 22 anionic residues in the sequence of the type-I segment were situated next to a second anionic residue. This suggests possible interactions between type-I and type-II helical segments in alpha-keratin. As observed with the sequence of a type-II helical segment, a model constructed on 3.6 residues per turn of helix shows a line of hydrophobic residues along the helix, thereby supporting the physicochemical evidence that the molecule is predominantly helical and forms part of a coiled-coil structure. Examination of the sequence data by predictive methods indicates the possibility of extensive sections of alpha-helix interspersed with discontinuities. The molecule contains a number of regions with peptide sequences identical with those found by other workers after enzymic digestion of fractions from oxidized wool. PMID:697725

  18. Representation of DNA sequences in genetic codon context with applications in exon and intron prediction.

    PubMed

    Yin, Changchuan

    2015-04-01

    To apply digital signal processing (DSP) methods to analyze DNA sequences, the sequences first must be specially mapped into numerical sequences. Thus, effective numerical mappings of DNA sequences play key roles in the effectiveness of DSP-based methods such as exon prediction. Despite numerous mappings of symbolic DNA sequences to numerical series, the existing mapping methods do not include the genetic coding features of DNA sequences. We present a novel numerical representation of DNA sequences using genetic codon context (GCC) in which the numerical values are optimized by simulated annealing to maximize the 3-periodicity signal to noise ratio (SNR). The optimized GCC representation is then applied in exon and intron prediction by a Short-Time Fourier Transform (STFT) approach. The results show the GCC method enhances the SNR values of exon sequences and thus increases the accuracy of predicting protein coding regions in genomes compared with the commonly used 4D binary representation. In addition, this study offers a novel way to reveal specific features of DNA sequences by optimizing numerical mappings of symbolic DNA sequences.
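
    The 3-periodicity signal mentioned above can be illustrated with a minimal sketch. It uses the 4D binary (indicator) representation that the abstract names as the common baseline, not the GCC mapping itself, whose values are not given here. The SNR definition (power at frequency N/3 divided by the mean power over non-zero frequencies) and all function names are assumptions made for illustration.

```python
import cmath


def binary_indicator(seq, base):
    """4D binary mapping: one 0/1 track per nucleotide."""
    return [1.0 if ch == base else 0.0 for ch in seq]


def power_spectrum(seq):
    """Total power P[k] = sum over A, C, G, T of |X_b[k]|^2 (naive DFT)."""
    n = len(seq)
    spectrum = [0.0] * n
    for base in "ACGT":
        x = binary_indicator(seq, base)
        for k in range(n):
            s = sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n))
            spectrum[k] += abs(s) ** 2
    return spectrum


def period3_snr(seq):
    """Assumed SNR: power at k = N/3 over mean power, DC term excluded."""
    n = len(seq) - len(seq) % 3  # truncate so that N/3 is an integer
    seq = seq[:n].upper()
    spec = power_spectrum(seq)
    mean_power = sum(spec[1:]) / (n - 1)
    return spec[n // 3] / mean_power
```

    A sequence with a strict period of three (such as a repeated codon) gives a large ratio, while a period-four repeat gives a ratio near zero. The naive DFT is quadratic in sequence length, so this sketch only suits short windows.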

  19. Comparative analysis of grapevine whole-genome gene predictions, functional annotation, categorization and integration of the predicted gene sequences

    PubMed Central

    2012-01-01

    Background The first draft assembly and gene prediction of the grapevine genome (8X base coverage) was made available to the scientific community in 2007, and functional annotation was developed on this gene prediction. Since then additional Sanger sequences were added to the 8X sequences pool and a new version of the genomic sequence with superior base coverage (12X) was produced. Results In order to more efficiently annotate the function of the genes predicted in the new assembly, it is important to build on as much of the previous work as possible, by transferring 8X annotation of the genome to the 12X version. The 8X and 12X assemblies and gene predictions of the grapevine genome were compared to answer the question, “Can we uniquely map 8X predicted genes to 12X predicted genes?” The results show that while the assemblies and gene structure predictions are too different to make a complete mapping between them, most genes (18,725) showed a one-to-one relationship between 8X predicted genes and the last version of 12X predicted genes. In addition, reshuffled genomic sequence structures appeared. These highlight regions of the genome where the gene predictions need to be taken with caution. Based on the new grapevine gene functional annotation and in-depth functional categorization, twenty eight new molecular networks have been created for VitisNet while the existing networks were updated. Conclusions The outcomes of this study provide a functional annotation of the 12X genes, an update of VitisNet, the system of the grapevine molecular networks, and a new functional categorization of genes. Data are available at the VitisNet website (http://www.sdstate.edu/ps/research/vitis/pathways.cfm). PMID:22554261

  20. DRNApred, fast sequence-based method that accurately predicts and discriminates DNA- and RNA-binding residues.

    PubMed

    Yan, Jing; Kurgan, Lukasz

    2017-06-02

    Protein-DNA and protein-RNA interactions are part of many diverse and essential cellular functions and yet most of them remain to be discovered and characterized. Recent research shows that sequence-based predictors of DNA-binding residues accurately find these residues but also cross-predict many RNA-binding residues as DNA-binding, and vice versa. Most of these methods are also relatively slow, prohibiting applications on the whole-genome scale. We describe a novel sequence-based method, DRNApred, which accurately and with high throughput predicts and discriminates between DNA- and RNA-binding residues. DRNApred was designed using a new dataset with both DNA- and RNA-binding proteins, regression that penalizes cross-predictions, and a novel two-layered architecture. DRNApred outperforms state-of-the-art predictors of DNA- or RNA-binding residues on a benchmark test dataset by substantially reducing the cross predictions and predicting arguably higher quality false positives that are located near the native binding residues. Moreover, it also more accurately predicts the DNA- and RNA-binding proteins. Application on the human proteome confirms that DRNApred reduces the cross predictions among the native nucleic acid binders. Also, novel putative DNA/RNA-binding proteins that it predicts share similar subcellular locations and residue charge profiles with the known native binding proteins. The DRNApred webserver is freely available at http://biomine.cs.vcu.edu/servers/DRNApred/. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  1. DNA sequencing and predictions of the cosmic theory of life

    NASA Astrophysics Data System (ADS)

    Wickramasinghe, N. Chandra

    2013-01-01

    The theory of cometary panspermia, developed by the late Sir Fred Hoyle and the present author, argues that life originated cosmically as a unique event in one of a great multitude of comets or planetary bodies in the Universe. Life on Earth did not originate here but was introduced by impacting comets, and its further evolution was driven by the subsequent acquisition of cosmically derived genes. Explicit predictions of this theory published in 1979-1981, stating how the acquisition of new genes drives evolution, are compared with recent developments in relation to horizontal gene transfer, and the role of retroviruses in evolution. Precisely-stated predictions of the theory of cometary panspermia are shown to have been verified.

  2. DNA Sequencing and Predictions of the Cosmic Theory of Life

    NASA Astrophysics Data System (ADS)

    Wickramasinghe, N. Chandra

    The theory of cometary panspermia, developed by the late Sir Fred Hoyle and the present author, argues that life originated cosmically as a unique event in one of a great multitude of comets or planetary bodies in the Universe. Life on Earth did not originate here but was introduced by impacting comets, and its further evolution was driven by the subsequent acquisition of cosmically derived genes. Explicit predictions of this theory published in 1979-1981, stating how the acquisition of new genes drives evolution, are compared with recent developments in relation to horizontal gene transfer, and the role of retroviruses in evolution. Precisely-stated predictions of the theory of cometary panspermia are shown to have been verified.

  3. Well-characterized sequence features of eukaryote genomes and implications for ab initio gene prediction.

    PubMed

    Huang, Ying; Chen, Shi-Yi; Deng, Feilong

    2016-01-01

    In silico analysis of DNA sequences is an important area of computational biology in the post-genomic era. Over the past two decades, computational approaches for ab initio prediction of gene structure from genome sequence alone have greatly facilitated our understanding of a variety of biological questions. Although the computational prediction of protein-coding genes is already well established, robustly finding non-coding RNA genes, such as miRNA and lncRNA, remains a challenge. The two main aspects of ab initio gene prediction are the computed values that describe sequence features and the algorithm used to train the discriminant function; different combinations of the two are employed in various bioinformatic tools. Herein, we briefly review these well-characterized sequence features in eukaryote genomes and their applications to ab initio gene prediction. The main purpose of this article is to provide an overview for beginners who aim to develop related bioinformatic tools.

  4. Prediction of Protein–Protein Interaction Sites in Sequences and 3D Structures by Random Forests

    PubMed Central

    Šikić, Mile; Tomić, Sanja; Vlahoviček, Kristian

    2009-01-01

    Identifying interaction sites in proteins provides important clues to the function of a protein and is becoming increasingly relevant in topics such as systems biology and drug discovery. Although there are numerous papers on the prediction of interaction sites using information derived from structure, there are only a few case reports on the prediction of interaction residues based solely on protein sequence. Here, a sliding window approach is combined with the Random Forests method to predict protein interaction sites using (i) a combination of sequence- and structure-derived parameters and (ii) sequence information alone. For sequence-based prediction we achieved a precision of 84% with a 26% recall and an F-measure of 40%. When combined with structural information, the prediction performance increases to a precision of 76% and a recall of 38% with an F-measure of 51%. We also present an attempt to rationalize the sliding window size and demonstrate that a nine-residue window is the most suitable for predictor construction. Finally, we demonstrate the applicability of our prediction methods by modeling the Ras–Raf complex using predicted interaction sites as target binding interfaces. Our results suggest that it is possible to predict protein interaction sites with quite a high accuracy using only sequence information. PMID:19180183
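
    The reported F-measures follow from the stated precision and recall as their harmonic mean, and the sliding-window idea can be sketched in a few lines. The padding character and function names below are illustrative assumptions; the features that the authors actually derive from each window are not described in the abstract.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def sliding_windows(sequence, size=9, pad="X"):
    """Return one window of `size` residues centred on each position.

    `size` is assumed to be odd; the ends are padded with a placeholder
    character (an assumption) so that every residue gets a full window.
    """
    half = size // 2
    padded = pad * half + sequence + pad * half
    return [padded[i:i + size] for i in range(len(sequence))]


# Sequence only: precision 0.84, recall 0.26
print(round(f_measure(0.84, 0.26), 2))
# Sequence plus structure: precision 0.76, recall 0.38
print(round(f_measure(0.76, 0.38), 2))
```

    With the figures quoted in the abstract, the two calls give values that round to 0.40 and 0.51, matching the reported F-measures.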

  5. ABRF ESRG 2005 Study: Identification of Seven Modified Amino Acids by Edman Sequencing

    PubMed Central

    Brune, D.; Denslow, N.D.; Kobayashi, R.; Lane, W.S.; Leone, J.W.; Madden, B.J.; Neveu, J. M.; Pohl, J.

    2006-01-01

    Identification of modified amino acids can be a challenging part of Edman degradation sequence analysis, largely because they are not included among the commonly used phenylthiohydantoin amino acid standards. Yet many can have unique retention times and can be assigned by an experienced researcher or through the use of a guide showing their typical chromatography characteristics. The Edman Sequencing Research Group (ESRG) 2005 study is a continuation of the 2004 study, in which the participating laboratories were provided a synthetic peptide and asked to identify the modified amino acids present in the sequence. The study sample provided an opportunity to sequence a peptide containing a variety of modified amino acids and note their retention times relative to the common amino acids. It also allowed the ESRG to compile the chromatographic properties and intensities from multiple instruments and tabulate an average elution position for these modified amino acids on commonly used instruments. Participating laboratories were given 2000 pmoles of a synthetic peptide, 18 amino acids long, containing the following modified amino acids: dimethyl- and trimethyl-lysine, 3-methyl-histidine, N-carbamyl-lysine, cystine, N-methyl-alanine, and isoaspartic acid. The modified amino acids were interspersed with standard amino acids to help in the assessment of initial and repetitive yields. In addition to filling in an assignment sheet, which included retention times and peak areas, participants were asked to provide specific details about the parameters used for the sequencing run. References for some of the modified amino acid elution characteristics were provided and the participants had the option of viewing a list of the modified amino acids present in the peptide at the ESRG Web site. The ABRF ESRG 2005 sample is the seventeenth in a series of studies designed to aid laboratories in evaluating their abilities to obtain and interpret amino acid sequence data. PMID:17122064

  6. Prediction of G protein-coupled receptor encoding sequences from the synganglion transcriptome of the cattle tick, Rhipicephalus microplus.

    PubMed

    Guerrero, Felix D; Kellogg, Anastasia; Ogrey, Alexandria N; Heekin, Andrew M; Barrero, Roberto; Bellgard, Matthew I; Dowd, Scot E; Leung, Ming-Ying

    2016-07-01

    The cattle tick, Rhipicephalus (Boophilus) microplus, is a pest which causes multiple health complications in cattle. The G protein-coupled receptor (GPCR) super-family presents a candidate target for developing novel tick control methods. However, GPCRs share limited sequence similarity among orthologous family members, and there is no reference genome available for R. microplus. This limits the effectiveness of alignment-dependent methods such as BLAST and Pfam for identifying GPCRs from R. microplus. However, GPCRs share a common structure consisting of seven transmembrane helices. We present an analysis of the R. microplus synganglion transcriptome using a combination of structurally-based and alignment-free methods which supplement the identification of GPCRs by sequence similarity. TMHMM predicts the number of transmembrane helices in a protein sequence. GPCRpred is a support vector machine-based method developed to predict and classify GPCRs using the dipeptide composition of a query amino acid sequence. These two bioinformatic tools were applied to our transcriptome assembly of the cattle tick synganglion. Together, BLAST and Pfam identified 85 unique contigs as encoding partial or full length candidate cattle tick GPCRs. Collectively, TMHMM and GPCRpred identified 27 additional GPCR candidates that BLAST and Pfam missed. This demonstrates that the addition of structurally-based and alignment-free bioinformatic approaches to transcriptome annotation and analysis produces a greater collection of prospective GPCRs than an analysis based solely upon methodologies dependent upon sequence alignment and similarity. Published by Elsevier GmbH.
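
    The dipeptide composition used as input by alignment-free classifiers of this kind can be sketched as a 400-element feature vector. This is an illustrative sketch, not GPCRpred's code: the normalisation (fraction of valid dipeptides) and the handling of non-standard residues are assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered residue pairs, in a fixed order.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return the 400 dipeptide frequencies of a protein sequence.

    Overlapping pairs are counted; pairs containing a non-standard
    residue are skipped (an assumption made for this sketch).
    """
    sequence = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]
```

    Because the vector has a fixed length regardless of protein length, it can be passed directly to a classifier such as a support vector machine without any sequence alignment.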

  7. Predicted phylogeny, secondary conformational structure, and epitope antigenicity of immunological sequences in poultry.

    PubMed

    Lara, L J; Peconick, A P; Fassani, É J; Júnior, A M P; Chalfun, P R B; Raymundo, D L; Barçante, T A; Barçante, J M de P

    2017-05-18

    Poultry production is faced with different types of stresses that are responsible for issues of animal welfare as well as for economic losses. Moreover, immunity decreases when animals are stressed. In silico analyses are important in reducing the cost and in increasing the accuracy of scientific results. A bioinformatics tool was used to perform ontology studies on 15 different immunological sequences of poultry. The mRNA structures and sequences with maximum antigenic residues were also predicted. No homology was found between the sequences of poultry and mammals. These results helped in the prediction of new potential molecular markers. Of the 15 sequences that were analyzed, predictions could not be made for five because they were longer than 2500 nucleotides; for the remaining 10 sequences, 20 conformational structures per sequence were predicted and the most stable sequences were identified by their minimum free energy values. The highest antigenic epitopes were accepted by the maximum scores; 15 of the total 8934 epitopes that were predicted were analyzed. These results would aid future studies that use synthetic peptides or recombinants as markers or immunomodulators and would expand our understanding of how stress can modulate the immune system. These would also help in developing rapid diagnostic tools, in increasing animal welfare, biosecurity, and productivity, and also in developing food additives and environmental enrichment for stress control, thereby making animal production more sustainable.

  8. Computational prediction of the tolerance to amino-acid deletion in green-fluorescent protein

    PubMed Central

    Jackson, Eleisha L.; Spielman, Stephanie J.

    2017-01-01

    Proteins evolve through two primary mechanisms: substitution, where mutations alter a protein’s amino-acid sequence, and insertions and deletions (indels), where amino acids are either added to or removed from the sequence. Protein structure has been shown to influence the rate at which substitutions accumulate across sites in proteins, but whether structure similarly constrains the occurrence of indels has not been rigorously studied. Here, we investigate the extent to which structural properties known to covary with protein evolutionary rates might also predict protein tolerance to indels. Specifically, we analyze a publicly available dataset of single-amino-acid deletion mutations in enhanced green fluorescent protein (eGFP) to assess how well the functional effect of deletions can be predicted from protein structure. We find that weighted contact number (WCN), which measures how densely packed a residue is within the protein’s three-dimensional structure, provides the best single predictor for whether eGFP will tolerate a given deletion. We additionally find that using protein design to explicitly model deletions results in improved predictions of functional status when combined with other structural predictors. Our work suggests that structure plays a fundamental role in constraining deletions at sites in proteins, and further that similar biophysical constraints influence both substitutions and deletions. This study therefore provides a solid foundation for future work to examine how protein structure influences tolerance of more complex indel events, such as insertions or large deletions. PMID:28369116
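
    Weighted contact number is commonly defined for residue i as the sum, over all other residues j, of the inverse squared distance 1/r_ij^2. A minimal sketch under that definition follows; which atom or centroid represents each residue (for example the alpha carbon or the side-chain centre of mass) is left to the caller and is an assumption here, as is the function name.

```python
def weighted_contact_numbers(coords):
    """Compute WCN_i = sum over j != i of 1 / r_ij^2.

    `coords` is a list of (x, y, z) tuples, one per residue. Two
    residues at identical coordinates would raise ZeroDivisionError.
    """
    n = len(coords)
    wcn = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
            # Each pair contributes equally to both residues.
            wcn[i] += 1.0 / d2
            wcn[j] += 1.0 / d2
    return wcn


# Three points on a line, 1 unit apart: the middle one is most packed.
print(weighted_contact_numbers([(0, 0, 0), (1, 0, 0), (2, 0, 0)]))
```

    Residues buried in the core have many close neighbours and therefore a high WCN, which is the sense in which the measure captures packing density.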

  9. Three Dimensional Structure Prediction of Fatty Acid Binding Site on Human Transmembrane Receptor CD36.

    PubMed

    Tarhda, Zineb; Semlali, Oussama; Kettani, Anas; Moussa, Ahmed; Abumrad, Nada A; Ibrahimi, Azeddine

    2013-01-01

    CD36 is an integral membrane protein which is thought to have a hairpin-like structure with alpha-helices at the C and N termini projecting through the membrane as well as a larger extracellular loop. This receptor interacts with a number of ligands including oxidized low density lipoprotein and long chain fatty acids (LCFAs). It is also implicated in lipid metabolism and heart diseases. It is therefore important to determine the 3D structure of the CD36 site involved in lipid binding. In this study, we predict the 3D structure of the fatty acid (FA) binding site [127-279 aa] of the CD36 receptor based on homology modeling with the X-ray structure of Human Muscle Fatty Acid Binding Protein (PDB code: 1HMT). Qualitative and quantitative analysis of the resulting model suggests that this model was reliable and stable, taking into consideration over 97.8% of the residues in the most favored regions as well as the significant overall quality factor. Protein analysis, which relied on the secondary structure prediction of the target sequence and the comparison of 1HMT and CD36 [127-279 aa] secondary structures, led to the determination of the amino acid sequence consensus. These results also led to the identification of the functional sites on CD36 and revealed the presence of residues which may play a major role during ligand-protein interactions.

  10. Three Dimensional Structure Prediction of Fatty Acid Binding Site on Human Transmembrane Receptor CD36

    PubMed Central

    Tarhda, Zineb; Semlali, Oussama; Kettani, Anas; Moussa, Ahmed; Abumrad, Nada A.; Ibrahimi, Azeddine

    2013-01-01

    CD36 is an integral membrane protein which is thought to have a hairpin-like structure with alpha-helices at the C and N termini projecting through the membrane as well as a larger extracellular loop. This receptor interacts with a number of ligands including oxidized low density lipoprotein and long chain fatty acids (LCFAs). It is also implicated in lipid metabolism and heart diseases. It is therefore important to determine the 3D structure of the CD36 site involved in lipid binding. In this study, we predict the 3D structure of the fatty acid (FA) binding site [127–279 aa] of the CD36 receptor based on homology modeling with the X-ray structure of Human Muscle Fatty Acid Binding Protein (PDB code: 1HMT). Qualitative and quantitative analysis of the resulting model suggests that this model was reliable and stable, taking into consideration over 97.8% of the residues in the most favored regions as well as the significant overall quality factor. Protein analysis, which relied on the secondary structure prediction of the target sequence and the comparison of 1HMT and CD36 [127–279 aa] secondary structures, led to the determination of the amino acid sequence consensus. These results also led to the identification of the functional sites on CD36 and revealed the presence of residues which may play a major role during ligand-protein interactions. PMID:24348024

  11. Identification of Nucleic Acid High Affinity Binding Sequences of Proteins by SELEX.

    PubMed

    Bouvet, Philippe

    2015-01-01

    A technique is described for the identification of nucleic acid sequences bound with high affinity by proteins or by other molecules suitable for a partitioning assay. Here, a histidine-tagged protein is allowed to interact with a pool of nucleic acids and the protein-nucleic acid complexes formed are retained on a Ni-NTA matrix. Nucleic acids with a low level of recognition by the protein are washed away. The pool of recovered nucleic acids is amplified by the polymerase chain reaction and is submitted to further rounds of selection. Each round of selection increases the proportion of sequences that are avidly bound by the protein of interest. The cloning and sequencing of these sequences finally completes their identification.

  12. Identification of nucleic acid high-affinity binding sequences of proteins by SELEX.

    PubMed

    Bouvet, Philippe

    2009-01-01

    A technique is described for the identification of nucleic acid sequences bound with high affinity by proteins or by other molecules suitable for a partitioning assay. Here, a histidine-tagged protein is allowed to interact with a pool of nucleic acids and the protein-nucleic acid complexes formed are retained on a Ni-NTA matrix. Nucleic acids with a low level of recognition by the protein are washed away. The pool of recovered nucleic acids is amplified by the polymerase chain reaction and is submitted to further rounds of selection. Each round of selection increases the proportion of sequences that are avidly bound by the protein of interest. The cloning and sequencing of these sequences finally completes their identification.

  13. Trichomonas vaginalis acidic phospholipase A2: isolation and partial amino acid sequence.

    PubMed

    Escobedo-Guajardo, Brenda L; González-Salazar, Francisco; Palacios-Corona, Rebeca; Torres de la Cruz, Víctor M; Morales-Vallarta, Mario; Mata-Cárdenas, Benito D; Garza-González, Jesús N; Rivera-Silva, Gerardo; Vargas-Villarreal, Javier

    2013-12-01

    Sexually transmitted diseases are a major cause of acute disease worldwide, and trichomoniasis is the most common curable one, generating more than 170 million cases annually worldwide. Trichomonas vaginalis is the causal agent of trichomoniasis and has the ability to destroy in vitro cell monolayers of the vaginal mucosa, where the phospholipases A2 (PLA2) have been reported as potential virulence factors. These enzymes have been partially characterized from the subcellular fraction S30 of pathogenic T. vaginalis strains. The main objective of this study was to purify a phospholipase A2 from T. vaginalis, partially characterize it, obtain a partial amino acid sequence, and determine its enzymatic participation as a hemolytic factor causing lysis of erythrocytes. Trichomonas S30, RF30 and UFF30 sub-fractions from the GT-15 strain have the capacity to hydrolyze [2-(14)C-PA]-PC at pH 6.0. Proteins from the UFF30 sub-fraction were separated by affinity chromatography into two eluted fractions with detectable PLA2 activity. The EDTA-eluted fraction was analyzed by on-line HPLC-tandem mass spectrometry and two protein peaks were observed at 8.2 and 13 kDa. Peptide sequences were identified from the proteins present in the eluted EDTA UFF30 fraction; bioinformatic analysis using Protein Link Global Server charged with T. vaginalis protein database suggests that eluted peptides correspond to a putative ubiquitin protein in the 8.2 kDa fraction and a phospholipase preserved in the 13 kDa fraction. The EDTA-eluted fraction hydrolyzed [2-(14)C-PA]-PC and lysed erythrocytes from Sprague-Dawley rats in a time- and dose-dependent manner. The acidic hemolytic activity decreased by 84% with the addition of 100 μM of Rosenthal's inhibitor.

  14. Improved Prediction of Non-methylated Islands in Vertebrates Highlights Different Characteristic Sequence Patterns

    PubMed Central

    Vingron, Martin

    2016-01-01

    Non-methylated islands (NMIs) of DNA are genomic regions that are important for gene regulation and development. A recent study of genome-wide non-methylation data in vertebrates by Long et al. (eLife 2013;2:e00348) has shown that many experimentally identified non-methylated regions do not overlap with classically defined CpG islands which are computationally predicted using simple DNA sequence features. This is especially true in cold-blooded vertebrates such as Danio rerio (zebrafish). In order to investigate how predictive DNA sequence is of a region’s methylation status, we applied a supervised learning approach using a spectrum kernel support vector machine, to see if a more complex model and supervised learning can be used to improve non-methylated island prediction and to understand the sequence properties of these regions. We demonstrate that DNA sequence is highly predictive of methylation status, and that in contrast to existing CpG island prediction methods our method is able to provide more useful predictions of NMIs genome-wide in all vertebrate organisms that were studied. Our results also show that in cold-blooded vertebrates (Anolis carolinensis, Xenopus tropicalis and Danio rerio) where genome-wide classical CpG island predictions consist primarily of false positives, longer primarily AT-rich DNA sequence features are able to identify these regions much more accurately. PMID:27984582
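
    A spectrum kernel compares two sequences by the inner product of their k-mer count vectors. The sketch below shows only the kernel computation; the value of k, any normalisation, and the SVM training used in the study are not given in the abstract, so k=3 and the function names are illustrative assumptions.

```python
from collections import Counter


def kmer_spectrum(seq, k):
    """Count every overlapping k-mer in the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(seq_a, seq_b, k=3):
    """Inner product of the two k-mer count vectors."""
    spec_a = kmer_spectrum(seq_a, k)
    spec_b = kmer_spectrum(seq_b, k)
    # A Counter returns 0 for k-mers absent from seq_b.
    return sum(count * spec_b[kmer] for kmer, count in spec_a.items())


print(spectrum_kernel("ACGTACGT", "CGTACGTA", k=3))
```

    A matrix of such values over all training sequences could be supplied to an SVM implementation that accepts precomputed kernels, which lets the classifier work on sequence composition without explicit feature selection.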

  15. Frequencies of amino acid strings in globular protein sequences indicate suppression of blocks of consecutive hydrophobic residues

    PubMed Central

    Schwartz, Russell; Istrail, Sorin; King, Jonathan

    2001-01-01

    Patterns of hydrophobic and hydrophilic residues play a major role in protein folding and function. Long, predominantly hydrophobic strings of 20–22 amino acids each are associated with transmembrane helices and have been used to identify such sequences. Much less attention has been paid to hydrophobic sequences within globular proteins. In prior work on computer simulations of the competition between on-pathway folding and off-pathway aggregate formation, we found that long sequences of consecutive hydrophobic residues promoted aggregation within the model, even controlling for overall hydrophobic content. We report here on an analysis of the frequencies of different lengths of contiguous blocks of hydrophobic residues in a database of amino acid sequences of proteins of known structure. Sequences of three or more consecutive hydrophobic residues are found to be significantly less common in actual globular proteins than would be predicted if residues were selected independently. The result may reflect selection against long blocks of hydrophobic residues within globular proteins relative to what would be expected if residue hydrophobicities were independent of those of nearby residues in the sequence. PMID:11316883
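
    The comparison described above, observed block lengths against what independent residue selection would predict, can be sketched as follows. The hydrophobic residue set is a hypothetical choice for illustration, and the expectation formula is a simple independence model for a linear chain; the authors' exact classification and statistical test are not given in the abstract.

```python
from itertools import groupby

# Assumption: an illustrative hydrophobic set, not the paper's definition.
HYDROPHOBIC = set("AVLIMFWC")


def hydrophobic_run_lengths(seq):
    """Lengths of maximal blocks of consecutive hydrophobic residues."""
    return [len(list(group))
            for is_h, group in groupby(seq, key=lambda aa: aa in HYDROPHOBIC)
            if is_h]


def expected_runs(n, p, length):
    """Expected number of maximal hydrophobic blocks of exactly `length`
    in a chain of n residues, each hydrophobic independently with
    probability p."""
    if length > n:
        return 0.0
    if length == n:
        return p ** n
    q = 1.0 - p
    interior = (n - length - 1) * q * q * p ** length  # flanked on both sides
    ends = 2 * q * p ** length                         # block touches one end
    return interior + ends


print(hydrophobic_run_lengths("AVLKKIGFFW"))
```

    Summing `expected_runs` over lengths of three or more and comparing it with the observed count across a sequence database gives the kind of contrast the abstract reports, with p estimated from the overall hydrophobic fraction.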

  16. Sequence-Specific Covalent Capture Coupled with High-Contrast Nanopore Detection of a Disease-Derived Nucleic Acid Sequence.

    PubMed

    Nejad, Maryam Imani; Shi, Ruicheng; Zhang, Xinyue; Gu, Li-Qun; Gates, Kent S

    2017-07-18

    Hybridization-based methods for the detection of nucleic acid sequences are important in research and medicine. Short probes provide sequence specificity, but do not always provide a durable signal. Sequence-specific covalent crosslink formation can anchor probes to target DNA and might also provide an additional layer of target selectivity. Here, we developed a new crosslinking reaction for the covalent capture of specific nucleic acid sequences. This process involved reaction of an abasic (Ap) site in a probe strand with an adenine residue in the target strand and was used for the detection of a disease-relevant T→A mutation at position 1799 of the human BRAF kinase gene sequence. Ap-containing probes were easily prepared and displayed excellent specificity for the mutant sequence under isothermal assay conditions. It was further shown that nanopore technology provides a high-contrast (in essence, digital) signal that enables sensitive, single-molecule sensing of the cross-linked duplexes. © 2017 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim.

  17. Identification of random nucleic acid sequence aberrations using dual capture probes which hybridize to different chromosome regions

    DOEpatents

    Lucas, Joe N.; Straume, Tore; Bogen, Kenneth T.

    1998-01-01

    A method is provided for detecting nucleic acid sequence aberrations using two immobilization steps. According to the method, a nucleic acid sequence aberration is detected by detecting nucleic acid sequences having both a first nucleic acid sequence type (e.g., from a first chromosome) and a second nucleic acid sequence type (e.g., from a second chromosome), the presence of the first and the second nucleic acid sequence type on the same nucleic acid sequence indicating the presence of a nucleic acid sequence aberration. In the method, immobilization of a first hybridization probe is used to isolate a first set of nucleic acids in the sample which contain the first nucleic acid sequence type. Immobilization of a second hybridization probe is then used to isolate a second set of nucleic acids from within the first set of nucleic acids which contain the second nucleic acid sequence type. The second set of nucleic acids are then detected, their presence indicating the presence of a nucleic acid sequence aberration.

  18. Identification of random nucleic acid sequence aberrations using dual capture probes which hybridize to different chromosome regions

    DOEpatents

    Lucas, J.N.; Straume, T.; Bogen, K.T.

    1998-03-24

    A method is provided for detecting nucleic acid sequence aberrations using two immobilization steps. According to the method, a nucleic acid sequence aberration is detected by detecting nucleic acid sequences having both a first nucleic acid sequence type (e.g., from a first chromosome) and a second nucleic acid sequence type (e.g., from a second chromosome), the presence of the first and the second nucleic acid sequence type on the same nucleic acid sequence indicating the presence of a nucleic acid sequence aberration. In the method, immobilization of a first hybridization probe is used to isolate a first set of nucleic acids in the sample which contain the first nucleic acid sequence type. Immobilization of a second hybridization probe is then used to isolate a second set of nucleic acids from within the first set of nucleic acids which contain the second nucleic acid sequence type. The second set of nucleic acids are then detected, their presence indicating the presence of a nucleic acid sequence aberration. 14 figs.

  19. Affinity regression predicts the recognition code of nucleic acid binding proteins

    PubMed Central

    Pelossof, Raphael; Singh, Irtisha; Yang, Julie L.; Weirauch, Matthew T.; Hughes, Timothy R.; Leslie, Christina S.

    2016-01-01

    Predicting the affinity profiles of nucleic acid-binding proteins directly from the protein sequence is a major unsolved problem. We present a statistical approach for learning the recognition code of a family of transcription factors (TFs) or RNA-binding proteins (RBPs) from high-throughput binding assays. Our method, called affinity regression, trains on protein binding microarray (PBM) or RNAcompete experiments to learn an interaction model between proteins and nucleic acids, using only protein domain and probe sequences as inputs. By training on mouse homeodomain PBM profiles, our model correctly identifies residues that confer DNA-binding specificity and accurately predicts binding motifs for an independent set of divergent homeodomains. Similarly, learning from RNAcompete profiles for diverse RBPs, our model can predict the binding affinities of held-out proteins and identify key RNA-binding residues. More broadly, we envision applying our method to model and predict biological interactions in any setting where there is a high-throughput ‘affinity’ readout. PMID:26571099

  20. Genomic prediction using imputed whole-genome sequence data in Holstein Friesian cattle.

    PubMed

    van Binsbergen, Rianne; Calus, Mario P L; Bink, Marco C A M; van Eeuwijk, Fred A; Schrooten, Chris; Veerkamp, Roel F

    2015-09-17

    In contrast to currently used single nucleotide polymorphism (SNP) panels, the use of whole-genome sequence data is expected to enable the direct estimation of the effects of causal mutations on a given trait. This could lead to higher reliabilities of genomic predictions compared to those based on SNP genotypes. Also, at each generation of selection, recombination events between a SNP and a mutation can cause decay in reliability of genomic predictions based on markers rather than on the causal variants. Our objective was to investigate the use of imputed whole-genome sequence genotypes versus high-density SNP genotypes on (the persistency of) the reliability of genomic predictions using real cattle data. Highly accurate phenotypes based on daughter performance and Illumina BovineHD Beadchip genotypes were available for 5503 Holstein Friesian bulls. The BovineHD genotypes (631,428 SNPs) of each bull were used to impute whole-genome sequence genotypes (12,590,056 SNPs) using the Beagle software. Imputation was done using a multi-breed reference panel of 429 sequenced individuals. Genomic estimated breeding values for three traits were predicted using a Bayesian stochastic search variable selection (BSSVS) model and a genome-enabled best linear unbiased prediction model (GBLUP). Reliabilities of predictions were based on 2087 validation bulls, while the other 3416 bulls were used for training. Prediction reliabilities ranged from 0.37 to 0.52. BSSVS performed better than GBLUP in all cases. Reliabilities of genomic predictions were slightly lower with imputed sequence data than with BovineHD chip data. Also, the reliabilities tended to be lower for both sequence data and BovineHD chip data when relationships between training animals were low. No increase in persistency of prediction reliability using imputed sequence data was observed. Compared to BovineHD genotype data, using imputed sequence data for genomic prediction produced no advantage. To investigate the

  1. The amino acid sequence of protein CM-3 from Dendroaspis polylepis polylepis (black mamba) venom.

    PubMed

    Joubert, F J

    1985-01-01

    Protein CM-3 from Dendroaspis polylepis polylepis venom was purified by gel filtration and ion exchange chromatography. It comprises 65 amino acids including eight half-cystines. The complete amino acid sequence of protein CM-3 has been elucidated. The sequence (residues 1-50) resembles that of the N-terminal sequence of the subunits of a synergistic type protein and residues 51-65 that of the C-terminal sequence of an angusticeps type protein. Mixtures of protein CM-3 and angusticeps type proteins showed no apparent synergistic effect, in that their toxicity in combination was no greater than the sum of their individual toxicities.

  2. The amino acid sequences of the Fd fragments of two human γ heavy chains

    PubMed Central

    Press, E. M.; Hogg, N. M.

    1970-01-01

    The amino acid sequences of the Fd fragments of two human pathological immunoglobulins of the immunoglobulin G1 class are reported. Comparison of the two sequences shows that the heavy-chain variable regions are similar in length to those of the light chains. The existence of heavy chain variable region subgroups is also deduced, from a comparison of these two sequences with those of another γ 1 chain, Eu, a μ chain, Ou, and the partial sequence of a fourth γ 1 chain, Ste. Carbohydrate has been found to be linked to an aspartic acid residue in the variable region of one of the γ 1 chains, Cor. PMID:5449120

  3. Developmental variation and amino acid sequences of cytochromes c of the fruit fly Drosophila melanogaster and the flesh fly Boettcherisca peregrina.

    PubMed

    Inoue, S; Inoue, H; Hiroyoshi, T; Matsubara, H; Yamanaka, T

    1986-10-01

    The amino acid sequences of cytochromes c purified from the fruit fly Drosophila melanogaster and the flesh fly Boettcherisca peregrina were determined. In contrast with the case of the housefly, isocytochromes c were not detected in these flies at any developmental stage. The sequence of fruit fly cytochrome c differed from that reported previously but was identical with that predicted from the nucleotide sequence of the fruit fly cytochrome c gene (DC4) (Limbach, K.J. & Wu, R. (1985) Nucl. Acids Res. 13, 631-644). Isocytochrome c of the fruit fly, reported to be encoded by the DC3 gene, was not detected as a functional cytochrome c molecule.

  4. Cloning and sequencing of the medium-chain S-acyl fatty acid synthetase thioester hydrolase cDNA from rat mammary gland.

    PubMed Central

    Naggert, J; Williams, B; Cashman, D P; Smith, S

    1987-01-01

    cDNA clones coding for the medium-chain S-acyl fatty acid synthetase thioester hydrolase (thioesterase II) from rat mammary gland were identified in a bacteriophage lambda gt11 library and their nucleotide sequences were determined. The predicted coding region spans 263 amino acid residues and includes a sequence identical with that of a peptide derived from the enzyme active site. The rat thioesterase II cDNA sequence exhibits homology with that of a thioesterase found in duck uropygial glands. PMID:3632637

  5. LIPPRED: A web server for accurate prediction of lipoprotein signal sequences and cleavage sites

    PubMed Central

    Taylor, Paul D; Toseland, Christopher P; Attwood, Teresa K; Flower, Darren R

    2006-01-01

    Bacterial lipoproteins have many important functions and represent a class of possible vaccine candidates. The prediction of lipoproteins from sequence is thus an important task for computational vaccinology. Naïve-Bayesian networks were trained to identify SpaseII cleavage sites and their preceding signal sequences using a set of 199 distinct lipoprotein sequences. A comprehensive range of sequence models was used to identify the best model for lipoprotein signal sequences. The best performing sequence model was found to be 10 residues in length, including the conserved cysteine lipid attachment site and the nine residues prior to it. The sensitivity of prediction for LipPred was 0.979, while the specificity was 0.742. Here, we describe LipPred, a web server for lipoprotein prediction, available at the URL: http://www.jenner.ac.uk/LipPred/. LipPred is the most accurate method available for the detection of SpaseII-cleaved lipoprotein signal sequences and the prediction of their cleavage sites. PMID:17597883

  6. Sequence-based prediction of single nucleosome positioning and genome-wide nucleosome occupancy.

    PubMed

    van der Heijden, Thijn; van Vugt, Joke J F A; Logie, Colin; van Noort, John

    2012-09-18

    Nucleosome positioning dictates eukaryotic DNA compaction and access. To predict nucleosome positions in a statistical mechanics model, we exploited the knowledge that nucleosomes favor DNA sequences with specific periodically occurring dinucleotides. Our model is the first to capture both dyad position within a few base pairs, and free binding energy within 2 k(B)T, for all the known nucleosome positioning sequences. By applying Percus's equation to the derived energy landscape, we isolate sequence effects on genome-wide nucleosome occupancy from other factors that may influence nucleosome positioning. For both in vitro and in vivo systems, three parameters suffice to predict nucleosome occupancy, with correlation coefficients of 0.74 and 0.66, respectively. As predicted, we find the largest deviations in vivo around transcription start sites. This relatively simple algorithm can be used to guide future studies on the influence of DNA sequence on chromatin organization.

  7. The Chinese hamster Alu-equivalent sequence: a conserved highly repetitious, interspersed deoxyribonucleic acid sequence in mammals has a structure suggestive of a transposable element.

    PubMed Central

    Haynes, S R; Toomey, T P; Leinwand, L; Jelinek, W R

    1981-01-01

    A consensus sequence has been determined for a major interspersed deoxyribonucleic acid repeat in the genome of Chinese hamster ovary cells (CHO cells). This sequence is extensively homologous to (i) the human Alu sequence (P. L. Deininger et al., J. Mol. Biol., in press), (ii) the mouse B1 interspersed repetitious sequence (Krayev et al., Nucleic Acids Res. 8:1201-1215, 1980) (iii) an interspersed repetitious sequence from African green monkey deoxyribonucleic acid (Dhruva et al., Proc. Natl. Acad. Sci. U.S.A. 77:4514-4518, 1980) and (iv) the CHO and mouse 4.5S ribonucleic acid (this report; F. Harada and N. Kato, Nucleic Acids Res. 8:1273-1285, 1980). Because the CHO consensus sequence shows significant homology to the human Alu sequence it is termed the CHO Alu-equivalent sequence. A conserved structure surrounding CHO Alu-equivalent family members can be recognized. It is similar to that surrounding the human Alu and the mouse B1 sequences, and is represented as follows: direct repeat-CHO-Alu-A-rich sequence-direct repeat. A composite interspersed repetitious sequence has been identified. Its structure is represented as follows: direct repeat-residue 47 to 107 of CHO-Alu-non-Alu repetitious sequence-A-rich sequence-direct repeat. Because the Alu flanking sequences resemble those that flank known transposable elements, we think it likely that the Alu sequence dispersed throughout the mammalian genome by transposition. PMID:9279371

  8. Theoretical prediction of binding modes and hot sequences for allopsoralen DNA interaction

    NASA Astrophysics Data System (ADS)

    Méndez, Patricia Saenz; Guedes, Rita C.; dos Santos, Daniel J. V. A.; Eriksson, Leif A.

    2007-12-01

    Molecular docking studies of two duplex DNA sequences as target fragments and allopsoralen as ligand were performed. The calculated interaction energies showed that the ligand can be docked into the minor groove as well as become intercalated. However, unlike psoralen, the preferred binding mode of allopsoralen for non-poly-TA sequences is minor groove binding. Calculated energies for intercalation between different base pairs suggest that the predicted sequence selectivity for allopsoralen is analogous to that observed for psoralen. Intercalation is favored in 5'-TpA sites in poly-TA sequences.

  9. FTIR spectroscopy and sequence prediction: Structure of human α2-macroglobulin

    NASA Astrophysics Data System (ADS)

    Dukor, Rina K.; Liebman, Michael N.; Yuan, Anna I.; Feinman, Richard D.

    1998-06-01

    The structure of a plasma proteinase inhibitor, α2-macroglobulin (α2M), is determined by FTIR spectroscopy and a number of sequence-structure prediction algorithms. In addition, α2M dimers and complexes with methylamine and trypsin are examined. Our FTIR results estimate a helix content of 5-15% and a β-sheet content of 28-36%. None of the sequence prediction algorithms used in this study predicted values close to experimental data. Considerable differences in the FTIR spectra of α2M dimer are observed and somewhat smaller changes are seen upon reaction of α2M with methylamine and dithiodipyridine (DTP).

  10. The amino acid sequence of goat beta-lactoglobulin.

    PubMed

    Préaux, G; Braunitzer, G; Schrank, B; Stangl, A

    1979-11-01

    The isolation of beta-lactoglobulin from milk of the goat is described. The purified protein was checked for purity and has been characterized by its gross composition and end groups. The native or the modified protein was then degraded by tryptic and cyanogen bromide cleavage. The cleavage products were isolated and sequenced in the sequenator using a Quadrol and propyne program. These data provide the complete sequence of beta-lactoglobulin of the goat. The results are discussed and compared particularly with bovine beta-lactoglobulin components AB. Some biological aspects are described.

  11. Layered materials with coexisting acidic and basic sites for catalytic one-pot reaction sequences.

    PubMed

    Motokura, Ken; Tada, Mizuki; Iwasawa, Yasuhiro

    2009-06-17

    Acidic montmorillonite-immobilized primary amines (H-mont-NH(2)) were found to be excellent acid-base bifunctional catalysts for one-pot reaction sequences; they are the first materials with coexisting acid and base sites active for acid-base tandem reactions. For example, tandem deacetalization-Knoevenagel condensation proceeded successfully with the H-mont-NH(2), affording the corresponding condensation product in a quantitative yield. The acidity of the H-mont-NH(2) was strongly influenced by the preparation solvent, and the base-catalyzed reactions were enhanced by interlayer acid sites.

  12. Using Whole-Genome Sequence Data to Predict Quantitative Trait Phenotypes in Drosophila melanogaster

    PubMed Central

    Ober, Ulrike; Ayroles, Julien F.; Stone, Eric A.; Richards, Stephen; Zhu, Dianhui; Gibbs, Richard A.; Stricker, Christian; Gianola, Daniel; Schlather, Martin; Mackay, Trudy F. C.; Simianer, Henner

    2012-01-01

    Predicting organismal phenotypes from genotype data is important for plant and animal breeding, medicine, and evolutionary biology. Genomic-based phenotype prediction has been applied for single-nucleotide polymorphism (SNP) genotyping platforms, but not using complete genome sequences. Here, we report genomic prediction for starvation stress resistance and startle response in Drosophila melanogaster, using ∼2.5 million SNPs determined by sequencing the Drosophila Genetic Reference Panel population of inbred lines. We constructed a genomic relationship matrix from the SNP data and used it in a genomic best linear unbiased prediction (GBLUP) model. We assessed predictive ability as the correlation between predicted genetic values and observed phenotypes by cross-validation, and found a predictive ability of 0.239±0.008 (0.230±0.012) for starvation resistance (startle response). The predictive ability of BayesB, a Bayesian method with internal SNP selection, was not greater than GBLUP. Selection of the 5% SNPs with either the highest absolute effect or variance explained did not improve predictive ability. Predictive ability decreased only when fewer than 150,000 SNPs were used to construct the genomic relationship matrix. We hypothesize that predictive power in this population stems from the SNP–based modeling of the subtle relationship structure caused by long-range linkage disequilibrium and not from population structure or SNPs in linkage disequilibrium with causal variants. We discuss the implications of these results for genomic prediction in other organisms. PMID:22570636

  13. [Cloning of full-length coding sequence of tree shrew CD4 and prediction of its molecular characteristics].

    PubMed

    Tian, Wei-Wei; Gao, Yue-Dong; Guo, Yan; Huang, Jing-Fei; Xiao, Chang; Li, Zuo-Sheng; Zhang, Hua-Tang

    2012-02-01

    The tree shrew, an animal model that has received extensive attention in human disease research, requires essential research tools, in particular cellular markers and monoclonal antibodies for immunological studies. In this paper, the 1 365-bp full-length CD4 cDNA coding sequence was cloned from total RNA of tree shrew peripheral blood. The sequence fills two unknown gaps in the predicted tree shrew CD4 cDNA in the GenBank database, and its molecular characteristics were compared with those of other mammals using software such as Clustal W2.0. The results showed that the extracellular and intracellular domains of the tree shrew CD4 amino acid sequence are conserved. The tree shrew CD4 amino acid sequence showed a close genetic relationship with those of Homo sapiens and Macaca mulatta. Most regions of the tree shrew CD4 molecular surface were positively charged, as in humans. However, compared with the human CD4 extracellular domain D1, the tree shrew CD4 D1 surface showed more negative charges and two additional N-glycosylation sites, which may affect antibody binding. This study provides a theoretical basis for the preparation and functional study of CD4 monoclonal antibodies.

  14. Computer Simulation of the Determination of Amino Acid Sequences in Polypeptides

    ERIC Educational Resources Information Center

    Daubert, Stephen D.; Sontum, Stephen F.

    1977-01-01

    Describes a computer program that generates a random string of amino acids and guides the student in determining the correct sequence of a given protein by using experimental analytic data for that protein. (MLH)

  16. Synthesis of gamma,delta-unsaturated glycolic acids via sequenced Brook and Ireland-Claisen rearrangements.

    PubMed

    Schmitt, Daniel C; Johnson, Jeffrey S

    2010-03-05

    Organozinc, -magnesium, and -lithium nucleophiles initiate a Brook/Ireland-Claisen rearrangement sequence of allylic silyl glyoxylates resulting in the formation of gamma,delta-unsaturated alpha-silyloxy acids.

  17. Prediction of human rotavirus serotype by nucleotide sequence analysis of the VP7 protein gene.

    PubMed Central

    Green, K Y; Sears, J F; Taniguchi, K; Midthun, K; Hoshino, Y; Gorziglia, M; Nishikawa, K; Urasawa, S; Kapikian, A Z; Chanock, R M

    1988-01-01

    Human rotavirus field isolates were characterized by direct sequence analysis of the gene encoding the serotype-specific major neutralization protein (VP7). Single-stranded RNA transcripts were prepared from virus particles obtained directly from stool specimens or after two or three passages in MA-104 cells. Two regions of the gene (nucleotides 307 through 351 and 670 through 711) which had previously been shown to contain regions of sequence divergence among rotavirus serotypes were sequenced by the dideoxynucleotide method with two different synthetic oligonucleotide primers. The resulting nucleotide sequences were compared with the corresponding sequences from rotaviruses of known serotype (serotype 1, 2, 3, or 4). A total of 25 field isolates and 10 laboratory strains examined by this method exhibited marked sequence identity in both areas of the gene with the corresponding regions of 1 of the 4 reference strains. In addition, the predicted serotype from the sequence analysis correlated in each case with the serotype determined when the rotaviruses were examined by plaque reduction neutralization or reactivity with serotype-specific monoclonal antibodies. These data suggest that as a result of the high degree of sequence conservation observed among rotaviruses of the same serotype, it is possible to predict the serotype of a rotavirus isolate by direct sequence analysis of its VP7 gene. PMID:2833626

  18. A sequence-based two-level method for the prediction of type I secreted RTX proteins.

    PubMed

    Luo, Jiesi; Li, Wenling; Liu, Zhongyu; Guo, Yanzhi; Pu, Xuemei; Li, Menglong

    2015-05-07

    Many Gram-negative bacteria use the type I secretion system (T1SS) to translocate a wide range of substrates (type I secreted RTX proteins, T1SRPs) from the cytoplasm across the inner and outer membrane in one step to the extracellular space. Since T1SRPs play an important role in pathogen-host interactions, identifying them is crucial for a full understanding of the pathogenic mechanism of T1SS. However, experimental identification is often time-consuming and expensive. In the post-genomic era, it becomes imperative to predict new T1SRPs using information from the amino acid sequence alone when new proteins are being identified in a high-throughput mode. In this study, we report a two-level method for the first attempt to identify T1SRPs using sequence-derived features and the random forest (RF) algorithm. At the full-length sequence level, the results show that the unique feature of T1SRPs is the presence of variable numbers of the calcium-binding RTX repeats. These RTX repeats have a strong predictive power and so T1SRPs can be well distinguished from non-T1SRPs. At another level, different from that of the secretion signal, we find that a sequence segment located at the last 20-30 C-terminal amino acids may contain important signal information for T1SRP secretion because obvious differences were shown between the corresponding positions of T1SRPs and non-T1SRPs in terms of amino acid and secondary structure compositions. Using five-fold cross-validation, overall accuracies of 97% at the full-length sequence level and 89% at the secretion signal level were achieved through feature evaluation and optimization. Benchmarking on an independent dataset, our method could correctly predict 63 and 66 of 74 T1SRPs at the full-length sequence and secretion signal levels, respectively. We believe that this study will be useful in elucidating the secretion mechanism of T1SS and facilitating hypothesis-driven experimental design and validation.

  19. Genome sequence of the acid-tolerant strain Rhizobium sp. LPU83.

    PubMed

    Wibberg, Daniel; Tejerizo, Gonzalo Torres; Del Papa, María Florencia; Martini, Carla; Pühler, Alfred; Lagares, Antonio; Schlüter, Andreas; Pistorio, Mariano

    2014-04-20

    Rhizobia are important members of the soil microbiome since they enter into nitrogen-fixing symbiosis with different legume host plants. Rhizobium sp. LPU83 is an acid-tolerant Rhizobium strain featuring a broad-host-range. However, it is ineffective in nitrogen fixation. Here, the improved draft genome sequence of this strain is reported. Genome sequence information provides the basis for analysis of its acid tolerance, symbiotic properties and taxonomic classification.

  20. Using the concept of Chou's pseudo amino acid composition to predict protein solubility: an approach with entropies in information theory.

    PubMed

    Xiaohui, Niu; Nana, Li; Jingbo, Xia; Dingyan, Chen; Yuehua, Peng; Yang, Xiao; Weiquan, Wei; Dongming, Wang; Zengzhen, Wang

    2013-09-07

    Protein solubility plays a major role in proteomics. Predicting the propensity of a protein to be soluble or to form inclusion bodies is a fundamental problem that is not yet well resolved. To predict protein solubility, almost 10,000 protein sequences were downloaded from NCBI. Highly homologous sequences were then removed with CD-HIT, leaving 5692 sequences. Based on the protein sequences, amino acid and dipeptide compositions were extracted to predict protein solubility. In this study, entropy from information theory was introduced as another predictive factor in the model. Experiments involving nine different feature vector combinations, including the above-mentioned three kinds of factors, were conducted with support vector machines (SVMs) as the prediction engine. Each combination was evaluated by a re-substitution test and a 10-fold cross-validation test. According to the evaluation results, the accuracies and Matthews correlation coefficient (MCC) values were boosted by the introduction of the entropy. The best combination was the one with amino acid and dipeptide compositions and their entropies. Its accuracy reached 90.34% with an MCC of 0.7494 in the re-substitution test, and 88.12% with an MCC of 0.7945 in 10-fold cross-validation. In conclusion, the introduction of the entropy significantly improved the performance of the predictive method.
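
    The feature types named in this abstract (amino acid composition, dipeptide composition, and an entropy term) can be illustrated with a minimal sketch. The abstract does not say how the entropy was defined, so the Shannon entropy of each composition vector is used here purely as an assumption; the function names and the 422-element layout are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def aa_composition(seq):
    """Fraction of each standard amino acid (non-standard symbols are ignored)."""
    counts = Counter(seq)
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard residues")
    return [counts[a] / total for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Fraction of each of the 400 ordered residue pairs among adjacent positions."""
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(counts[d] for d in DIPEPTIDES)
    if total == 0:
        raise ValueError("sequence contains no standard dipeptides")
    return [counts[d] / total for d in DIPEPTIDES]


def shannon_entropy(freqs):
    """Shannon entropy in bits; zero-frequency terms contribute nothing."""
    return -sum(p * math.log2(p) for p in freqs if p > 0)


def feature_vector(seq):
    """Hypothetical layout: 20 + 400 composition values followed by their two entropies."""
    aa = aa_composition(seq)
    dp = dipeptide_composition(seq)
    return aa + dp + [shannon_entropy(aa), shannon_entropy(dp)]
```

    A vector built this way could be passed to any SVM implementation; the kernel and parameters used in the study are not given in the abstract.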

  1. The amino acid sequence of monal pheasant lysozyme and its activity.

    PubMed

    Araki, T; Matsumoto, T; Torikata, T

    1998-10-01

    The amino acid sequence of monal pheasant lysozyme and its activity were analyzed. Carboxymethylated lysozyme was digested with trypsin and the resulting peptides were sequenced. The established amino acid sequence had one amino acid substitution at position 102 (Arg to Gly) compared with Indian peafowl lysozyme and four amino acid substitutions at positions 3 (Phe to Tyr), 15 (His to Leu), 41 (Gln to His), and 121 (Gln to His) compared with chicken lysozyme. Analysis of the time-courses of reaction using N-acetylglucosamine pentamer as a substrate showed a difference of binding free energy change (-0.4 kcal/mol) at subsites A between monal pheasant and Indian peafowl lysozyme. This was assumed to be caused by the amino acid substitution at subsite A with loss of a positive charge at position 102 (Arg102 to Gly).

  2. Predicting backbone Cα angles and dihedrals from protein sequences by stacked sparse auto-encoder deep neural network.

    PubMed

    Lyons, James; Dehzangi, Abdollah; Heffernan, Rhys; Sharma, Alok; Paliwal, Kuldip; Sattar, Abdul; Zhou, Yaoqi; Yang, Yuedong

    2014-10-30

    Because of the nearly constant distance between two neighbouring Cα atoms, the local backbone structure of proteins can be represented accurately by the angle between C(αi-1)-C(αi)-C(αi+1) (θ) and a dihedral angle rotated about the C(αi)-C(αi+1) bond (τ). θ and τ angles, as representatives of the structural properties of three to four amino-acid residues, offer a description of backbone conformations that is complementary to φ and ψ angles (single residue) and secondary structures (>3 residues). Here, we report the first machine-learning technique for sequence-based prediction of θ and τ angles. Predicted angles based on an independent test have a mean absolute error of 9° for θ and 34° for τ with a distribution on the θ-τ plane close to that of native values. The average root-mean-square distance of 10-residue fragment structures constructed from predicted θ and τ angles is only 1.9Å from their corresponding native structures. Predicted θ and τ angles are expected to be complementary to predicted φ and ψ angles and secondary structures for use in model validation and template-based as well as template-free structure prediction. The deep neural network learning technique is available as an on-line server called Structural Property prediction with Integrated DEep neuRal network (SPIDER) at http://sparks-lab.org.
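
    The geometric definitions of θ and τ given in this abstract can be made concrete with a short sketch that computes both angles from Cα coordinates. This only illustrates the definitions and is not the SPIDER predictor; the function names are hypothetical, and the sign of τ follows the common atan2 dihedral formula, which is an assumption about the convention used in the paper.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _norm(a):
    return math.sqrt(_dot(a, a))


def theta(p0, p1, p2):
    """Angle in degrees at p1 formed by three consecutive C-alpha atoms."""
    u, v = _sub(p0, p1), _sub(p2, p1)
    cos_t = _dot(u, v) / (_norm(u) * _norm(v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))


def tau(p0, p1, p2, p3):
    """Dihedral in degrees about the p1-p2 bond, in the range (-180, 180]."""
    b0, b1, b2 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b0, b1), _cross(b1, b2)
    y = _norm(b1) * _dot(b0, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def backbone_angles(ca):
    """Return (thetas, taus) for a list of C-alpha coordinates given as 3-tuples."""
    thetas = [theta(ca[i - 1], ca[i], ca[i + 1]) for i in range(1, len(ca) - 1)]
    taus = [tau(ca[i - 1], ca[i], ca[i + 1], ca[i + 2]) for i in range(1, len(ca) - 2)]
    return thetas, taus
```

    For a chain of N residues this yields N-2 values of θ and N-3 values of τ, consistent with the abstract's remark that the angles describe three to four residues at a time.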

  3. Single-chain structure of human ceruloplasmin: the complete amino acid sequence of the whole molecule.

    PubMed Central

    Takahashi, N; Ortel, T L; Putnam, F W

    1984-01-01

    We have determined the amino acid sequence of the amino-terminal 67,000-dalton (67-kDa) fragment of human ceruloplasmin and have established overlapping sequences between the 67-kDa and 50-kDa fragments and between the 50-kDa and 19-kDa fragments. The 67-kDa fragment contains 480 amino acid residues and three glucosamine oligosaccharides. These results together with our previous sequence data for the 50-kDa and 19-kDa fragments complete the amino acid sequence of human ceruloplasmin. The polypeptide chain has a total of 1,046 amino acid residues (Mr 120,085) and has attachment sites for four glucosamine oligosaccharides; together these account for the total molecular mass of human ceruloplasmin (132 kDa). The sequence analysis of the peptides overlapping the fragments showed that one additional amino acid, arginine, is present between the 67-kDa and 50-kDa fragments, and another, lysine, is between the 50-kDa and 19-kDa fragments. Only two apparent sites of amino acid interchange have been identified in the polypeptide chain. Both involve a single-point interchange of glycine and lysine that would result in a difference in charge. The results of the complete sequence analysis verified that human ceruloplasmin is composed of a single polypeptide chain and that the subunit-like fragments are produced by proteolytic cleavage during purification (and possibly also in vivo). PMID:6582496

  4. Myoglobin of the shark Heterodontus portusjacksoni: isolation and amino acid sequence.

    PubMed

    Fisher, W K; Thompson, E O

    1979-06-01

    Myoglobin isolated from red muscle of the shark H. portusjacksoni was purified by ion-exchange chromatography on sulfopropyl-Sephadex and gel-filtration. Amino acid analysis and sequence determination showed 148 amino acid residues. The amino terminal residue is acetylated as shown by mass spectrographic analysis of N-terminal peptides. There is a deletion of four residues at the amino terminal end as well as one residue in the CD interhelical area relative to other myoglobins. The complete amino acid sequence has been determined following digestion with trypsin, chymotrypsin, pepsin and staphylococcal protease. Sequences of the purified peptides were determined by the dansyl-Edman procedure. The amino acid sequence showed approximately 85 differences from mammalian, monotreme and bird myoglobins. The date of divergence of the shark H. portusjacksoni from these other orders was estimated at 450 +/- 16 million years, based on the number of amino acid differences between species and allowing for multiple mutations during the evolutionary period. This estimate agrees well with similar estimates made using alpha- and beta-globin sequences, in contrast to widely differing estimates of dates of divergence for monotremes using the same three globin chains. Compared with myoglobins from species previously studied, there are many more differences in amino acid sequences, and in many positions residues are found that are more characteristic of alpha- and beta-globins, suggesting a conservation of residues over a long period of evolutionary time. There are fewer stabilizing hydrogen bonds and salt-linkages than in other myoglobins.

  5. Multiple Genome Sequences of Important Beer-Spoiling Lactic Acid Bacteria.

    PubMed

    Geissler, Andreas J; Behr, Jürgen; Vogel, Rudi F

    2016-10-06

    Seven strains of important beer-spoiling lactic acid bacteria were sequenced using single-molecule real-time sequencing. Complete genomes were obtained for strains of Lactobacillus paracollinoides, Lactobacillus lindneri, and Pediococcus claussenii. The analysis of these genomes emphasizes the role of plasmids as the genomic foundation of beer-spoiling ability.

  6. Multiple Genome Sequences of Important Beer-Spoiling Lactic Acid Bacteria

    PubMed Central

    Geissler, Andreas J.; Vogel, Rudi F.

    2016-01-01

    Seven strains of important beer-spoiling lactic acid bacteria were sequenced using single-molecule real-time sequencing. Complete genomes were obtained for strains of Lactobacillus paracollinoides, Lactobacillus lindneri, and Pediococcus claussenii. The analysis of these genomes emphasizes the role of plasmids as the genomic foundation of beer-spoiling ability. PMID:27795248

  7. EST-PAC a web package for EST annotation and protein sequence prediction.

    PubMed

    Strahm, Yvan; Powell, David; Lefèvre, Christophe

    2006-10-12

    With the decreasing cost of DNA sequencing technology and the vast diversity of biological resources, researchers increasingly face the basic challenge of annotating a larger number of expressed sequence tags (ESTs) from a variety of species. This typically consists of a series of repetitive tasks, which should be automated and easy to use. The results of these annotation tasks need to be stored and organized in a consistent way. All these operations should be self-installing, platform independent, easy to customize and amenable to using distributed bioinformatics resources available on the Internet. In order to address these issues, we present EST-PAC, a web-oriented multi-platform software package for expressed sequence tag (EST) annotation. EST-PAC provides a solution for the administration of EST and protein sequence annotations accessible through a web interface. Three aspects of EST annotation are automated: 1) searching local or remote biological databases for sequence similarities using Blast services, 2) predicting protein coding sequence from EST data, and 3) annotating predicted protein sequences with functional domain predictions. In practice, EST-PAC integrates the BLASTALL suite, EST-Scan2 and HMMER in a relational database system accessible through a simple web interface. EST-PAC also takes advantage of the relational database to allow consistent storage, powerful queries of results, and management of the annotation process. The system allows users to customize annotation strategies and provides an open-source data-management environment for research and education in bioinformatics.

  8. Complete genome sequence of Enterococcus mundtii QU 25, an efficient L-(+)-lactic acid-producing bacterium.

    PubMed

    Shiwa, Yuh; Yanase, Hiroaki; Hirose, Yuu; Satomi, Shohei; Araya-Kojima, Tomoko; Watanabe, Satoru; Zendo, Takeshi; Chibazakura, Taku; Shimizu-Kadota, Mariko; Yoshikawa, Hirofumi; Sonomoto, Kenji

    2014-08-01

    Enterococcus mundtii QU 25, a non-dairy bacterial strain of ovine faecal origin, can ferment both cellobiose and xylose to produce L-lactic acid. The use of this strain is highly desirable for economical L-lactate production from renewable biomass substrates. Genome sequence determination is necessary for the genetic improvement of this strain. We report the complete genome sequence of strain QU 25, primarily determined using Pacific Biosciences sequencing technology. The E. mundtii QU 25 genome comprises a 3 022 186-bp single circular chromosome (GC content, 38.6%) and five circular plasmids: pQY182, pQY082, pQY039, pQY024, and pQY003. In all, 2900 protein-coding sequences, 63 tRNA genes, and 6 rRNA operons were predicted in the QU 25 chromosome. Plasmid pQY024 harbours genes for mundticin production. We found that strain QU 25 produces a bacteriocin, suggesting that the mundticin-encoding genes on plasmid pQY024 were functional. For lactic acid fermentation, two gene clusters were identified: one involved in the initial metabolism of xylose and uptake of pentose, and the second containing genes for the pentose phosphate pathway and uptake of related sugars. This is the first complete genome sequence of an E. mundtii strain. The data provide insights into lactate production in this bacterium and its evolution among enterococci.
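
    The GC content quoted above is the share of G and C bases in a sequence. A minimal Python sketch of that calculation follows; the function name `gc_content` and the toy sequence are illustrative assumptions and are not taken from the QU 25 genome or from the authors' pipeline.

```python
def gc_content(seq: str) -> float:
    """Return the percentage of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at  # ambiguous symbols such as N are ignored
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return 100.0 * gc / total


if __name__ == "__main__":
    toy = "ATGCGCATTA"  # hypothetical toy sequence, for illustration only
    print(f"GC content: {gc_content(toy):.1f}%")
```

    Ignoring ambiguous bases is one possible convention; other tools may count them in the denominator, which would shift the reported percentage slightly.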

  9. State of the art and challenges in sequence based T-cell epitope prediction

    PubMed Central

    2010-01-01

    Sequence based T-cell epitope predictions have improved immensely in the last decade. From predictions of peptide binding to major histocompatibility complex molecules with moderate accuracy, limited allele coverage, and no good estimates of the other events in the antigen-processing pathway, the field has evolved significantly. Methods have now been developed that produce highly accurate binding predictions for many alleles and integrate both proteasomal cleavage and transport events. Moreover, so-called pan-specific methods have been developed, which allow for prediction of peptide binding to MHC alleles characterized by limited or no peptide binding data. Most of the developed methods are publicly available, and have proven to be very useful as a shortcut in epitope discovery. Here, we will go through some of the history of sequence-based predictions of helper as well as cytotoxic T cell epitopes. We will focus on some of the most accurate methods and their basic background. PMID:21067545

  10. Complete genome sequence of the probiotic lactic acid bacterium Lactobacillus acidophilus NCFM

    PubMed Central

    Altermann, Eric; Russell, W. Michael; Azcarate-Peril, M. Andrea; Barrangou, Rodolphe; Buck, B. Logan; McAuliffe, Olivia; Souther, Nicole; Dobson, Alleson; Duong, Tri; Callanan, Michael; Lick, Sonja; Hamrick, Alice; Cano, Raul; Klaenhammer, Todd R.

    2005-01-01

    Lactobacillus acidophilus NCFM is a probiotic bacterium that has been produced commercially since 1972. The complete genome is 1,993,564 nt and devoid of plasmids. The average GC content is 34.71% with 1,864 predicted ORFs, of which 72.5% were functionally classified. Nine phage-related integrases were predicted, but no complete prophages were found. However, three unique regions designated as potential autonomous units (PAUs) were identified. These units resemble a unique structure and bear characteristics of both plasmids and phages. Analysis of the three PAUs revealed the presence of two R/M systems and a prophage maintenance system killer protein. A spacers interspersed direct repeat locus containing 32 nearly perfect 29-bp repeats was discovered and may provide a unique molecular signature for this organism. In silico analyses predicted 17 transposase genes and a chromosomal locus for lactacin B, a class II bacteriocin. Several mucus- and fibronectin-binding proteins, implicated in adhesion to human intestinal cells, were also identified. Gene clusters for transport of a diverse group of carbohydrates, including fructooligosaccharides and raffinose, were present and often accompanied by transcriptional regulators of the lacI family. For protein degradation and peptide utilization, the organism encoded 20 putative peptidases, homologs for PrtP and PrtM, and two complete oligopeptide transport systems. Nine two-component regulatory systems were predicted, some associated with determinants implicated in bacteriocin production and acid tolerance. Collectively, these features within the genome sequence of L. acidophilus are likely to contribute to the organism's gastric survival and promote interactions with the intestinal mucosa and microbiota. PMID:15671160

  11. Phenotype-optimized sequence ensembles substantially improve prediction of disease-causing mutation in cystic fibrosis.

    PubMed

    Masica, David L; Sosnay, Patrick R; Cutting, Garry R; Karchin, Rachel

    2012-08-01

    Cystic fibrosis transmembrane conductance regulator (CFTR) mutation is associated with a phenotypic spectrum that includes cystic fibrosis (CF). The disease liability of some common CFTR mutations is known, but rare mutations are seen in too few patients to categorize unequivocally, making genetic diagnosis difficult. Computational methods can predict the impact of mutation, but prediction specificity is often below that required for clinical utility. Here, we present a novel supervised learning approach for predicting CF from CFTR missense mutation. The algorithm begins by constructing custom multiple sequence alignments called phenotype-optimized sequence ensembles (POSEs). POSEs are constructed iteratively, by selecting sequences that optimize predictive performance on a training set of CFTR mutations of known clinical significance. Next, we predict CF disease liability from a different set of CFTR mutations (test-set mutations). This approach achieves improved prediction performance relative to popular methods recently assessed using the same test-set mutations. Of clinical significance, our method achieves 94% prediction specificity. Because databases such as HGMD and locus-specific mutation databases are growing rapidly, methods that automatically tailor their predictions for a specific phenotype may be of immediate utility. If the performance achieved here generalizes to other systems, the approach could be an excellent tool to help establish genetic diagnoses.

  12. PASTA: Ultra-Large Multiple Sequence Alignment for Nucleotide and Amino-Acid Sequences.

    PubMed

    Mirarab, Siavash; Nguyen, Nam; Guo, Sheng; Wang, Li-San; Kim, Junhyong; Warnow, Tandy

    2015-05-01

    We introduce PASTA, a new multiple sequence alignment algorithm. PASTA uses a new technique to produce an alignment given a guide tree that enables it to be both highly scalable and very accurate. We present a study on biological and simulated data with up to 200,000 sequences, showing that PASTA produces highly accurate alignments, improving on the accuracy and scalability of the leading alignment methods (including SATé). We also show that trees estimated on PASTA alignments are highly accurate--slightly better than SATé trees, but with substantial improvements relative to other methods. Finally, PASTA is faster than SATé, highly parallelizable, and requires relatively little memory.

  13. HomPPI: a class of sequence homology based protein-protein interface prediction methods

    PubMed Central

    2011-01-01

    Background: Although homology-based methods are among the most widely used methods for predicting the structure and function of proteins, the question as to whether interface sequence conservation can be effectively exploited in predicting protein-protein interfaces has been a subject of debate. Results: We studied more than 300,000 pair-wise alignments of protein sequences from structurally characterized protein complexes, including both obligate and transient complexes. We identified sequence similarity criteria required for accurate homology-based inference of interface residues in a query protein sequence. Based on these analyses, we developed HomPPI, a class of sequence homology-based methods for predicting protein-protein interface residues. We present two variants of HomPPI: (i) NPS-HomPPI (Non partner-specific HomPPI), which can be used to predict interface residues of a query protein in the absence of knowledge of the interaction partner; and (ii) PS-HomPPI (Partner-specific HomPPI), which can be used to predict the interface residues of a query protein with a specific target protein. Our experiments on a benchmark dataset of obligate homodimeric complexes show that NPS-HomPPI can reliably predict protein-protein interface residues in a given protein, with an average correlation coefficient (CC) of 0.76, sensitivity of 0.83, and specificity of 0.78, when sequence homologs of the query protein can be reliably identified. NPS-HomPPI also reliably predicts the interface residues of intrinsically disordered proteins. Our experiments suggest that NPS-HomPPI is competitive with several state-of-the-art interface prediction servers including those that exploit the structure of the query proteins. The partner-specific classifier, PS-HomPPI can, on a large dataset of transient complexes, predict the interface residues of a query protein with a specific target, with a CC of 0.65, sensitivity of 0.69, and specificity of 0.70, when homologs of both the query and the

  14. HomPPI: a class of sequence homology based protein-protein interface prediction methods.

    PubMed

    Xue, Li C; Dobbs, Drena; Honavar, Vasant

    2011-06-17

    Although homology-based methods are among the most widely used methods for predicting the structure and function of proteins, the question as to whether interface sequence conservation can be effectively exploited in predicting protein-protein interfaces has been a subject of debate. We studied more than 300,000 pair-wise alignments of protein sequences from structurally characterized protein complexes, including both obligate and transient complexes. We identified sequence similarity criteria required for accurate homology-based inference of interface residues in a query protein sequence. Based on these analyses, we developed HomPPI, a class of sequence homology-based methods for predicting protein-protein interface residues. We present two variants of HomPPI: (i) NPS-HomPPI (Non partner-specific HomPPI), which can be used to predict interface residues of a query protein in the absence of knowledge of the interaction partner; and (ii) PS-HomPPI (Partner-specific HomPPI), which can be used to predict the interface residues of a query protein with a specific target protein. Our experiments on a benchmark dataset of obligate homodimeric complexes show that NPS-HomPPI can reliably predict protein-protein interface residues in a given protein, with an average correlation coefficient (CC) of 0.76, sensitivity of 0.83, and specificity of 0.78, when sequence homologs of the query protein can be reliably identified. NPS-HomPPI also reliably predicts the interface residues of intrinsically disordered proteins. Our experiments suggest that NPS-HomPPI is competitive with several state-of-the-art interface prediction servers including those that exploit the structure of the query proteins. The partner-specific classifier, PS-HomPPI can, on a large dataset of transient complexes, predict the interface residues of a query protein with a specific target, with a CC of 0.65, sensitivity of 0.69, and specificity of 0.70, when homologs of both the query and the target can be reliably
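
    The abstract reports a correlation coefficient (CC), sensitivity, and specificity but does not state the formulas. The Python sketch below shows how such per-residue scores could be computed, assuming the standard confusion-matrix definitions (CC as the Matthews correlation coefficient, sensitivity as TP/(TP+FN), specificity as TN/(TN+FP)). The paper's own definitions may differ; some interface-prediction studies use "specificity" for TP/(TP+FP). The function name `interface_metrics` and the toy labels are hypothetical.

```python
import math


def interface_metrics(predicted, actual):
    """Score per-residue predictions (1 = interface, 0 = non-interface) against labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)          # interface, predicted interface
    tn = sum(1 for p, a in pairs if not p and not a)  # non-interface, predicted non-interface
    fp = sum(1 for p, a in pairs if p and not a)      # non-interface, predicted interface
    fn = sum(1 for p, a in pairs if not p and a)      # interface, predicted non-interface

    def ratio(num, den):
        # Undefined ratios (zero denominator) are reported as NaN.
        return num / den if den else float("nan")

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "cc": ratio(tp * tn - fp * fn, mcc_den),
    }


if __name__ == "__main__":
    # Hypothetical toy data: eight residues of one query protein.
    predicted = [1, 1, 0, 0, 1, 0, 0, 0]
    actual = [1, 0, 0, 0, 1, 1, 0, 0]
    print(interface_metrics(predicted, actual))
```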

  15. SETG: Nucleic Acid Extraction and Sequencing for In Situ Life Detection on Mars

    NASA Astrophysics Data System (ADS)

    Mojarro, A.; Hachey, J.; Tani, J.; Smith, A.; Bhattaru, S. A.; Pontefract, A.; Doebler, R.; Brown, M.; Ruvkun, G.; Zuber, M. T.; Carr, C. E.

    2016-10-01

    We are developing an integrated nucleic acid extraction and sequencing instrument: the Search for Extra-Terrestrial Genomes (SETG) for in situ life detection on Mars. Our goals are to identify related or unrelated nucleic acid-based life on Mars.

  16. Predicting the Genetic Stability of Engineered DNA Sequences with the EFM Calculator.

    PubMed

    Jack, Benjamin R; Leonard, Sean P; Mishler, Dennis M; Renda, Brian A; Leon, Dacia; Suárez, Gabriel A; Barrick, Jeffrey E

    2015-08-21

    Unwanted evolution can rapidly degrade the performance of genetically engineered circuits and metabolic pathways installed in living organisms. We created the Evolutionary Failure Mode (EFM) Calculator to computationally detect common sources of genetic instability in an input DNA sequence. It predicts two types of mutational hotspots: deletions mediated by homologous recombination and indels caused by replication slippage on simple sequence repeats. We tested the performance of our algorithm on genetic circuits that were previously redesigned for greater evolutionary reliability and analyzed the stability of sequences in the iGEM Registry of Standard Biological Parts. More than half of the parts in the Registry are predicted to experience >100-fold elevated mutation rates due to the inclusion of unstable sequence configurations. We anticipate that the EFM Calculator will be a useful negative design tool for avoiding volatile DNA encodings, thereby increasing the evolutionary lifetimes of synthetic biology devices.
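
    One of the two hotspot types named above is replication slippage on simple sequence repeats. The Python sketch below illustrates how tandem repeats of a short unit could be located with a regular expression. It is not the EFM Calculator's algorithm: the function name `find_simple_repeats`, the unit-length limit, and the copy-number threshold are all assumptions chosen for illustration.

```python
import re


def is_primitive(unit: str) -> bool:
    """True if the unit is not itself a repetition of a shorter unit (e.g. 'TT' is not)."""
    n = len(unit)
    return not any(n % k == 0 and unit == unit[:k] * (n // k) for k in range(1, n))


def find_simple_repeats(seq: str, max_unit: int = 6, min_copies: int = 3):
    """Return (start, unit, copies) for tandem repeats of units up to max_unit bases.

    Matches are non-overlapping within each unit length; this is a simplification.
    """
    seq = seq.upper()
    hits = []
    for unit_len in range(1, max_unit + 1):
        # A unit followed by at least (min_copies - 1) further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            if is_primitive(unit):
                hits.append((match.start(), unit, len(match.group(0)) // unit_len))
    return sorted(hits)


if __name__ == "__main__":
    toy = "GGATCACACACACATTTTTTGC"  # hypothetical toy sequence
    for start, unit, copies in find_simple_repeats(toy):
        print(f"position {start}: ({unit}) x {copies}")
```

    Filtering out non-primitive units keeps a homopolymer run from being reported again as a dinucleotide or trinucleotide repeat.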

  17. Severe accident source term characteristics for selected Peach Bottom sequences predicted by the MELCOR Code

    SciTech Connect

    Carbajo, J.J.

    1993-09-01

    The purpose of this report is to compare in-containment source terms developed for NUREG-1159, which used the Source Term Code Package (STCP), with those generated by MELCOR to identify significant differences. For this comparison, two short-term depressurized station blackout sequences (with a dry cavity and with a flooded cavity) and a Loss-of-Coolant Accident (LOCA) concurrent with complete loss of the Emergency Core Cooling System (ECCS) were analyzed for the Peach Bottom Atomic Power Station (a BWR-4 with a Mark I containment). The results indicate that for the sequences analyzed, the two codes predict similar total in-containment release fractions for each of the element groups. However, the MELCOR/CORBH Package predicts significantly longer times for vessel failure and reduced energy of the released material for the station blackout sequences (when compared to the STCP results). MELCOR also calculated smaller releases into the environment than STCP for the station blackout sequences.

  18. Parvalbumins from coelacanth muscle. III. Amino acid sequence of the major component.

    PubMed

    Jauregui-Adell, J; Pechere, J F

    1978-09-26

    The primary structure of the major parvalbumin (pI = 4.52) from coelacanth muscle (Latimeria chalumnae) has been determined. Sequence analysis of the tryptic peptides, in some cases obtained with beta-trypsin, accounts for the total amino acid content of the protein. Chymotryptic peptides provide appropriate sequence overlaps, to complete the localization of the tryptic peptides. Examination of the amino acid sequence of this protein shows the typical structure of a beta-parvalbumin. Its position in the dendrogram of related calcium-binding proteins corresponds to that usually accepted for crossopterygians.

  19. Genomic prediction for beef fatty acid profile in Nellore cattle.

    PubMed

    Chiaia, Hermenegildo Lucas Justino; Peripoli, Elisa; Silva, Rafael Medeiros de Oliveira; Aboujaoude, Carolyn; Feitosa, Fabiele Loise Braga; Lemos, Marcos Vinicius Antunes de; Berton, Mariana Piatto; Olivieri, Bianca Ferreira; Espigolan, Rafael; Tonussi, Rafael Lara; Gordo, Daniel Gustavo Mansan; Bresolin, Tiago; Magalhães, Ana Fabrícia Braga; Júnior, Gerardo Alves Fernandes; Albuquerque, Lúcia Galvão de; Oliveira, Henrique Nunes de; Furlan, Joyce de Jesus Mangini; Ferrinho, Adrielle Mathias; Mueller, Lenise Freitas; Tonhati, Humberto; Pereira, Angélica Simone Cravo; Baldi, Fernando

    2017-06-01

    The objective of this study was to compare SNP-BLUP, BayesCπ, BayesC and Bayesian Lasso methodologies to predict the direct genomic value (DGV) for saturated, monounsaturated, and polyunsaturated fatty acid profile, omega 3 and 6 in the Longissimus thoracis muscle of Nellore cattle finished in feedlot. A total of 963 Nellore bulls with phenotypes for fatty acid profiles were genotyped using the Illumina BovineHD BeadChip (Illumina, San Diego, CA) with 777,962 SNP. The predictive ability was evaluated using cross validation. To compare the methodologies, the correlation between DGV and pseudo-phenotypes was calculated. The accuracy varied from -0.40 to 0.62. Our results indicate that none of the methods excelled in terms of accuracy; however, the SNP-BLUP method yields less biased genomic evaluations and is therefore more feasible when the operating cost of the analyses is taken into account. Despite the lowest bias observed for EBV, the adjusted phenotype is the preferred pseudophenotype considering the genomic prediction accuracies regarding the context of the present study.
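
    The comparison above rests on the correlation between predicted genomic values and pseudo-phenotypes in cross validation. The Python sketch below shows one way such a per-fold Pearson correlation could be computed and averaged. It assumes the predictions for each validation fold are already available; the function names and the toy numbers are hypothetical, and the study's actual accuracy measure may include further adjustments not stated in the abstract.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def mean_fold_correlation(folds):
    """Average the correlation over folds; each fold is a (predicted, observed) pair."""
    values = [pearson_r(predicted, observed) for predicted, observed in folds]
    return sum(values) / len(values)


if __name__ == "__main__":
    # Hypothetical toy folds: predicted values versus pseudo-phenotypes.
    folds = [
        ([0.1, 0.4, 0.3, 0.9], [0.2, 0.5, 0.1, 0.8]),
        ([1.0, 0.2, 0.6, 0.4], [0.7, 0.3, 0.9, 0.2]),
    ]
    print(f"mean correlation: {mean_fold_correlation(folds):.3f}")
```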

  20. Amino acid sequence of winged bean (Psophocarpus tetragonolobus (L.) DC.) chymotrypsin inhibitor, WCI-3.

    PubMed

    Shibata, H; Hara, S; Ikenaka, T

    1988-10-01

    The complete amino acid sequence of winged bean chymotrypsin inhibitor 3 (WCI-3) was determined by the conventional methods. WCI-3 consisted of 183 amino acid residues, but was heterogeneous in the carboxyl terminal region owing to the loss of one to four carboxyl terminal amino acid residues. The sequence of WCI-3 was highly homologous with those of soybean trypsin inhibitor Tia, winged bean trypsin inhibitor WTI-1, and Erythrina latissima trypsin inhibitor DE-3. One of the reactive site peptide bonds of WCI-3 was identified as Leu(65)-Ser(66), which was located at the same position as those of the other Kunitz-family leguminous proteinase inhibitors.

  1. Purification, characterization and partial amino acid sequence of glycogen synthase from Saccharomyces cerevisiae.

    PubMed Central

    Carabaza, A; Arino, J; Fox, J W; Villar-Palasi, C; Guinovart, J J

    1990-01-01

    Glycogen synthase from Saccharomyces cerevisiae was purified to homogeneity. The enzyme showed a subunit molecular mass of 80 kDa. The holoenzyme appears to be a tetramer. Antibodies developed against purified yeast glycogen synthase inactivated the enzyme in yeast extracts and allowed the detection of the protein in Western blots. Amino acid analysis showed that the enzyme is very rich in glutamate and/or glutamine residues. The N-terminal sequence (11 amino acid residues) was determined. In addition, selected tryptic-digest peptides were purified by reverse-phase h.p.l.c. and submitted to gas-phase sequencing. Up to eight sequences (79 amino acid residues) could be aligned with the human muscle enzyme sequence. Levels of identity range between 37 and 100%, indicating that, although human and yeast glycogen synthases probably share some conserved regions, significant differences in their primary structure should be expected. PMID:2114092

  2. Amino acid sequence of anionic peroxidase from the windmill palm tree Trachycarpus fortunei.

    PubMed

    Baker, Margaret R; Zhao, Hongwei; Sakharov, Ivan Yu; Li, Qing X

    2014-12-10

    Palm peroxidases are extremely stable and have uncommon substrate specificity. This study was designed to fill in the knowledge gap about the structures of a peroxidase from the windmill palm tree Trachycarpus fortunei. The complete amino acid sequence and partial glycosylation were determined by MALDI-top-down sequencing of native windmill palm tree peroxidase (WPTP), MALDI-TOF/TOF MS/MS of WPTP tryptic peptides, and cDNA sequencing. The propeptide of WPTP contained N- and C-terminal signal sequences which contained 21 and 17 amino acid residues, respectively. Mature WPTP was 306 amino acids in length, and its carbohydrate content ranged from 21% to 29%. Comparison to closely related royal palm tree peroxidase revealed structural features that may explain differences in their substrate specificity. The results can be used to guide engineering of WPTP and its novel applications.

  3. Amino Acid Sequence of Anionic Peroxidase from the Windmill Palm Tree Trachycarpus fortunei

    PubMed Central

    2015-01-01

    Palm peroxidases are extremely stable and have uncommon substrate specificity. This study was designed to fill in the knowledge gap about the structures of a peroxidase from the windmill palm tree Trachycarpus fortunei. The complete amino acid sequence and partial glycosylation were determined by MALDI-top-down sequencing of native windmill palm tree peroxidase (WPTP), MALDI-TOF/TOF MS/MS of WPTP tryptic peptides, and cDNA sequencing. The propeptide of WPTP contained N- and C-terminal signal sequences which contained 21 and 17 amino acid residues, respectively. Mature WPTP was 306 amino acids in length, and its carbohydrate content ranged from 21% to 29%. Comparison to closely related royal palm tree peroxidase revealed structural features that may explain differences in their substrate specificity. The results can be used to guide engineering of WPTP and its novel applications. PMID:25383699

  4. αIIbβ3 variants defined by next-generation sequencing: Predicting variants likely to cause Glanzmann thrombasthenia

    PubMed Central

    Buitrago, Lorena; Rendon, Augusto; Liang, Yupu; Simeoni, Ilenia; Negri, Ana; Filizola, Marta; Ouwehand, Willem H.; Coller, Barry S.; Alessi, Marie-Christine; Ballmaier, Matthias; Bariana, Tadbir; Bellissimo, Daniel; Bertoli, Marta; Bray, Paul; Bury, Loredana; Carrell, Robin; Cattaneo, Marco; Collins, Peter; French, Deborah; Favier, Remi; Freson, Kathleen; Furie, Bruce; Germeshausen, Manuela; Ghevaert, Cedric; Gomez, Keith; Goodeve, Anne; Gresele, Paolo; Guerrero, Jose; Hampshire, Dan J.; Hadinnapola, Charaka; Heemskerk, Johan; Henskens, Yvonne; Hill, Marian; Hogg, Nancy; Johnsen, Jill; Kahr, Walter; Kerr, Ron; Kunishima, Shinji; Laffan, Michael; Natwani, Amit; Neerman-Arbez, Marguerite; Nurden, Paquita; Nurden, Alan; Ormiston, Mark; Othman, Maha; Ouwehand, Willem; Perry, David; Vilk, Shoshana Ravel; Reitsma, Pieter; Rondina, Matthew; Simeoni, Ilenia; Smethurst, Peter; Stephens, Jonathan; Stevenson, William; Szkotak, Artur; Turro, Ernest; Van Geet, Christel; Vries, Minka; Ward, June; Waye, John; Westbury, Sarah; Whiteheart, Sidney; Wilcox, David; Zhang, Bi

    2015-01-01

    Next-generation sequencing is transforming our understanding of human genetic variation but assessing the functional impact of novel variants presents challenges. We analyzed missense variants in the integrin αIIbβ3 receptor subunit genes ITGA2B and ITGB3 identified by whole-exome or -genome sequencing in the ThromboGenomics project, comprising ∼32,000 alleles from 16,108 individuals. We analyzed the results in comparison with 111 missense variants in these genes previously reported as being associated with Glanzmann thrombasthenia (GT), 20 associated with alloimmune thrombocytopenia, and 5 associated with aniso/macrothrombocytopenia. We identified 114 novel missense variants in ITGA2B (affecting ∼11% of the amino acids) and 68 novel missense variants in ITGB3 (affecting ∼9% of the amino acids). Of the variants, 96% had minor allele frequencies (MAF) < 0.1%, indicating their rarity. Based on sequence conservation, MAF, and location on a complete model of αIIbβ3, we selected three novel variants that affect amino acids previously associated with GT for expression in HEK293 cells. αIIb P176H and β3 C547G severely reduced αIIbβ3 expression, whereas αIIb P943A partially reduced αIIbβ3 expression and had no effect on fibrinogen binding. We used receiver operating characteristic curves of combined annotation-dependent depletion, Polyphen 2-HDIV, and sorting intolerant from tolerant to estimate the percentage of novel variants likely to be deleterious. At optimal cut-off values, which had 69–98% sensitivity in detecting GT mutations, between 27% and 71% of the novel αIIb or β3 missense variants were predicted to be deleterious. Our data have implications for understanding the evolutionary pressure on αIIbβ3 and highlight the challenges in predicting the clinical significance of novel missense variants. PMID:25827233

  5. αIIbβ3 variants defined by next-generation sequencing: predicting variants likely to cause Glanzmann thrombasthenia.

    PubMed

    Buitrago, Lorena; Rendon, Augusto; Liang, Yupu; Simeoni, Ilenia; Negri, Ana; Filizola, Marta; Ouwehand, Willem H; Coller, Barry S

    2015-04-14

    Next-generation sequencing is transforming our understanding of human genetic variation but assessing the functional impact of novel variants presents challenges. We analyzed missense variants in the integrin αIIbβ3 receptor subunit genes ITGA2B and ITGB3 identified by whole-exome or -genome sequencing in the ThromboGenomics project, comprising ∼32,000 alleles from 16,108 individuals. We analyzed the results in comparison with 111 missense variants in these genes previously reported as being associated with Glanzmann thrombasthenia (GT), 20 associated with alloimmune thrombocytopenia, and 5 associated with aniso/macrothrombocytopenia. We identified 114 novel missense variants in ITGA2B (affecting ∼11% of the amino acids) and 68 novel missense variants in ITGB3 (affecting ∼9% of the amino acids). Of the variants, 96% had minor allele frequencies (MAF) < 0.1%, indicating their rarity. Based on sequence conservation, MAF, and location on a complete model of αIIbβ3, we selected three novel variants that affect amino acids previously associated with GT for expression in HEK293 cells. αIIb P176H and β3 C547G severely reduced αIIbβ3 expression, whereas αIIb P943A partially reduced αIIbβ3 expression and had no effect on fibrinogen binding. We used receiver operating characteristic curves of combined annotation-dependent depletion, Polyphen 2-HDIV, and sorting intolerant from tolerant to estimate the percentage of novel variants likely to be deleterious. At optimal cut-off values, which had 69-98% sensitivity in detecting GT mutations, between 27% and 71% of the novel αIIb or β3 missense variants were predicted to be deleterious. Our data have implications for understanding the evolutionary pressure on αIIbβ3 and highlight the challenges in predicting the clinical significance of novel missense variants.

  6. TranslatorX: multiple alignment of nucleotide sequences guided by amino acid translations.

    PubMed

    Abascal, Federico; Zardoya, Rafael; Telford, Maximilian J

    2010-07-01

    We present TranslatorX, a web server designed to align protein-coding nucleotide sequences based on their corresponding amino acid translations. Many comparisons between biological sequences (nucleic acids and proteins) involve the construction of multiple alignments. Alignments represent a statement regarding the homology between individual nucleotides or amino acids within homologous genes. As protein-coding DNA sequences evolve as triplets of nucleotides (codons) and it is known that sequence similarity degrades more rapidly at the DNA than at the amino acid level, alignments are generally more accurate when based on amino acids than on their corresponding nucleotides. TranslatorX novelties include: (i) use of all documented genetic codes and the possibility of assigning different genetic codes for each sequence; (ii) a battery of different multiple alignment programs; (iii) translation of ambiguous codons when possible; (iv) an innovative criterion to clean nucleotide alignments with GBlocks based on protein information; and (v) a rich output, including Jalview-powered graphical visualization of the alignments, codon-based alignments coloured according to the corresponding amino acids, measures of compositional bias and first, second and third codon position specific alignments. The TranslatorX server is freely available at http://translatorx.co.uk.
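
    The core idea above, aligning nucleotides according to an amino acid alignment, can be illustrated by threading codons onto a gapped protein sequence. The Python sketch below covers only that final step and assumes the protein alignment has already been produced by some alignment program. It is not TranslatorX's code: the function name `thread_codons` and the toy sequences are hypothetical, and the coding sequence is assumed to be in frame and to exclude the stop codon.

```python
def thread_codons(aligned_protein: str, cds: str) -> str:
    """Map an ungapped coding sequence onto a gapped protein alignment row.

    Each residue consumes one codon; each protein gap '-' becomes a codon gap '---'.
    """
    residues = aligned_protein.replace("-", "")
    if len(cds) != 3 * len(residues):
        raise ValueError("coding sequence length must be three times the residue count")
    codons = (cds[i:i + 3] for i in range(0, len(cds), 3))
    return "".join("---" if aa == "-" else next(codons) for aa in aligned_protein)


if __name__ == "__main__":
    # Hypothetical toy alignment of two short proteins and their coding sequences.
    protein_rows = {"seq1": "MK-L", "seq2": "MKAL"}
    coding = {"seq1": "ATGAAACTG", "seq2": "ATGAAAGCTCTT"}
    for name, row in protein_rows.items():
        print(name, thread_codons(row, coding[name]))
```

    Because gaps are inserted only in multiples of three, the resulting nucleotide alignment keeps every codon intact, which is what makes the first, second, and third codon positions separable afterwards.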

  7. Integrated tools for biomolecular sequence-based function prediction as exemplified by the ANNOTATOR software environment.

    PubMed

    Schneider, Georg; Wildpaner, Michael; Sirota, Fernanda L; Maurer-Stroh, Sebastian; Eisenhaber, Birgit; Eisenhaber, Frank

    2010-01-01

    Given the amount of sequence data available today, in silico function prediction, which often includes detecting distant evolutionary relationships, requires sophisticated bioinformatic workflows. The algorithms behind these workflows exhibit complex data structures; they need the ability to spawn subtasks and tend to demand large amounts of resources. Performing sequence analytic tasks by manually invoking individual function prediction algorithms having to transform between differing input and output formats has become increasingly obsolete. After a period of linking individual predictors using ad hoc scripts, a number of integrated platforms are finally emerging. We present the ANNOTATOR software environment as an advanced example of such a platform.

  8. Nucleotide and deduced amino acid sequences of a new subtilisin from an alkaliphilic Bacillus isolate.

    PubMed

    Saeki, Katsuhisa; Magallones, Marietta V; Takimura, Yasushi; Hatada, Yuji; Kobayashi, Tohru; Kawai, Shuji; Ito, Susumu

    2003-10-01

    The gene for a new subtilisin from the alkaliphilic Bacillus sp. KSM-LD1 was cloned and sequenced. The open reading frame of the gene encoded a 97 amino-acid prepro-peptide plus a 307 amino-acid mature enzyme that contained a possible catalytic triad of residues, Asp32, His66, and Ser224. The deduced amino acid sequence of the mature enzyme (LD1) showed approximately 65% identity to those of subtilisins SprC and SprD from alkaliphilic Bacillus sp. LG12. The amino acid sequence identities of LD1 to those of previously reported true subtilisins and high-alkaline proteases were below 60%. LD1 was characteristically stable during incubation with surfactants and chemical oxidants. Interestingly, an oxidizable Met residue is located next to the catalytic Ser224 of the enzyme as in the cases of the oxidation-susceptible subtilisins reported to date.

  9. Amino acid sequence of homologous rat atrial peptides: natriuretic activity of native and synthetic forms.

    PubMed Central

    Seidah, N G; Lazure, C; Chrétien, M; Thibault, G; Garcia, R; Cantin, M; Genest, J; Nutt, R F; Brady, S F; Lyle, T A

    1984-01-01

    A substance called atrial natriuretic factor (ANF), localized in secretory granules of atrial cardiocytes, was isolated as four homologous natriuretic peptides from homogenates of rat atria. The complete sequence of the longest form showed that it is composed of 33 amino acids. The three other shorter forms (2-33, 3-33, and 8-33) represent amino-terminally truncated versions of the 33 amino acid parent molecule as shown by analysis of sequence, amino acid composition, or both. The proposed primary structure agrees entirely with the amino acid composition and reveals no significant sequence homology with any known protein or segment of protein. The short form ANF-(8-33) was synthesized by a multi-fragment condensation approach and the synthetic product was shown to exhibit specific activity comparable to that of the natural ANF-(3-33). PMID:6232612

  10. High quality protein sequence alignment by combining structural profile prediction and profile alignment using SABERTOOTH

    PubMed Central

    2010-01-01

    Background: Protein alignments are an essential tool for many bioinformatics analyses. While sequence alignments are accurate for proteins of high sequence similarity, they become unreliable as they approach the so-called 'twilight zone' where sequence similarity gets indistinguishable from random. For such distant pairs, structure alignment is of much better quality. Nevertheless, sequence alignment is the only choice in the majority of cases where structural data is not available. This situation demands development of methods that extend the applicability of accurate sequence alignment to distantly related proteins. Results: We develop a sequence alignment method that combines the prediction of a structural profile based on the protein's sequence with the alignment of that profile using our recently published alignment tool SABERTOOTH. In particular, we predict the contact vector of protein structures using an artificial neural network based on position-specific scoring matrices generated by PSI-BLAST and align these predicted contact vectors. The resulting sequence alignments are assessed using two different tests: First, we assess the alignment quality by measuring the derived structural similarity for cases in which structures are available. In a second test, we quantify the ability of the significance score of the alignments to recognize structural and evolutionary relationships. As a benchmark we use a representative set of the SCOP (structural classification of proteins) database, with similarities ranging from closely related proteins at SCOP family level, to very distantly related proteins at SCOP fold level. Comparing these results with some prominent sequence alignment tools, we find that SABERTOOTH produces sequence alignments of better quality than those of Clustal W, T-Coffee, MUSCLE, and PSI-BLAST. HHpred, one of the most sophisticated and computationally expensive tools available, outperforms our alignment algorithm at family and superfamily levels

  11. Sequence-specific flexibility organization of splicing flanking sequence and prediction of splice sites in the human genome.

    PubMed

    Zuo, Yongchun; Zhang, Pengfei; Liu, Li; Li, Tao; Peng, Yong; Li, Guangpeng; Li, Qianzhong

    2014-09-01

    A growing number of reported results on nucleosome positioning and histone modifications show that DNA structure plays a well-established role in splicing. In this study, a set of DNA geometric flexibility parameters derived from molecular dynamics (MD) simulations was introduced to examine the structural organization around splice sites at the DNA level. The obtained profiles of specific flexibility/stiffness around splice sites indicated that DNA physical-geometry deformation could be used as an alternative way to describe the splicing junction region. Using structural flexibility as a discriminatory parameter, we developed a hybrid computational model for predicting potential splice sites. Better prediction performance was achieved when the benchmark dataset was evaluated. Our results showed that the mechanical deformability of a splice junction is closely correlated with both the splice site strength and the structural information in its flanking sequences.

  12. SGP-1: Prediction and Validation of Homologous Genes Based on Sequence Alignments

    PubMed Central

    Wiehe, Thomas; Gebauer-Jung, Steffi; Mitchell-Olds, Thomas; Guigó, Roderic

    2001-01-01

    Conventional methods of gene prediction rely on the recognition of DNA-sequence signals, the coding potential or the comparison of a genomic sequence with a cDNA, EST, or protein database. Reasons for limited accuracy in many circumstances are species-specific training and the incompleteness of reference databases. Lately, comparative genome analysis has attracted increasing attention. Several analysis tools that are based on human/mouse comparisons are already available. Here, we present a program for the prediction of protein-coding genes, termed SGP-1 (Syntenic Gene Prediction), which is based on the similarity of homologous genomic sequences. In contrast to most existing tools, the accuracy of SGP-1 depends little on species-specific properties such as codon usage or the nucleotide distribution. SGP-1 may therefore be applied to nonstandard model organisms in vertebrates as well as in plants, without the need for extensive parameter training. In addition to predicting genes in large-scale genomic sequences, the program may be useful to validate gene structure annotations from databases. To this end, SGP-1 output also contains comparisons between predicted and annotated gene structures in HTML format. The program can be accessed via a Web server at http://soft.ice.mpg.de/sgp-1. The source code, written in ANSI C, is available on request from the authors. PMID:11544202

  13. DNApi: A De Novo Adapter Prediction Algorithm for Small RNA Sequencing Data.

    PubMed

    Tsuji, Junko; Weng, Zhiping

    2016-01-01

    With the rapid accumulation of publicly available small RNA sequencing datasets, third-party meta-analysis across many datasets is becoming increasingly powerful. Although removing the 3´ adapter is an essential step for small RNA sequencing analysis, the adapter sequence information is not always available in the metadata. The information can also be erroneous even when it is available. In this study, we developed DNApi, a lightweight Python software package that predicts the 3´ adapter sequence de novo and provides the user with cleansed small RNA sequences ready for downstream analysis. Tested on 539 publicly available small RNA libraries accompanied by 3´ adapter sequences in their metadata, DNApi shows near-perfect accuracy (98.5%) with fast runtime (~2.85 seconds per library) and efficient memory usage (~43 MB on average). In addition to 3´ adapter prediction, it is also important to classify whether the input small RNA libraries were already processed, i.e. the 3´ adapters were removed. Given another batch of datasets, DNApi correctly judged all 192 publicly available processed libraries to be "ready-to-map" small RNA sequences. DNApi is compatible with Python 2 and 3, and is available at https://github.com/jnktsj/DNApi. The 731 small RNA libraries used for DNApi evaluation were from human tissues and were carefully and manually collected. This study also provides readers with the curated datasets that can be integrated into their studies.

  14. DNApi: A De Novo Adapter Prediction Algorithm for Small RNA Sequencing Data

    PubMed Central

    Tsuji, Junko; Weng, Zhiping

    2016-01-01

    With the rapid accumulation of publicly available small RNA sequencing datasets, third-party meta-analysis across many datasets is becoming increasingly powerful. Although removing the 3´ adapter is an essential step for small RNA sequencing analysis, the adapter sequence information is not always available in the metadata. The information can also be erroneous even when it is available. In this study, we developed DNApi, a lightweight Python software package that predicts the 3´ adapter sequence de novo and provides the user with cleansed small RNA sequences ready for downstream analysis. Tested on 539 publicly available small RNA libraries accompanied by 3´ adapter sequences in their metadata, DNApi shows near-perfect accuracy (98.5%) with fast runtime (~2.85 seconds per library) and efficient memory usage (~43 MB on average). In addition to 3´ adapter prediction, it is also important to classify whether the input small RNA libraries were already processed, i.e. the 3´ adapters were removed. Given another batch of datasets, DNApi correctly judged all 192 publicly available processed libraries to be “ready-to-map” small RNA sequences. DNApi is compatible with Python 2 and 3, and is available at https://github.com/jnktsj/DNApi. The 731 small RNA libraries used for DNApi evaluation were from human tissues and were carefully and manually collected. This study also provides readers with the curated datasets that can be integrated into their studies. PMID:27736901

  15. Complete cDNA and derived amino acid sequence of human factor V

    SciTech Connect

    Jenny, R.J.; Pittman, D.D.; Toole, J.J.; Kriz, R.W.; Aldape, R.A.; Hewick, R.M.; Kaufman, R.J.; Mann, K.G.

    1987-07-01

    cDNA clones encoding human factor V have been isolated from an oligo(dT)-primed human fetal liver cDNA library prepared with vector Charon 21A. The cDNA sequence of factor V from three overlapping clones includes a 6672-base-pair (bp) coding region, a 90-bp 5' untranslated region, and a 163-bp 3' untranslated region within which is a poly(A) tail. The deduced amino acid sequence consists of 2224 amino acids inclusive of a 28-amino acid leader peptide. Direct comparison with human factor VIII reveals considerable homology between proteins in amino acid sequence and domain structure: a triplicated A domain and duplicated C domain show approx. 40% identity with the corresponding domains in factor VIII. As in factor VIII, the A domains of factor V share approx. 40% amino acid-sequence homology with the three highly conserved domains in ceruloplasmin. The B domain of factor V contains 35 tandem and approx. 9 additional semiconserved repeats of nine amino acids of the form Asp-Leu-Ser-Gln-Thr-Thr/Asn-Leu-Ser-Pro and 2 additional semiconserved repeats of 17 amino acids. Factor V contains 37 potential N-linked glycosylation sites, 25 of which are in the B domain, and a total of 19 cysteine residues.

  16. An analysis of amino acid sequences surrounding archaeal glycoprotein sequons.

    PubMed

    Abu-Qarn, Mehtap; Eichler, Jerry

    2007-05-01

    Despite having provided the first example of a prokaryal glycoprotein, little is known of the rules governing the N-glycosylation process in Archaea. As in Eukarya and Bacteria, archaeal N-glycosylation takes place at the Asn residues of Asn-X-Ser/Thr sequons. Since not all sequons are utilized, it is clear that other factors, including the context in which a sequon exists, affect glycosylation efficiency. As yet, the contribution to N-glycosylation made by sequon-bordering residues and other related factors in Archaea remains unaddressed. In the following, the surroundings of Asn residues confirmed by experiment as modified were analyzed in an attempt to define sequence rules and requirements for archaeal N-glycosylation.

  17. All-atom 3D structure prediction of transmembrane β-barrel proteins from sequences

    PubMed Central

    Hayat, Sikander; Sander, Chris; Marks, Debora S.

    2015-01-01

    Transmembrane β-barrels (TMBs) carry out major functions in substrate transport and protein biogenesis but experimental determination of their 3D structure is challenging. Encouraged by successful de novo 3D structure prediction of globular and α-helical membrane proteins from sequence alignments alone, we developed an approach to predict the 3D structure of TMBs. The approach combines the maximum-entropy evolutionary coupling method for predicting residue contacts (EVfold) with a machine-learning approach (boctopus2) for predicting β-strands in the barrel. In a blinded test for 19 TMB proteins of known structure that have a sufficient number of diverse homologous sequences available, this combined method (EVfold_bb) predicts hydrogen-bonded residue pairs between adjacent β-strands at an accuracy of ∼70%. This accuracy is sufficient for the generation of all-atom 3D models. In the transmembrane barrel region, the average 3D structure accuracy [template-modeling (TM) score] of top-ranked models is 0.54 (ranging from 0.36 to 0.85), with a higher (44%) number of residue pairs in correct strand–strand registration than in earlier methods (18%). Although the nonbarrel regions are predicted less accurately overall, the evolutionary couplings identify some highly constrained loop residues and, for FecA protein, the barrel including the structure of a plug domain can be accurately modeled (TM score = 0.68). Lower prediction accuracy tends to be associated with insufficient sequence information and we therefore expect increasing numbers of β-barrel families to become accessible to accurate 3D structure prediction as the number of available sequences increases. PMID:25858953

  18. All-atom 3D structure prediction of transmembrane β-barrel proteins from sequences.

    PubMed

    Hayat, Sikander; Sander, Chris; Marks, Debora S; Elofsson, Arne

    2015-04-28

    Transmembrane β-barrels (TMBs) carry out major functions in substrate transport and protein biogenesis but experimental determination of their 3D structure is challenging. Encouraged by successful de novo 3D structure prediction of globular and α-helical membrane proteins from sequence alignments alone, we developed an approach to predict the 3D structure of TMBs. The approach combines the maximum-entropy evolutionary coupling method for predicting residue contacts (EVfold) with a machine-learning approach (boctopus2) for predicting β-strands in the barrel. In a blinded test for 19 TMB proteins of known structure that have a sufficient number of diverse homologous sequences available, this combined method (EVfold_bb) predicts hydrogen-bonded residue pairs between adjacent β-strands at an accuracy of ∼70%. This accuracy is sufficient for the generation of all-atom 3D models. In the transmembrane barrel region, the average 3D structure accuracy [template-modeling (TM) score] of top-ranked models is 0.54 (ranging from 0.36 to 0.85), with a higher (44%) number of residue pairs in correct strand-strand registration than in earlier methods (18%). Although the nonbarrel regions are predicted less accurately overall, the evolutionary couplings identify some highly constrained loop residues and, for FecA protein, the barrel including the structure of a plug domain can be accurately modeled (TM score = 0.68). Lower prediction accuracy tends to be associated with insufficient sequence information and we therefore expect increasing numbers of β-barrel families to become accessible to accurate 3D structure prediction as the number of available sequences increases.

  19. Using Fourier spectrum analysis and pseudo amino acid composition for prediction of membrane protein types.

    PubMed

    Liu, Hui; Yang, Jie; Wang, Meng; Xue, Li; Chou, Kuo-Chen

    2005-08-01

    Membrane proteins are generally classified into the following five types: (1) type I membrane protein, (2) type II membrane protein, (3) multipass transmembrane proteins, (4) lipid chain-anchored membrane proteins, and (5) GPI-anchored membrane proteins. Given the sequence of an uncharacterized membrane protein, how can we identify which one of the above five types it belongs to? This is important because the biological function of a membrane protein is closely correlated with its type. Particularly, with the explosion of protein sequences entering into databanks, it is in high demand to develop an automated method to address this problem. To realize this, the key is to catch the statistical characteristics for each of the five types. However, it is not easy because they are buried in a pile of long and complicated sequences. In this paper, based on the concept of the pseudo amino acid composition (Chou, K. C. (2001). PROTEINS: Structure, Function, and Genetics 43: 246-255), the technique of Fourier spectrum analysis is introduced. By doing so, the sample of a protein is represented by a set of discrete components that can incorporate a considerable amount of the sequence order effects as well as its amino acid composition information. On the basis of such a statistical frame, the support vector machine (SVM) is introduced to perform predictions. High success rates were yielded by the self-consistency test, jackknife test, and independent dataset test, suggesting that the current approach holds a promising potential to become a high throughput tool for membrane protein type prediction as well as other related areas.

  20. beta Structure of aqueous staphylococcal enterotoxin B by spectropolarimetry and sequence-based conformational predictions.

    PubMed

    Muñoz, P A; Warren, J R; Noelken, M E

    1976-10-19

    Conformations of the globular protein staphylococcal enterotoxin B have been examined experimentally by ultraviolet circular dichroism (CD) and visible optical rotatory dispersion (ORD). Chen-Yang-Chau analysis (Chen, Y.-H., Yang, J.T., and Chau, K. H. (1974), Biochemistry 13, 3350) of the far-ultraviolet CD spectrum of native enterotoxin B revealed (assuming an average helix length of 11 residues) 9% alpha helix, 38% beta structure, and 53% random coil. A fourfold increase in alpha-helix was observed for enterotoxin exposed to 0.2% sodium dodecyl sulfate, behavior typical for globular proteins of low helical content. Values of -40 to -50 for the Moffitt-Yang parameter b0 calculated from visible ORD suggested 6-13% alpha helix in native enterotoxin. Application of a new predictive model (Chou, P. Y., and Fasman, G. D. (1974), Biochemistry 13, 222) to the amino acid sequence of enterotoxin B indicated 11% alpha helix, 34% beta structure, and 55% coil in native enterotoxin. The excellent agreement for the amount of alpha and beta conformation utilizing different optical and predictive methods indicates beta structure as the dominant secondary structure in native enterotoxin B. Most of the beta structure is predicted by Chou-Fasman analysis to reside in two large regions of antiparallel beta sheet involving residues 81-148 and residues 184-217. Such highly cooperative regions of anti-parallel beta sheet account for the slow unfolding of enterotoxin B in concentrated guanidine hydrochloride and rapid folding of guanidine hydrochloride denatured enterotoxin B to native conformation(s) (Warren, J.R., Spero, L., and Metzger, J. F. (1974), Biochemistry 13, 1678). A more than twofold increase in alpha-helix content with a small diminution in beta structure was detected by CD and ORD upon acidification of aqueous enterotoxin to pH 2.5. Thus, the beta structure of enterotoxin B appears to resist isothermal denaturation and constitutes a stable interior core of structure in the

  1. Nucleic and Amino Acid Sequences Support Structure-Based Viral Classification

    PubMed Central

    Sinclair, Robert M.; Ravantti, Janne J.

    2017-01-01

    ABSTRACT Viral capsids ensure viral genome integrity by protecting the enclosed nucleic acids. Interactions between the genome and capsid and between individual capsid proteins (i.e., capsid architecture) are intimate and are expected to be characterized by strong evolutionary conservation. For this reason, a capsid structure-based viral classification has been proposed as a way to bring order to the viral universe. The seeming lack of sufficient sequence similarity to reproduce this classification has made it difficult to reject structural convergence as the basis for the classification. We reinvestigate whether the structure-based classification for viral coat proteins making icosahedral virus capsids is in fact supported by previously undetected sequence similarity. Since codon choices can influence nascent protein folding cotranslationally, we searched for both amino acid and nucleotide sequence similarity. To demonstrate the sensitivity of the approach, we identify a candidate gene for the pandoravirus capsid protein. We show that the structure-based classification is strongly supported by amino acid and also nucleotide sequence similarities, suggesting that the similarities are due to common descent. The correspondence between structure-based and sequence-based analyses of the same proteins shown here allows them to be used in future analyses of the relationship between linear sequence information and macromolecular function, as well as between linear sequence and protein folds. IMPORTANCE Viral capsids protect nucleic acid genomes, which in turn encode capsid proteins. This tight coupling of protein shell and nucleic acids, together with strong functional constraints on capsid protein folding and architecture, leads to the hypothesis that capsid protein-coding nucleotide sequences may retain signatures of ancient viral evolution. We have been able to show that this is indeed the case, using the major capsid proteins of viruses forming icosahedral capsids

  2. Sequence features of viral and human Internal Ribosome Entry Sites predictive of their activity.

    PubMed

    Gritsenko, Alexey A; Weingarten-Gabbay, Shira; Elias-Kirma, Shani; Nir, Ronit; de Ridder, Dick; Segal, Eran

    2017-09-18

    Translation of mRNAs through Internal Ribosome Entry Sites (IRESs) has emerged as a prominent mechanism of cellular and viral initiation. It supports cap-independent translation of select cellular genes under normal conditions, and in conditions when cap-dependent translation is inhibited. IRES structure and sequence are believed to be involved in this process. However, due to the small number of IRESs known, there have been no systematic investigations of the determinants of IRES activity. With the recent discovery of thousands of novel IRESs in humans and viruses, the next challenge is to decipher the sequence determinants of IRES activity. We present the first in-depth computational analysis of a large body of IRESs, exploring RNA sequence features predictive of IRES activity. We identified predictive k-mer features resembling IRES trans-acting factor (ITAF) binding motifs across human and viral IRESs, and found that their effect on expression depends on their sequence, number and position. Our results also suggest that the architecture of retroviral IRESs differs from that of other viruses, presumably due to their exposure to the nuclear environment. Finally, we measured IRES activity of synthetically designed sequences to confirm our prediction of increasing activity as a function of the number of short IRES elements.
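
    The k-mer features mentioned in this abstract (sequence, number and position of short motifs) can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the choice of k, and the toy RNA string are assumptions made here for illustration only.

```python
from collections import defaultdict


def kmer_positions(seq, k):
    """Map each k-mer to the list of 0-based start positions where it occurs."""
    positions = defaultdict(list)
    for start in range(len(seq) - k + 1):
        positions[seq[start:start + k]].append(start)
    return dict(positions)


def kmer_features(seq, k):
    """Summarise each k-mer by its count and mean start position.

    These two numbers are a hypothetical stand-in for the "number" and
    "position" properties named in the abstract.
    """
    return {
        kmer: {"count": len(starts), "mean_start": sum(starts) / len(starts)}
        for kmer, starts in kmer_positions(seq, k).items()
    }


if __name__ == "__main__":
    toy_rna = "UUUCUUU"  # hypothetical sequence, not taken from the study
    for kmer, feats in sorted(kmer_features(toy_rna, 3).items()):
        print(kmer, feats)
```

    Feature dictionaries of this kind could then be fed to any standard regression or classification model; which model the study used is not stated in the abstract.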

  3. Classification of mouse VK groups based on the partial amino acid sequence to the first invariant tryptophan: impact of 14 new sequences from IgG myeloma proteins.

    PubMed

    Potter, M; Newell, J B; Rudikoff, S; Haber, E

    1982-12-01

    Fourteen new VK sequences derived from BALB/c IgG myeloma proteins were determined to the first invariant tryptophan (Trp 35). These partial sequences were compared with 65 other published VK sequences using a computer program. The 79 sequences were organized according to the length of the sequence from the amino terminus to the first invariant tryptophan (Trp 35), into seven groups (33, 34, 35, 36, 39, 40 and 41aa). A distance matrix of all 79 sequences was then computed, i.e. the number of amino acid substitutions necessary to convert one sequence to another was determined. From these data a dendrogram was constructed. Most of the VK sequences fell into clusters or closely related groups. The definition of a sequence group is arbitrary but facilitates the classification of VK proteins. We used 12 substitutions as the basis for defining a sequence group based on the known number of substitutions that are found in the VK21 proteins. By this criterion there were 18 groups in the Trp 35 dendrogram. Twelve of the 14 new sequences fell into one of these sequence groups; two formed new sequence groups. Collective amino acid sequencing is still encountering new VK structures indicating more sequences will be required to attain an accurate estimate of the total number of VK groups. Updated dendrograms can be quickly generated to include newly generated sequences.
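
    The distance-matrix and grouping steps described in this abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' program: it assumes equal-length sequences, counts substitutions position by position, treats the threshold as inclusive, and uses single-linkage grouping. The toy sequences and the threshold of 1 are hypothetical (the abstract uses 12 substitutions on real VK sequences).

```python
from itertools import combinations


def substitutions(a, b):
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(seqs):
    """Symmetric matrix of pairwise substitution counts."""
    n = len(seqs)
    dist = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = substitutions(seqs[i], seqs[j])
    return dist


def group_by_threshold(seqs, max_subs):
    """Single-linkage groups: link any pair within max_subs substitutions."""
    parent = list(range(len(seqs)))

    def find(i):
        # Follow parent links to the group representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    dist = distance_matrix(seqs)
    for i, j in combinations(range(len(seqs)), 2):
        if dist[i][j] <= max_subs:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(seqs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    toy = ["DIVMTQ", "DIVLTQ", "EIVLTQ", "QAVVTQ"]  # hypothetical fragments
    print(distance_matrix(toy))
    print(group_by_threshold(toy, max_subs=1))
```

    With single linkage, two sequences can share a group even when they differ by more than the threshold, as long as a chain of closer sequences connects them; a stricter criterion would change the group count.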

  4. Molecular cloning and sequencing of the human erythrocyte 2,3-bisphosphoglycerate mutase cDNA: revised amino acid sequence.

    PubMed Central

    Joulin, V; Peduzzi, J; Roméo, P H; Rosa, R; Valentin, C; Dubart, A; Lapeyre, B; Blouquit, Y; Garel, M C; Goossens, M

    1986-01-01

    The human erythrocyte 2,3-bisphosphoglycerate mutase (BPGM) is a multifunctional enzyme which controls the metabolism of 2,3-diphosphoglycerate, the main allosteric effector of haemoglobin. Several cDNA banks were constructed from reticulocyte mRNA, either by conventional cloning methods in pBR322 and screening with specific mixed oligonucleotide probes, or in the expression vector lambda gt 11. The largest cDNA isolated contained 1673 bases [plus the poly(A) tail], which is slightly smaller than the size of the intact mRNA as estimated by Northern blot analysis (approximately 1800 bases). This cDNA encodes a protein of 258 residues; the protein yielded 34 tryptic peptides which were subsequently isolated by h.p.l.c. Our nucleotide sequence data were entirely confirmed by the amino acid composition of these tryptic peptides and reveal several major differences from the published sequence; the revised amino acid sequence of human BPGM is presented. These findings represent the first step in the study of the expression and regulation of this enzyme as a specific marker of the erythroid cell line. PMID:3023066

  5. Cross-species protein sequence and gene structure prediction with fine-tuned Webscipio 2.0 and Scipio

    PubMed Central

    2011-01-01

    Background Obtaining transcripts of homologs of closely related organisms and retrieving the reconstructed exon-intron patterns of the genes is a very important process during the analysis of the evolution of a protein family and the comparative analysis of the exon-intron structure of a certain gene from different species. Due to the ever-increasing speed of genome sequencing, the gap to genome annotation is growing. Thus, tools for the correct prediction and reconstruction of genes in related organisms become more and more important. The tool Scipio, which can also be used via the graphical interface WebScipio, performs significant hit processing of the output of the Blat program to account for sequencing errors, missing sequence, and fragmented genome assemblies. However, Scipio has so far been limited to high sequence similarity and unable to reconstruct short exons. Results Scipio and WebScipio have fundamentally been extended to better reconstruct very short exons and intron splice sites and to be better suited for cross-species gene structure predictions. The Needleman-Wunsch algorithm has been implemented for the search for short parts of the query sequence that were not recognized by Blat. Those regions might either be short exons, divergent sequence at intron splice sites, or very divergent exons. We have shown the benefit and use of new parameters with several protein examples from completely different protein families in searches against species from several kingdoms of the eukaryotes. The performance of the new Scipio version has been tested in comparison with several similar tools. Conclusions With the new version of Scipio very short exons, terminal and internal, of even just one amino acid can correctly be reconstructed. Scipio is also able to correctly predict almost all genes in cross-species searches even if the ancestors of the species separated more than 100 Myr ago and if the protein sequence identity is below 80%. For our test cases Scipio
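
    For readers unfamiliar with the Needleman-Wunsch algorithm named in this abstract, a minimal global-alignment sketch follows. It is a textbook version, not Scipio's implementation: the scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for illustration, and Scipio's actual scoring scheme is not given in the abstract.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # First row and column: aligning a prefix against nothing costs gaps only.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: best of diagonal (pair), up (gap in b), left (gap in a).
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

    Because the table has (len(a)+1) x (len(b)+1) cells, the cost grows with the product of the two lengths, which is one reason such an exact search suits the short unmatched regions the abstract describes.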

  6. Plant mitochondrial nucleic acid sequences as a tool for phylogenetic analysis.

    PubMed Central

    Hiesel, R; von Haeseler, A; Brennicke, A

    1994-01-01

    To evaluate the potential of mitochondrial nucleic acid sequences as a phylogenetic tool, we have analyzed cytochrome oxidase subunit III (coxIII) coding sequences in representatives of the major groups of land plants. The phylogenetic tree derived from these mitochondrial sequences confirms the monophyletic origin of land plant mitochondria with the general order and descent of land plants deduced by other molecular, physiological, and morphological traits. The mitochondrial sequences strongly suggest a close phylogenetic relationship between Bryophyta and Lycopodiatae, whereas Psilophytatae cluster with the other vascular plants. In addition to the high sequence similarity, both Hepaticophytina and Lycopodiatae contain a related intron in the coxIII gene that, to our knowledge, is not found in any other plant species. The slowly evolving mitochondrial sequences of plants are shown to provide a useful phylogenetic tool to evaluate distant evolutionary relationships within this kingdom. PMID:7507251

  7. Applying a predict-observe-explain sequence in teaching of buoyant force

    NASA Astrophysics Data System (ADS)

    Radovanović, Jelena; Sliško, Josip

    2013-01-01

    An active learning sequence based on the predict-observe-explain teaching strategy is applied to a lesson on buoyant force. The results obtained clearly justify the use of this teaching method and suggest devising a series of activities to enable more effective removal of students’ commonly held alternative conceptions regarding floating and sinking.

  8. Combining structure and sequence information allows automated prediction of substrate specificities within enzyme families.

    PubMed

    Röttig, Marc; Rausch, Christian; Kohlbacher, Oliver

    2010-01-08

    An important aspect of the functional annotation of enzymes is not only the type of reaction catalysed by an enzyme, but also the substrate specificity, which can vary widely within the same family. In many cases, prediction of family membership and even substrate specificity is possible from enzyme sequence alone, using a nearest neighbour classification rule. However, the combination of structural information and sequence information can improve the interpretability and accuracy of predictive models. The method presented here, Active Site Classification (ASC), automatically extracts the residues lining the active site from one representative three-dimensional structure and the corresponding residues from sequences of other members of the family. From a set of representatives with known substrate specificity, a Support Vector Machine (SVM) can then learn a model of substrate specificity. Applied to a sequence of unknown specificity, the SVM can then predict the most likely substrate. The models can also be analysed to reveal the underlying structural reasons determining substrate specificities and thus yield valuable insights into mechanisms of enzyme specificity. We illustrate the high prediction accuracy achieved on two benchmark data sets and the structural insights gained from ASC by a detailed analysis of the family of decarboxylating dehydrogenases. The ASC web service is available at http://asc.informatik.uni-tuebingen.de/.

  9. Sequence signatures extracted from proximal promoters can be used to predict distal enhancers

    PubMed Central

    2013-01-01

    Background Gene expression is controlled by proximal promoters and distal regulatory elements such as enhancers. While the activity of some promoters can be invariant across tissues, enhancers tend to be highly tissue-specific. Results We compiled sets of tissue-specific promoters based on gene expression profiles of 79 human tissues and cell types. Putative transcription factor binding sites within each set of sequences were used to train a support vector machine classifier capable of distinguishing tissue-specific promoters from control sequences. We obtained reliable classifiers for 92% of the tissues, with an area under the receiver operating characteristic curve between 60% (for subthalamic nucleus promoters) and 98% (for heart promoters). We next used these classifiers to identify tissue-specific enhancers, scanning distal non-coding sequences in the loci of the 200 most highly and lowly expressed genes. Thirty percent of reliable classifiers produced consistent enhancer predictions, with significantly higher densities in the loci of the most highly expressed compared to lowly expressed genes. Liver enhancer predictions were assessed in vivo using the hydrodynamic tail vein injection assay. Fifty-eight percent of the predictions yielded significant enhancer activity in the mouse liver, whereas a control set of five sequences was completely negative. Conclusions We conclude that promoters of tissue-specific genes often contain unambiguous tissue-specific signatures that can be learned and used for the de novo prediction of enhancers. PMID:24156763

  10. Combining Structure and Sequence Information Allows Automated Prediction of Substrate Specificities within Enzyme Families

    PubMed Central

    Röttig, Marc; Rausch, Christian; Kohlbacher, Oliver

    2010-01-01

    An important aspect of the functional annotation of enzymes is not only the type of reaction catalysed by an enzyme, but also the substrate specificity, which can vary widely within the same family. In many cases, prediction of family membership and even substrate specificity is possible from enzyme sequence alone, using a nearest neighbour classification rule. However, the combination of structural information and sequence information can improve the interpretability and accuracy of predictive models. The method presented here, Active Site Classification (ASC), automatically extracts the residues lining the active site from one representative three-dimensional structure and the corresponding residues from sequences of other members of the family. From a set of representatives with known substrate specificity, a Support Vector Machine (SVM) can then learn a model of substrate specificity. Applied to a sequence of unknown specificity, the SVM can then predict the most likely substrate. The models can also be analysed to reveal the underlying structural reasons determining substrate specificities and thus yield valuable insights into mechanisms of enzyme specificity. We illustrate the high prediction accuracy achieved on two benchmark data sets and the structural insights gained from ASC by a detailed analysis of the family of decarboxylating dehydrogenases. The ASC web service is available at http://asc.informatik.uni-tuebingen.de/. PMID:20072606

  11. Applying a Predict-Observe-Explain Sequence in Teaching of Buoyant Force

    ERIC Educational Resources Information Center

    Radovanovic, Jelena; Slisko, Josip

    2013-01-01

    An active learning sequence based on the predict-observe-explain teaching strategy is applied to a lesson on buoyant force. The results obtained clearly justify the use of this teaching method and suggest devising a series of activities to enable more effective removal of students' commonly held alternative conceptions regarding floating and…

  12. Linguistic and Spatial Skills Predict Early Arithmetic Development via Counting Sequence Knowledge

    ERIC Educational Resources Information Center

    Zhang, Xiao; Koponen, Tuire; Räsänen, Pekka; Aunola, Kaisa; Lerkkanen, Marja-Kristiina; Nurmi, Jari-Erik

    2014-01-01

    Utilizing a longitudinal sample of Finnish children (ages 6-10), two studies examined how early linguistic (spoken vs. written) and spatial skills predict later development of arithmetic, and whether counting sequence knowledge mediates these associations. In Study 1 (N = 1,880), letter knowledge and spatial visualization, measured in…

  15. Predicting Salmonella enterica subsp. enterica Serotypes by Repetitive Extragenic Palindromic Sequence-Based PCR

    USDA-ARS's Scientific Manuscript database

    The DiversiLab™ System, which employs repetitive extragenic palindromic sequence-based PCR (rep-PCR) to genotype microorganisms, was evaluated as a method to predict the serotype of Salmonella isolates. Two hundred and thirty-three Salmonella isolates belonging to 14 frequently isolated serotypes f...

  16. Detection and isolation of nucleic acid sequences using competitive hybridization probes

    DOEpatents

    Lucas, J.N.; Straume, T.; Bogen, K.T.

    1997-04-01

    A method for detecting a target nucleic acid sequence in a sample is provided using hybridization probes which competitively hybridize to a target nucleic acid. According to the method, a target nucleic acid sequence is hybridized to first and second hybridization probes which are complementary to overlapping portions of the target nucleic acid sequence, the first hybridization probe including a first complexing agent capable of forming a binding pair with a second complexing agent and the second hybridization probe including a detectable marker. The first complexing agent attached to the first hybridization probe is contacted with a second complexing agent, the second complexing agent being attached to a solid support such that when the first and second complexing agents are attached, target nucleic acid sequences hybridized to the first hybridization probe become immobilized on to the solid support. The immobilized target nucleic acids are then separated and detected by detecting the detectable marker attached to the second hybridization probe. A kit for performing the method is also provided. 7 figs.

  17. Detection and isolation of nucleic acid sequences using competitive hybridization probes

    DOEpatents

    Lucas, Joe N.; Straume, Tore; Bogen, Kenneth T.

    1997-01-01

    A method for detecting a target nucleic acid sequence in a sample is provided using hybridization probes which competitively hybridize to a target nucleic acid. According to the method, a target nucleic acid sequence is hybridized to first and second hybridization probes which are complementary to overlapping portions of the target nucleic acid sequence, the first hybridization probe including a first complexing agent capable of forming a binding pair with a second complexing agent and the second hybridization probe including a detectable marker. The first complexing agent attached to the first hybridization probe is contacted with a second complexing agent, the second complexing agent being attached to a solid support such that when the first and second complexing agents are attached, target nucleic acid sequences hybridized to the first hybridization probe become immobilized on to the solid support. The immobilized target nucleic acids are then separated and detected by detecting the detectable marker attached to the second hybridization probe. A kit for performing the method is also provided.

  18. Prediction and identification of sequences coding for orphan enzymes using genomic and metagenomic neighbours

    PubMed Central

    Yamada, Takuji; Waller, Alison S; Raes, Jeroen; Zelezniak, Aleksej; Perchat, Nadia; Perret, Alain; Salanoubat, Marcel; Patil, Kiran R; Weissenbach, Jean; Bork, Peer

    2012-01-01

    Despite the current wealth of sequencing data, one-third of all biochemically characterized metabolic enzymes lack a corresponding gene or protein sequence, and as such can be considered orphan enzymes. They represent a major gap between our molecular and biochemical knowledge, and consequently are not amenable to modern systemic analyses. As 555 of these orphan enzymes have metabolic pathway neighbours, we developed a global framework that utilizes the pathway and (meta)genomic neighbour information to assign candidate sequences to orphan enzymes. For 131 orphan enzymes (37% of those for which (meta)genomic neighbours are available), we associate sequences to them using scoring parameters with an estimated accuracy of 70%, implying functional annotation of 16 345 gene sequences in numerous (meta)genomes. As a case in point, two of these candidate sequences were experimentally validated to encode the predicted activity. In addition, we augmented the currently available genome-scale metabolic models with these new sequence–function associations and were able to expand the models by on average 8%, with a considerable change in the flux connectivity patterns and improved essentiality prediction. PMID:22569339

  19. Prediction of high-risk types of human papillomaviruses using statistical model of protein "sequence space".

    PubMed

    Wang, Cong; Hai, Yabing; Liu, Xiaoqing; Liu, Nanfang; Yao, Yuhua; He, Pingan; Dai, Qi

    2015-01-01

    Discrimination of high-risk types of human papillomaviruses plays an important role in the diagnosis and remedy of cervical cancer. Recently, several computational methods have been proposed based on protein sequence-based and structure-based information, but the information of their related proteins has not been used until now. In this paper, we proposed using protein "sequence space" to explore this information and used it to predict high-risk types of HPVs. The proposed method was tested on 68 samples with known HPV types and 4 samples without HPV types and further compared with the available approaches. The results show that the proposed method achieved the best performance among all the evaluated methods with accuracy 95.59% and F1-score 90.91%, which indicates that protein "sequence space" could potentially be used to improve prediction of high-risk types of HPVs.

  20. Prediction of aggregation rate and aggregation-prone segments in polypeptide sequences

    PubMed Central

    Tartaglia, Gian Gaetano; Cavalli, Andrea; Pellarin, Riccardo; Caflisch, Amedeo

    2005-01-01

    The reliable identification of β-aggregating stretches in protein sequences is essential for the development of therapeutic agents for Alzheimer’s and Parkinson’s diseases, as well as other pathological conditions associated with protein deposition. Here, a model based on physicochemical properties and computational design of β-aggregating peptide sequences is shown to be able to predict the aggregation rate over a large set of natural polypeptide sequences. Furthermore, the model identifies aggregation-prone fragments within proteins and predicts the parallel or anti-parallel β-sheet organization in fibrils. The model recognizes different β-aggregating segments in mammalian and nonmammalian prion proteins, providing insights into the species barrier for the transmission of the prion disease. PMID:16195556

  1. Airborne Precursors Predict Maternal Serum Perfluoroalkyl Acid Concentrations.

    PubMed

    Makey, Colleen M; Webster, Thomas F; Martin, Jonathan W; Shoeib, Mahiba; Harner, Tom; Dix-Cooper, Linda; Webster, Glenys M

    2017-07-05

    Human exposure to persistent perfluoroalkyl acids (PFAAs), including perfluorooctanoic acid (PFOA), perfluorononanoic acid (PFNA), and perfluorooctanesulfonate (PFOS), can occur directly from contaminated food, water, air, and dust. However, precursors to PFAAs (PreFAAs), such as dipolyfluoroalkyl phosphates (diPAPs), fluorotelomer alcohols (FTOHs), perfluorooctyl sulfonamides (FOSAs), and sulfonamidoethanols (FOSEs), which can be biotransformed to PFAAs, may also be a source of exposure. PFAAs were analyzed in 50 maternal sera samples collected in 2007-2008 from participants in Vancouver, Canada, while PFAAs and PreFAAs were measured in matching samples of residential bedroom air collected by passive sampler and in sieved vacuum dust (<150 μm). Concentrations of PreFAAs were higher than for PFAAs in air and dust. Positive associations were discovered between airborne 10:2 FTOH and serum PFOA and PFNA and between airborne MeFOSE and serum PFOS. On average, serum PFOS concentrations were 2.3 ng/mL (95%CI: 0.40, 4.3) higher in participants with airborne MeFOSE concentrations in the highest tertile relative to the lowest tertile. Among all PFAAs, only PFNA in air and vacuum dust predicted serum PFNA. Results suggest that airborne PFAA precursors were a source of PFOA, PFNA, and PFOS exposure in this population.

  2. Amino acid sequence around the active-site serine residue in the acyltransferase domain of goat mammary fatty acid synthetase.

    PubMed Central

    Mikkelsen, J; Højrup, P; Rasmussen, M M; Roepstorff, P; Knudsen, J

    1985-01-01

    Goat mammary fatty acid synthetase was labelled in the acyltransferase domain by formation of O-ester intermediates by incubation with [1-14C]acetyl-CoA and [2-14C]malonyl-CoA. Tryptic-digest and CNBr-cleavage peptides were isolated and purified by high-performance reverse-phase and ion-exchange liquid chromatography. The sequences of the malonyl- and acetyl-labelled peptides were shown to be identical. The results confirm the hypothesis that both acetyl and malonyl groups are transferred to the mammalian fatty acid synthetase complex by the same transferase. The sequence is compared with those of other fatty acid synthetase transferases. PMID:3922356

  3. Prediction and identification of some forbidden lines in the Ne I sequence. [in solar spectrum]

    NASA Technical Reports Server (NTRS)

    Kastner, S. O.

    1974-01-01

    A magnetic quadrupole transition which, according to a prediction by Garstang (1969), should have an appreciable transition probability in the higher ions of the Ne I sequence has recently been observed at high resolution in Fe XVII by Parkinson (1973), at 17.086 Å. Values of an interval predicted by calculations of Crance (1973) are plotted in a graph. Interval values obtained from the curve are used to predict certain transition wavelengths in the ions Si V through Cr XV.

  4. In-silico prediction of disorder content using hybrid sequence representation

    PubMed Central

    2011-01-01

    Background Intrinsically disordered proteins play important roles in various cellular activities and their prevalence was implicated in a number of human diseases. The knowledge of the content of the intrinsic disorder in proteins is useful for a variety of studies including estimation of the abundance of disorder in protein families, classes, and complete proteomes, and for the analysis of disorder-related protein functions. The above investigations currently utilize the disorder content derived from the per-residue disorder predictions. We show that these predictions may over- or under-predict the overall amount of disorder, which motivates development of novel tools for direct and accurate sequence-based prediction of the disorder content. Results We hypothesize that sequence-level aggregation of input information may provide more accurate content prediction when compared with the content extracted from the local window-based residue-level disorder predictors. We propose a novel predictor, DisCon, that takes advantage of a small set of 29 custom-designed descriptors that aggregate and hybridize information concerning sequence, evolutionary profiles, and predicted secondary structure, solvent accessibility, flexibility, and annotation of globular domains. Using these descriptors and a ridge regression model, DisCon predicts the content with low, 0.05, mean squared error and high, 0.68, Pearson correlation. This is a statistically significant improvement over the content computed from outputs of ten modern disorder predictors on a test dataset with proteins that share low sequence identity with the training sequences. The proposed predictive model is analyzed to discuss factors related to the prediction of the disorder content. Conclusions DisCon is a high-quality alternative for high-throughput annotation of the disorder content. We also empirically demonstrate that the DisCon's predictions can be used to improve binary annotations of the disordered residues from

  5. Ligation with nucleic acid sequence-based amplification.

    PubMed

    Ong, Carmichael; Tai, Warren; Sarma, Aartik; Opal, Steven M; Artenstein, Andrew W; Tripathi, Anubhav

    2012-01-01

    This work presents a novel method for detecting nucleic acid targets using a ligation step along with an isothermal, exponential amplification step. We use an engineered ssDNA with two variable regions on the ends, allowing us to design the probe for optimal reaction kinetics and primer binding. This two-part probe is ligated by T4 DNA Ligase only when both parts bind adjacently to the target. The assay demonstrates that the expected 72-nt RNA product appears only when the synthetic target, T4 ligase, and both probe fragments are present during the ligation step. An extraneous 38-nt RNA product also appears due to linear amplification of unligated probe (P3), but its presence does not cause a false-positive result. In addition, 40 mmol/L KCl in the final amplification mix was found to be optimal. It was also found that increasing P5 in excess of P3 helped with ligation and reduced the extraneous 38-nt RNA product. The assay was also tested with a single nucleotide polymorphism target, changing one base at the ligation site. The assay was able to yield a negative signal despite only a single-base change. Finally, using P3 and P5 with longer binding sites results in increased overall sensitivity of the reaction, showing that increasing ligation efficiency can improve the assay overall. We believe that this method can be used effectively for a number of diagnostic assays.

  6. Amino acid sequence and some properties of phytolacain G, a cysteine protease from growing fruit of pokeweed, Phytolacca americana.

    PubMed

    Uchikoba, T; Arima, K; Yonezawa, H; Shimada, M; Kaneda, M

    2000-10-18

    A protease, phytolacain G, has been found to appear on CM-Sepharose ion-exchange chromatography of greenish small-size fruits of pokeweed, Phytolacca americana L, from ca. 2 weeks after flowering, and increases during fruit enlargement. Reddish ripe fruit of the pokeweed contained both phytolacain G and R. The molecular mass of phytolacain G was estimated to be 25.5 kDa by SDS-PAGE. Its amino acid sequence was reconstructed by automated sequence analysis of the peptides obtained after cleavage with Achromobacter protease I, chymotrypsin, and cyanogen bromide. The enzyme is composed of 216 amino acid residues, of which it shares 152 identical amino acid residues (70%) with phytolacain R, 126 (58%) with melain G, 108 (50%) with papain, 106 (49%) with actinidain, and 96 (44%) with stem bromelain. The amino acid residues forming the substrate binding S(2) pocket of papain, Tyr67, Pro68, Trp69, Val133, and Phe207, were predicted to be replaced by Trp, Met, His, Ala, and Ser in phytolacain G, respectively. As a consequence of these substitutions, the S(2) pocket is expected to be less hydrophobic in phytolacain G than in papain.
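
    The identity percentages quoted in the abstract follow directly from the residue counts it reports. The short check below recomputes them from those counts; the function name is an illustrative assumption, and rounding to the nearest whole percent is assumed to match the abstract's presentation.

```python
TOTAL_RESIDUES = 216  # length of phytolacain G reported in the abstract

# Identical residue counts reported in the abstract for each comparison.
IDENTICAL_RESIDUES = {
    "phytolacain R": 152,
    "melain G": 126,
    "papain": 108,
    "actinidain": 106,
    "stem bromelain": 96,
}


def percent_identity(identical: int, total: int) -> int:
    """Percent identity rounded to the nearest whole number."""
    return round(100 * identical / total)


percentages = {
    name: percent_identity(count, TOTAL_RESIDUES)
    for name, count in IDENTICAL_RESIDUES.items()
}
```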

  7. FrameD: A flexible program for quality check and gene prediction in prokaryotic genomes and noisy matured eukaryotic sequences.

    PubMed

    Schiex, Thomas; Gouzy, Jérôme; Moisan, Annick; de Oliveira, Yannick

    2003-07-01

    We describe FrameD, a program that predicts coding regions in prokaryotic and matured eukaryotic sequences. Initially targeted at gene prediction in bacterial GC rich genomes, the gene model used in FrameD also allows genes to be predicted in the presence of frameshifts and partially undetermined sequences, which makes it also very suitable for gene prediction and frameshift correction in unfinished sequences such as EST and EST cluster sequences. Like recent eukaryotic gene prediction programs, FrameD also includes the ability to take into account protein similarity information both in its prediction and its graphical output. Its performance is evaluated on different bacterial genomes. The web site (http://genopole.toulouse.inra.fr/bioinfo/FrameD/FD) offers direct prediction, sequence correction and translation, and the ability to learn new models for new organisms.

  8. FrameD: a flexible program for quality check and gene prediction in prokaryotic genomes and noisy matured eukaryotic sequences

    PubMed Central

    Schiex, Thomas; Gouzy, Jérôme; Moisan, Annick; de Oliveira, Yannick

    2003-01-01

    We describe FrameD, a program that predicts coding regions in prokaryotic and matured eukaryotic sequences. Initially targeted at gene prediction in bacterial GC rich genomes, the gene model used in FrameD also allows genes to be predicted in the presence of frameshifts and partially undetermined sequences, which makes it also very suitable for gene prediction and frameshift correction in unfinished sequences such as EST and EST cluster sequences. Like recent eukaryotic gene prediction programs, FrameD also includes the ability to take into account protein similarity information both in its prediction and its graphical output. Its performance is evaluated on different bacterial genomes. The web site (http://genopole.toulouse.inra.fr/bioinfo/FrameD/FD) offers direct prediction, sequence correction and translation, and the ability to learn new models for new organisms. PMID:12824407

  9. Comprehensive red blood cell and platelet antigen prediction from whole genome sequencing: proof of principle

    PubMed Central

    Westhoff, Connie M.; Uy, Jon Michael; Aguad, Maria; Smeland‐Wagman, Robin; Kaufman, Richard M.; Rehm, Heidi L.; Green, Robert C.; Silberstein, Leslie E.

    2015-01-01

    BACKGROUND There are 346 serologically defined red blood cell (RBC) antigens and 33 serologically defined platelet (PLT) antigens, most of which have known genetic changes in 45 RBC or six PLT genes that correlate with antigen expression. Polymorphic sites associated with antigen expression in the primary literature and reference databases are annotated according to nucleotide positions in cDNA. This makes antigen prediction from next‐generation sequencing data challenging, since it uses genomic coordinates. STUDY DESIGN AND METHODS The conventional cDNA reference sequences for all known RBC and PLT genes that correlate with antigen expression were aligned to the human reference genome. The alignments allowed conversion of conventional cDNA nucleotide positions to the corresponding genomic coordinates. RBC and PLT antigen prediction was then performed using the human reference genome and whole genome sequencing (WGS) data with serologic confirmation. RESULTS Some major differences and alignment issues were found when attempting to convert the conventional cDNA to human reference genome sequences for the following genes: ABO, A4GALT, RHD, RHCE, FUT3, ACKR1 (previously DARC), ACHE, FUT2, CR1, GCNT2, and RHAG. However, it was possible to create usable alignments, which facilitated the prediction of all RBC and PLT antigens with a known molecular basis from WGS data. Traditional serologic typing for 18 RBC antigens was in agreement with the WGS‐based antigen predictions, providing proof of principle for this approach. CONCLUSION Detailed mapping of conventional cDNA annotated RBC and PLT alleles can enable accurate prediction of RBC and PLT antigens from whole genomic sequencing data. PMID:26634332
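
    The conversion of cDNA positions to genomic coordinates described in the abstract can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `Exon` record, the function name, and all coordinates are hypothetical, and the sketch assumes gap-free exon alignments, which the abstract notes was not the case for several genes.

```python
from typing import NamedTuple


class Exon(NamedTuple):
    cdna_start: int     # 1-based first cDNA position covered by this exon
    cdna_end: int       # 1-based last cDNA position covered by this exon
    genomic_start: int  # genomic coordinate aligned to cdna_start


def cdna_to_genomic(pos: int, exons: list[Exon], strand: str = "+") -> int:
    """Map a 1-based cDNA position to a genomic coordinate via exon blocks."""
    for exon in exons:
        if exon.cdna_start <= pos <= exon.cdna_end:
            offset = pos - exon.cdna_start
            # On the minus strand, genomic coordinates decrease along the cDNA.
            if strand == "+":
                return exon.genomic_start + offset
            return exon.genomic_start - offset
    raise ValueError(f"cDNA position {pos} is not covered by any exon")


# Hypothetical two-exon genes, one on each strand.
PLUS_GENE = [Exon(1, 100, 5000), Exon(101, 250, 7000)]
MINUS_GENE = [Exon(1, 100, 9000), Exon(101, 250, 6500)]

plus_coordinate = cdna_to_genomic(150, PLUS_GENE, strand="+")
minus_coordinate = cdna_to_genomic(150, MINUS_GENE, strand="-")
```

    Storing one block per exon keeps the lookup simple; alignments containing insertions or deletions would need to be split into smaller gap-free blocks first.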

  10. Systematic discovery of novel eukaryotic transcriptional regulators using sequence homology independent prediction

    DOE PAGES

    Bossi, Flavia; Fan, Jue; Xiao, Jun; ...

    2017-06-26

    Here, the molecular function of a gene is most commonly inferred by sequence similarity. Therefore, genes that lack sufficient sequence similarity to characterized genes (such as certain classes of transcriptional regulators) are difficult to classify using most function prediction algorithms and have remained uncharacterized. As a result, to identify novel transcriptional regulators systematically, we used a feature-based pipeline to screen protein families of unknown function. This method predicted 43 transcriptional regulator families in Arabidopsis thaliana, 7 families in Drosophila melanogaster, and 9 families in Homo sapiens. Literature curation validated 12 of the predicted families to be involved in transcriptional regulation. We tested 33 out of the 195 Arabidopsis putative transcriptional regulators for their ability to activate transcription of a reporter gene in planta and found twelve coactivators, five of which had no prior literature support. To investigate mechanisms of action in which the predicted regulators might work, we looked for interactors of an Arabidopsis candidate that did not show transactivation activity in planta and found that it might work with other members of its own family and a subunit of the Polycomb Repressive Complex 2 to regulate transcription. Our results demonstrate the feasibility of assigning molecular function to proteins of unknown function without depending on sequence similarity. In particular, we identified novel transcriptional regulators using biological features enriched in transcription factors. The predictions reported here should accelerate the characterization of novel regulators.

  11. Learning to Predict miRNA-mRNA Interactions from AGO CLIP Sequencing and CLASH Data

    PubMed Central

    Lu, Yuheng; Leslie, Christina S.

    2016-01-01

    Recent technologies like AGO CLIP sequencing and CLASH enable direct transcriptome-wide identification of AGO binding and miRNA target sites, but the most widely used miRNA target prediction algorithms do not exploit these data. Here we use discriminative learning on AGO CLIP and CLASH interactions to train a novel miRNA target prediction model. Our method combines two SVM classifiers, one to predict miRNA-mRNA duplexes and a second to learn a binding model of AGO’s local UTR sequence preferences and positional bias in 3’UTR isoforms. The duplex SVM model enables the prediction of non-canonical target sites and more accurately resolves miRNA interactions from AGO CLIP data than previous methods. The binding model is trained using a multi-task strategy to learn context-specific and common AGO sequence preferences. The duplex and common AGO binding models together outperform existing miRNA target prediction algorithms on held-out binding data. Open source code is available at https://bitbucket.org/leslielab/chimiric. PMID:27438777

  12. Comparative Analysis of Predicted Plastid-Targeted Proteomes of Sequenced Higher Plant Genomes

    PubMed Central

    Schaeffer, Scott; Harper, Artemus; Raja, Rajani; Jaiswal, Pankaj; Dhingra, Amit

    2014-01-01

    Plastids are actively involved in numerous plant processes critical to growth, development and adaptation. They play a primary role in photosynthesis, pigment and monoterpene synthesis, gravity sensing, starch and fatty acid synthesis, as well as oil, and protein storage. We applied two complementary methods to analyze the recently published apple genome (Malus × domestica) to identify putative plastid-targeted proteins, the first using TargetP and the second using a custom workflow utilizing a set of predictive programs. Apple shares roughly 40% of its 10,492 putative plastid-targeted proteins with that of the Arabidopsis (Arabidopsis thaliana) plastid-targeted proteome as identified by the Chloroplast 2010 project and ∼57% of its entire proteome with Arabidopsis. This suggests that the plastid-targeted proteomes between apple and Arabidopsis are different, and interestingly alludes to the presence of differential targeting of homologs between the two species. Co-expression analysis of 2,224 genes encoding putative plastid-targeted apple proteins suggests that they play a role in plant developmental and intermediary metabolism. Further, an inter-specific comparison of Arabidopsis, Prunus persica (Peach), Malus × domestica (Apple), Populus trichocarpa (Black cottonwood), Fragaria vesca (Woodland Strawberry), Solanum lycopersicum (Tomato) and Vitis vinifera (Grapevine) also identified a large number of novel species-specific plastid-targeted proteins. This analysis also revealed the presence of alternatively targeted homologs across species. Two separate analyses revealed that a small subset of proteins, one representing 289 protein clusters and the other 737 unique protein sequences, are conserved between seven plastid-targeted angiosperm proteomes. The majority of the novel proteins were annotated to play roles in stress response, transport, catabolic processes, and cellular component organization. Our results suggest that the current state of knowledge regarding

  13. Computational simulations of protein folding to engineer amino acid sequences to encourage desired supersecondary structure formation.

    PubMed

    Gerstman, Bernard S; Chapagain, Prem P

    2013-01-01

    The dynamics of protein folding are complicated because of the various types of amino acid interactions that create secondary, supersecondary, and tertiary interactions. Computational modeling can be used to simulate the biophysical and biochemical interactions that determine protein folding. Effective folding to a desired protein configuration requires a compromise between speed, stability, and specificity. If the primary sequence of amino acids emphasizes one of these characteristics, the others might suffer and the folding process may not be optimized. We provide an example of a model peptide whose primary sequence produces a highly stable supersecondary two-helix bundle structure, but at the expense of lower speed and specificity of the folding process. We show how computational simulations can be used to discover the configuration of the kinetic trap that causes the degradation in the speed and specificity of folding. We also show how amino acid sequences can be engineered by specific substitutions to optimize the folding to the desired supersecondary structure.

  14. Isolation and amino-acid sequence determination of monkey insulin and proinsulin.

    PubMed

    Naithani, V K; Steffens, G J; Tager, H S; Buse, G; Rubenstein, A H; Steiner, D F

    1984-05-01

    Insulin has been isolated and purified from rhesus monkey pancreas by means of acid-ethanol extraction, gel filtration and ion exchange chromatography. The complete amino-acid sequence of the hormone has been determined by amino-acid analysis of the oxidized A- and B-chains, by end group determination, by the identification of the C-terminal residues (AsnA21 and ThrB30) by carboxypeptidase A digestion and by Edman degradation of the S-carboxymethylated A- and B-chains. The 51-residue monkey insulin was shown to be identical to human insulin. From the known insulin and C-peptide sequence the primary sequence of monkey proinsulin has been proposed.

  15. Thin-film technology for direct visual detection of nucleic acid sequences: applications in clinical research.

    PubMed

    Jenison, Robert D; Bucala, Richard; Maul, Diana; Ward, David C

    2006-01-01

    Certain optical conditions permit the unaided eye to detect thickness changes on surfaces on the order of 20 Å, which are of similar dimensions to monomolecular interactions between proteins or hybridization of complementary nucleic acid sequences. Such detection exploits specific interference of reflected white light, wherein thickness changes are perceived as surface color changes. This technology, termed thin-film detection, allows for the visualization of subattomole amounts of nucleic acid targets, even in complex clinical samples. Thin-film technology has been applied to a broad range of clinically relevant indications, including the detection of pathogenic bacterial and viral nucleic acid sequences and the discrimination of sequence variations in human genes causally related to susceptibility or severity of disease.

  16. Amino acid sequences of two trypsin inhibitors from winged bean seeds (Psophocarpus tetragonolobus (L)DC.).

    PubMed

    Yamamoto, M; Hara, S; Ikenaka, T

    1983-09-01

    The trypsin inhibitor (WTI-1) purified from winged bean seeds is a Kunitz type protease inhibitor having a molecular weight of 19,200. WTI-1 inhibits bovine trypsin stoichiometrically, but not bovine alpha-chymotrypsin. The approximate Ki value for the trypsin-inhibitor complex is 2.5 × 10⁻⁹ M. The complete amino acid sequence of WTI-1 was determined by conventional methods. Comparison of the sequence with that of soybean trypsin inhibitor (STI) indicated that the sequence of WTI-1 had 50% homology with that of STI. WTI-1 was separated into 2 homologous inhibitors, WTI-1A and WTI-1B, by isoelectric focusing. The isoelectric points of WTI-1A and WTI-1B were 8.5 and 9.4, respectively, and their sequences were presumed from their amino acid compositions.

  17. Conservation of Shannon's redundancy for proteins. [information theory applied to amino acid sequences]

    NASA Technical Reports Server (NTRS)

    Gatlin, L. L.

    1974-01-01

    Concepts of information theory are applied to examine various proteins in terms of their redundancy in natural originators such as animals and plants. The Monte Carlo method is used to derive information parameters for random protein sequences. Real protein sequence parameters are compared with the standard parameters of protein sequences having a specific length. The tendency of a chain to contain some amino acids more frequently than others and the tendency of a chain to contain certain amino acid pairs more frequently than other pairs are used as randomness measures of individual protein sequences. Non-periodic proteins are generally found to have random Shannon redundancies except in cases of constraints due to short chain length and genetic codes. Redundant characteristics of highly periodic proteins are discussed. A degree of periodicity parameter is derived.
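
    The two randomness measures named in the abstract (unequal amino acid frequencies and unequal adjacent-pair frequencies) can be illustrated with a standard Shannon formulation. The abstract does not give the exact parameter definitions, so the sketch below is an assumption: it uses composition entropy, the conditional entropy of adjacent pairs, and redundancy R = 1 - H / log2(20). Function names are illustrative.

```python
import math
from collections import Counter

ALPHABET_SIZE = 20  # standard amino acids


def composition_entropy(seq: str) -> float:
    """Shannon entropy (bits per residue) of the amino acid composition."""
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())


def pair_conditional_entropy(seq: str) -> float:
    """Entropy (bits) of a residue given its predecessor, from adjacent pairs."""
    pairs = Counter(zip(seq, seq[1:]))
    firsts = Counter(seq[:-1])
    total = sum(pairs.values())
    return -sum(
        (c / total) * math.log2(c / firsts[a]) for (a, _b), c in pairs.items()
    )


def redundancy(entropy_bits: float) -> float:
    """Shannon redundancy relative to a uniform 20-letter alphabet."""
    return 1.0 - entropy_bits / math.log2(ALPHABET_SIZE)
```

    A chain using all 20 residues equally often has zero composition redundancy, while a strictly alternating chain such as "ACACAC" has zero conditional entropy because each residue fully determines the next.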

  18. Protein-Sol: a web tool for predicting protein solubility from sequence.

    PubMed

    Hebditch, Max; Carballo-Amador, M Alejandro; Charonis, Spyros; Curtis, Robin; Warwicker, Jim

    2017-10-01

    Protein solubility is an important property in industrial and therapeutic applications. Prediction is a challenge, despite a growing understanding of the relevant physicochemical properties. Protein-Sol is a web server for predicting protein solubility. Using available data for Escherichia coli protein solubility in a cell-free expression system, 35 sequence-based properties are calculated. Feature weights are determined from separation of low and high solubility subsets. The model returns a predicted solubility and an indication of the features which deviate most from average values. Two other properties are profiled in windowed calculation along the sequence: fold propensity, and net segment charge. The utility of these additional features is demonstrated with the example of thioredoxin. The Protein-Sol webserver is available at http://protein-sol.manchester.ac.uk. Contact: jim.warwicker@manchester.ac.uk.
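
    A windowed net segment charge profile of the kind described above might be sketched as follows. The charge assignments (+1 for K and R, -1 for D and E, 0 otherwise), the default window length, and the function name are assumptions of this sketch, not the parameters used by Protein-Sol.

```python
# Assumed residue charges at neutral pH; histidine is treated as neutral here.
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def windowed_net_charge(seq, window=5):
    """Return the net charge of every full-length window along seq."""
    if window < 1 or window > len(seq):
        return []
    charges = [CHARGE.get(residue, 0) for residue in seq]
    return [sum(charges[i:i + window]) for i in range(len(seq) - window + 1)]
```

    For example, windowed_net_charge("KKDEA", window=3) returns [1, -1, -2], one value per window position.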

  19. Bayesian segmental models with multiple sequence alignment profiles for protein secondary structure and contact map prediction.

    PubMed

    Chu, Wei; Ghahramani, Zoubin; Podtelezhnikov, Alexei; Wild, David L

    2006-01-01

    In this paper, we develop a segmental semi-Markov model (SSMM) for protein secondary structure prediction which incorporates multiple sequence alignment profiles with the purpose of improving the predictive performance. The segmental model is a generalization of the hidden Markov model where a hidden state generates segments of various length and secondary structure type. A novel parameterized model is proposed for the likelihood function that explicitly represents multiple sequence alignment profiles to capture the segmental conformation. Numerical results on benchmark data sets show that incorporating the profiles results in substantial improvements and the generalization performance is promising. By incorporating the information from long range interactions in beta-sheets, this model is also capable of carrying out inference on contact maps. This is an important advantage of probabilistic generative models over the traditional discriminative approach to protein secondary structure prediction. The Web server of our algorithm and supplementary materials are available at http://public.kgi.edu/-wild/bsm.html.

  20. RNA internal standard synthesis by nucleic acid sequence-based amplification for competitive quantitative amplification reactions.

    PubMed

    Lo, Wan-Yu; Baeumner, Antje J

    2007-02-15

    Nucleic acid sequence-based amplification (NASBA) reactions have been demonstrated to successfully synthesize new sequences based on deletion and insertion reactions. Two RNA internal standards were synthesized for use in competitive amplification reactions in which quantitative analysis can be achieved by coamplifying the internal standard with the wild type sample. The sequences were created in two consecutive NASBA reactions using the E. coli clpB mRNA sequence as model analyte. The primer sequences of the wild type sequence were maintained, and a 20-nt-long segment inside the amplicon region was exchanged for a new segment of similar GC content and melting temperature. The new RNA sequence was thus amplifiable using the wild type primers and detectable via a new inserted sequence. In the first reaction, the forwarding primer and an additional 20-nt-long sequence was deleted and replaced by a new 20-nt-long sequence. In the second reaction, a forwarding primer containing as 5' overhang sequence the wild type primer sequence was used. The presence of pure internal standard was verified using electrochemiluminescence and RNA lateral-flow biosensor analysis. Additional sequence deletion in order to shorten the internal standard amplicons and thus generate higher detection signals was found not to be required. Finally, a competitive NASBA reaction between one internal standard and the wild type sequence was carried out proving its functionality. This new rapid construction method via NASBA provides advantages over the traditional techniques since it requires no traditional cloning procedures, no thermocyclers, and can be completed in less than 4 h.
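
    The requirement that the replacement segment have similar GC content and melting temperature can be checked with a rough sketch like the one below. The record does not say how melting temperature was estimated; the Wallace rule used here (2 degrees per A/T, 4 degrees per G/C) is a crude approximation for short DNA oligonucleotides, and the tolerances and function names are hypothetical.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def wallace_tm(seq):
    """Rough melting temperature in degrees C by the Wallace rule; U counts as T."""
    seq = seq.upper().replace("U", "T")
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def similar_segments(a, b, gc_tol=0.05, tm_tol=2):
    """True if two segments agree in GC fraction and Wallace Tm within tolerances."""
    return (abs(gc_fraction(a) - gc_fraction(b)) <= gc_tol
            and abs(wallace_tm(a) - wallace_tm(b)) <= tm_tol)
```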

  1. RNA secondary structure prediction from sequence alignments using a network of k-nearest neighbor classifiers

    PubMed Central

    BINDEWALD, ECKART; SHAPIRO, BRUCE A.

    2006-01-01

    We present a machine learning method (a hierarchical network of k-nearest neighbor classifiers) that uses an RNA sequence alignment in order to predict a consensus RNA secondary structure. The input to the network is the mutual information, the fraction of complementary nucleotides, and a novel consensus RNAfold secondary structure prediction of a pair of alignment columns and its nearest neighbors. Given this input, the network computes a prediction as to whether a particular pair of alignment columns corresponds to a base pair. By using a comprehensive test set of 49 RFAM alignments, the program KNetFold achieves an average Matthews correlation coefficient of 0.81. This is a significant improvement compared with the secondary structure prediction methods PFOLD and RNAalifold. By using the example of archaeal RNase P, we show that the program can also predict pseudoknot interactions. PMID:16495232
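
    Two of the inputs named above, the mutual information and the fraction of complementary nucleotides for a pair of alignment columns, can be sketched as follows. This is not KNetFold's code: the function names, the skipping of gapped rows, and the inclusion of G-U wobble pairs as complementary are assumptions of this sketch.

```python
import math
from collections import Counter

# Assumed complementary pairs: Watson-Crick plus G-U wobble.
COMPLEMENTARY = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def _ungapped_pairs(col_i, col_j):
    """Row-wise pairs of two alignment columns, skipping rows with a gap."""
    return [(a, b) for a, b in zip(col_i, col_j) if a != "-" and b != "-"]


def column_mutual_information(col_i, col_j):
    """Mutual information (bits) between two alignment columns."""
    pairs = _ungapped_pairs(col_i, col_j)
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    freq_i = Counter(a for a, _ in pairs)
    freq_j = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((freq_i[a] / n) * (freq_j[b] / n)))
    return mi


def complementary_fraction(col_i, col_j):
    """Fraction of ungapped rows whose two nucleotides can base-pair."""
    pairs = _ungapped_pairs(col_i, col_j)
    if not pairs:
        return 0.0
    return sum(pair in COMPLEMENTARY for pair in pairs) / len(pairs)
```

    Two columns that covary perfectly (for example "GGCC" against "CCGG") give 1 bit of mutual information, whereas two fully conserved columns give 0 even when every row is complementary, which is one reason the two measures are informative together.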

  2. RNA secondary structure prediction from sequence alignments using a network of k-nearest neighbor classifiers.

    PubMed

    Bindewald, Eckart; Shapiro, Bruce A

    2006-03-01

    We present a machine learning method (a hierarchical network of k-nearest neighbor classifiers) that uses an RNA sequence alignment in order to predict a consensus RNA secondary structure. The input to the network is the mutual information, the fraction of complementary nucleotides, and a novel consensus RNAfold secondary structure prediction of a pair of alignment columns and its nearest neighbors. Given this input, the network computes a prediction as to whether a particular pair of alignment columns corresponds to a base pair. By using a comprehensive test set of 49 RFAM alignments, the program KNetFold achieves an average Matthews correlation coefficient of 0.81. This is a significant improvement compared with the secondary structure prediction methods PFOLD and RNAalifold. By using the example of archaeal RNase P, we show that the program can also predict pseudoknot interactions.

  3. Improved sequence-based prediction of disordered regions with multilayer fusion of multiple information sources

    PubMed Central

    Mizianty, Marcin J.; Stach, Wojciech; Chen, Ke; Kedarisetti, Kanaka Durga; Disfani, Fatemeh Miri; Kurgan, Lukasz

    2010-01-01

    Motivation: Intrinsically disordered proteins play a crucial role in numerous regulatory processes. Their abundance and ubiquity combined with a relatively low quantity of their annotations motivate research toward the development of computational models that predict disordered regions from protein sequences. Although the prediction quality of these methods continues to rise, novel and improved predictors are urgently needed. Results: We propose a novel method, named MFDp (Multilayered Fusion-based Disorder predictor), that aims to improve over the current disorder predictors. MFDp is an ensemble of 3 Support Vector Machines specialized for the prediction of short, long and generic disordered regions. It combines three complementary disorder predictors, sequence, sequence profiles, predicted secondary structure, solvent accessibility, backbone dihedral torsion angles, residue flexibility and B-factors. Our method utilizes a custom-designed set of features that are based on raw predictions and aggregated raw values and recognizes various types of disorder. The MFDp is compared at the residue level on two datasets against eight recent disorder predictors and top-performing methods from the most recent CASP8 experiment. In spite of using training chains with ≤25% similarity to the test sequences, our method consistently and significantly outperforms the other methods based on the MCC index. The MFDp outperforms modern disorder predictors for the binary disorder assignment and provides competitive real-valued predictions. The MFDp's outputs are also shown to outperform the other methods in the identification of proteins with long disordered regions. Availability: http://biomine.ece.ualberta.ca/MFDp.html Supplementary information: Supplementary data are available at Bioinformatics online. Contact: lkurgan@ece.ualberta.ca PMID:20823312

  4. Conversion of amino-acid sequence in proteins to classical music: search for auditory patterns

    PubMed Central

    2007-01-01

    We have converted genome-encoded protein sequences into musical notes to reveal auditory patterns without compromising musicality. We derived a reduced range of 13 base notes by pairing similar amino acids and distinguishing them using variations of three-note chords and codon distribution to dictate rhythm. The conversion will help make genomic coding sequences more approachable for the general public, young children, and vision-impaired scientists. PMID:17477882

  5. Solubility Challenges in High Concentration Monoclonal Antibody Formulations: Relationship with Amino Acid Sequence and Intermolecular Interactions.

    PubMed

    Pindrus, Mariya; Shire, Steven J; Kelley, Robert F; Demeule, Barthélemy; Wong, Rita; Xu, Yiren; Yadav, Sandeep

    2015-11-02

    The purpose of this work was to elucidate the molecular interactions leading to monoclonal antibody self-association and precipitation and utilize biophysical measurements to predict solubility behavior at high protein concentration. Two monoclonal antibodies (mAb-G and mAb-R) binding to overlapping epitopes were investigated. Precipitation of mAb-G solutions was most prominent at high ionic strength conditions and demonstrated strong dependence on ionic strength, as well as slight dependence on solution pH. At similar conditions no precipitation was observed for mAb-R solutions. Intermolecular interactions (interaction parameter, kD) related well with high concentration solubility behavior of both antibodies. Upon increasing buffer ionic strength, interactions of mAb-R tended to weaken, while those of mAb-G became more attractive. To investigate the role of amino acid sequence on precipitation behavior, mutants were designed by substituting the CDR of mAb-R into the mAb-G framework (GM-1) or deleting two hydrophobic residues in the CDR of mAb-G (GM-2). No precipitation was observed at high ionic strength for either mutant. The molecular interactions of mutants were similar in magnitude to those of mAb-R. The results suggest that presence of hydrophobic groups in the CDR of mAb-G may be responsible for compromising its solubility at high ionic strength conditions since deleting these residues mitigated the solubility issue.

  6. Rat androgen-binding protein: evidence for identical subunits and amino acid sequence homology with human sex hormone-binding globulin.

    PubMed

    Joseph, D R; Hall, S H; French, F S

    1987-01-01

    The cDNA for rat androgen-binding protein (ABP) was previously isolated from a bacteriophage lambda gt11 rat testis cDNA library and its identity was confirmed by epitope selection. Hybrid-arrested translation studies have now demonstrated the identity of the isolates. The nucleotide sequence of a near full-length cDNA encodes a 403-amino acid precursor (Mr = 44,539), which agrees in size with the cell-free translation product (Mr = 45,000) of ABP mRNA. Putative sites of N-glycosylation and signal peptide cleavage were identified. Comparison of the predicted amino acid sequence of rat ABP with the amino-terminal amino acid sequence of human sex hormone-binding globulin revealed that 17 of 25 residues are identical. On the basis of the predicted amino acid sequence the molecular weight of the primary translation product, lacking the signal peptide, was 41,183. Hybridization analyses indicated that the two subunits of ABP are coded for by a single gene and a single mRNA species. Our results suggest that ABP consists of two subunits with identical primary sequences and that differences in post-translational processing result in the production of 47,000 and 41,000 molecular weight monomers.

  7. Diagnostics based on nucleic acid sequence variant profiling: PCR, hybridization, and NGS approaches.

    PubMed

    Khodakov, Dmitriy; Wang, Chunyan; Zhang, David Yu

    2016-10-01

    Nucleic acid sequence variations have been implicated in many diseases, and reliable detection and quantitation of DNA/RNA biomarkers can inform effective therapeutic action, enabling precision medicine. Nucleic acid analysis technologies being translated into the clinic can broadly be classified into hybridization, PCR, and sequencing, as well as their combinations. Here we review the molecular mechanisms of popular commercial assays, and their progress in translation into in vitro diagnostics. Copyright © 2016 The Authors. Published by Elsevier B.V. All rights reserved.

  8. Key amino acids in the aryl hydrocarbon receptor predict dioxin sensitivity in avian species.

    PubMed

    Head, Jessica A; Hahn, Mark E; Kennedy, Sean W

    2008-10-01

    Dioxin-like compounds are toxic to most vertebrates, but significant differences in sensitivity exist among species. A recent study suggests that the amino acid residues corresponding to Ile324 and Ser380 in the chicken aryl hydrocarbon receptor 1 (AHR1) are important determinants of differential biochemical responses to 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD) in chickens and common terns. Here, we investigate whether the identity of these amino acid residues can predict embryonic sensitivity to dioxin-like compounds in a wide range of birds. AHR1 sequences were determined in species for which sensitivity data were available. Of all the species surveyed, chickens were unique in having the Ile/Ser genotype and were also the most sensitive to dioxin-like compounds. Turkeys, ring-necked pheasants, and Eastern bluebirds (intermediate Ile/Ala genotype) were less sensitive than chickens but more sensitive than American kestrels, common terns, double-crested cormorants, Japanese quail, herring gulls, or ducks (Val/Ala genotype). Our work suggests that key amino acids in the AHR1 ligand binding domain are predictive of broad categories of dioxin sensitivity in avian species. Given the large degree of variation in species sensitivity and the paucity of species-specific toxicity data, a genetic screen based on these findings could substantially improve risk assessment for dioxin-like compounds in wild birds.

  9. Ab initio detection of fuzzy amino acid tandem repeats in protein sequences

    PubMed Central

    2012-01-01

    Background: Tandem repetitions within protein amino acid sequences often correspond to regular secondary structures and form multi-repeat 3D assemblies of varied size and function. Developing internal repetitions is one of the evolutionary mechanisms that proteins employ to adapt their structure and function under evolutionary pressure. While there is keen interest in understanding such phenomena, detection of repeating structures based only on sequence analysis is considered an arduous task, since structure and function are often preserved even under considerable sequence divergence (fuzzy tandem repeats). Results: In this paper we present PTRStalker, a new algorithm for ab-initio detection of fuzzy tandem repeats in protein amino acid sequences. In the reported results we show that by feeding PTRStalker with amino acid sequences from the UniProtKB/Swiss-Prot database we detect novel tandemly repeated structures not captured by other state-of-the-art tools. Experiments with membrane proteins indicate that PTRStalker can detect global symmetries in the primary structure which are then reflected in the tertiary structure. Conclusions: PTRStalker is able to detect fuzzy tandem repeating structures in protein sequences, with performance beyond the current state of the art. Such a tool may be a valuable support to investigating protein structural properties when tertiary X-ray data is not available. PMID:22536906

  10. Complete amino acid sequence of the catalytic subunit of bovine cardiac muscle cyclic AMP-dependent protein kinase.

    PubMed Central

    Shoji, S; Parmelee, D C; Wade, R D; Kumar, S; Ericsson, L H; Walsh, K A; Neurath, H; Long, G L; Demaille, J G; Fischer, E H; Titani, K

    1981-01-01

    The complete amino acid sequence of the 349-residue catalytic subunit of cyclic AMP-dependent protein kinase from bovine cardiac muscle is presented. The sequence of the subunit (Mr 40,580 including phosphate groups at threonine-196 and serine-337) was derived largely by automated Edman degradation of nine fragments generated from the carboxymethylated protein by cleavage of methionyl bonds with cyanogen bromide. These fragments were aligned along the polypeptide chain by analysis of methionine-containing tryptic peptides isolated from protein radiolabeled in vitro by [14C]methyl exchange at methionyl residues. The molecule contains only two cysteinyl residues, at positions 198 and 342. It is relatively polar, containing clusters of cationic residues toward the amino terminus and anionic residues towards the carboxyl terminus. Predictions of secondary structure suggest the presence of three major domains with approximately half of the residues occurring in alpha-helices and 12% in beta-strands. PMID:6262777

  11. Multiscale approach to the predictability of earthquakes and of synthetic SOC sequences

    NASA Astrophysics Data System (ADS)

    Peresan, A.; Panza, G. F.

    2003-04-01

    The power-law scaling expressed by the Gutenberg-Richter (GR) law is the main argument in favour of the Self-Organised Criticality (SOC) of seismic phenomena. Nevertheless the limits of validity of the GR law and the phenomenology reproduced by the SOC models, as well as their consequences for earthquake predictability, still remain quite undefined. According to the Multiscale Seismicity (MS) model, the GR law describes adequately only the ensemble of earthquakes that are geometrically small with respect to the dimensions of the analysed region. The MS model and its implications for intermediate-term medium-range earthquake predictions are thus examined, considering both the seismicity observed in the Italian territory and the synthetic sequences of events generated by a SOC model. The predictability of the large events is evaluated by means of the algorithms CN and M8, based on a quantitative analysis of the seismic flow within a delimited region, which allow for the prediction of the earthquakes with magnitude greater than a fixed threshold Mo. Considering the application of CN and M8 to the Italian territory, we show that, in agreement with the MS model, these algorithms make use of the information carried by small and moderate earthquakes, following the GR law, to predict the strong earthquakes, which are infrequent and often arbitrarily considered characteristic events inside the regions delimited for prediction purposes. Similarly, the application of the algorithm CN for the prediction of the largest events in the synthetic SOC sequences, indicates that a certain predictability can be attained, when the MS model is taken into account. These results suggest that the similarity between the seismic flow and the SOC sequences goes beyond the average features of scale-invariance. 
    In fact, while the GR law describes an average feature of seismicity, the CN algorithm is checking for the deviations from such trend, which may characterise the sequence of events before the

  12. Using sequence-specific chemical and structural properties of DNA to predict transcription factor binding sites.

    PubMed

    Bauer, Amy L; Hlavacek, William S; Unkefer, Pat J; Mu, Fangping

    2010-11-18

    An important step in understanding gene regulation is to identify the DNA binding sites recognized by each transcription factor (TF). Conventional approaches to prediction of TF binding sites involve the definition of consensus sequences or position-specific weight matrices and rely on statistical analysis of DNA sequences of known binding sites. Here, we present a method called SiteSleuth in which DNA structure prediction, computational chemistry, and machine learning are applied to develop models for TF binding sites. In this approach, binary classifiers are trained to discriminate between true and false binding sites based on the sequence-specific chemical and structural features of DNA. These features are determined via molecular dynamics calculations in which we consider each base in different local neighborhoods. For each of 54 TFs in Escherichia coli, for which at least five DNA binding sites are documented in RegulonDB, the TF binding sites and portions of the non-coding genome sequence are mapped to feature vectors and used in training. According to cross-validation analysis and a comparison of computational predictions against ChIP-chip data available for the TF Fis, SiteSleuth outperforms three conventional approaches: Match, MATRIX SEARCH, and the method of Berg and von Hippel. SiteSleuth also outperforms QPMEME, a method similar to SiteSleuth in that it involves a learning algorithm. The main advantage of SiteSleuth is a lower false positive rate.
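
    The conventional position-specific weight matrix approach that SiteSleuth is compared against can be illustrated with a minimal sketch. This is not SiteSleuth and not any of the named tools: the pseudocount, the uniform background frequency, and the function names are assumptions chosen for illustration.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds weight matrix from equal-length known binding sites."""
    length = len(sites[0])
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in BASES
        })
    return pwm


def score_site(pwm, candidate):
    """Sum the per-position log-odds scores for a candidate site."""
    return sum(column[base] for column, base in zip(pwm, candidate))
```

    A candidate resembling the known sites scores higher than an unrelated one; a threshold on this score is what separates predicted sites from background, and the choice of threshold drives the false positive rate.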

  13. Computer Aided Prediction of Biological Activity Spectra: Study of Correlation between Predicted and Observed Activities for Coumarin-4-Acetic Acids

    PubMed Central

    Basanagouda, M.; Jadhav, V. B.; Kulkarni, M. V.; Rao, R. Nagendra

    2011-01-01

    Coumarin-4-acetic acids have been synthesized from various phenols and citric acid under Pechmann cyclisation conditions. All the compounds have been evaluated for antiinflammatory and analgesic activity in acute models. Compounds have also been evaluated for their ulcerogenic potential. Using the prediction results of the computer program Prediction of Activity Spectra for Substances (PASS) and its Pharma Expert software, we have found a correlation between the observed and predicted antiinflammatory activity. PMID:22131629

  14. Discriminative Prediction of A-To-I RNA Editing Events from DNA Sequence

    PubMed Central

    Sun, Jiangming; Singh, Pratibha; Bagge, Annika; Valtat, Bérengère; Vikman, Petter; Spégel, Peter; Mulder, Hindrik

    2016-01-01

    RNA editing is a post-transcriptional alteration of RNA sequences that, via insertions, deletions or base substitutions, can affect protein structure as well as RNA and protein expression. Recently, it has been suggested that RNA editing may be more frequent than previously thought. A great impediment, however, to a deeper understanding of this process is the paramount sequencing effort that needs to be undertaken to identify RNA editing events. Here, we describe an in silico approach, based on machine learning, that ameliorates this problem. Using 41 nucleotide long DNA sequences, we show that novel A-to-I RNA editing events can be predicted from known A-to-I RNA editing events intra- and interspecies. The validity of the proposed method was verified in an independent experimental dataset. Using our approach, 203,202 putative A-to-I RNA editing events were predicted in the whole human genome. Out of these, 9% were previously reported. The remaining sites require further validation, e.g., by targeted deep sequencing. In conclusion, the approach described here is a useful tool to identify potential A-to-I RNA editing events without the requirement of extensive RNA sequencing. PMID:27764195
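
    Extracting 41-nucleotide DNA windows as classifier input might look like the sketch below. The record does not state how the windows are positioned; placing the candidate adenosine at the centre with 20 flanking bases on each side is an assumption of this sketch, as is the function name.

```python
def adenosine_windows(dna, flank=20):
    """Return (position, window) for every A that has full flanks on both sides.

    With flank=20 each window is 41 nucleotides long and centred on the A.
    """
    dna = dna.upper()
    windows = []
    for i, base in enumerate(dna):
        if base == "A" and i - flank >= 0 and i + flank < len(dna):
            windows.append((i, dna[i - flank:i + flank + 1]))
    return windows
```

    Adenosines closer than 20 bases to either end of the input are skipped rather than padded, so every returned window has the same length.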

  15. Neural network and SVM classifiers accurately predict lipid binding proteins, irrespective of sequence homology.

    PubMed

    Bakhtiarizadeh, Mohammad Reza; Moradi-Shahrbabak, Mohammad; Ebrahimi, Mansour; Ebrahimie, Esmaeil

    2014-09-07

    Due to the central roles of lipid binding proteins (LBPs) in many biological processes, sequence based identification of LBPs is of great interest. The major challenge is that LBPs are diverse in sequence, structure, and function which results in low accuracy of sequence homology based methods. Therefore, there is a need for developing alternative functional prediction methods irrespective of sequence similarity. To identify LBPs from non-LBPs, the performances of support vector machine (SVM) and neural network were compared in this study. Comprehensive protein features and various techniques were employed to create datasets. Five-fold cross-validation (CV) and independent evaluation (IE) tests were used to assess the validity of the two methods. The results indicated that SVM outperforms neural network. SVM achieved 89.28% (CV) and 89.55% (IE) overall accuracy in identification of LBPs from non-LBPs and 92.06% (CV) and 92.90% (IE) (in average) for classification of different LBPs classes. Increasing the number and the range of extracted protein features as well as optimization of the SVM parameters significantly increased the efficiency of LBPs class prediction in comparison to the only previous report in this field. Altogether, the results showed that the SVM algorithm can be run on broad, computationally calculated protein features and offers a promising tool in detection of LBPs classes. The proposed approach has the potential to integrate and improve the common sequence alignment based methods. Copyright © 2014 Elsevier Ltd. All rights reserved.
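
    The five-fold cross-validation used above can be sketched as a plain index split. This is a generic illustration, not the authors' pipeline: the function name is hypothetical, and the samples are split in their given order, so data should be shuffled beforehand if the order carries class information.

```python
def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k (train, test) index pairs of near-equal size."""
    indices = list(range(n_samples))
    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds
```

    Each sample appears in exactly one test fold, so averaging the per-fold accuracies gives a single cross-validation figure, while an independent evaluation set stays outside this split entirely.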

  16. The amino acid sequence around the active-site cysteine and histidine residues of stem bromelain

    PubMed Central

    Husain, S. S.; Lowe, G.

    1970-01-01

    Stem bromelain that had been irreversibly inhibited with 1,3-dibromo[2-14C]-acetone was reduced with sodium borohydride and carboxymethylated with iodoacetic acid. After digestion with trypsin and α-chymotrypsin three radioactive peptides were isolated chromatographically. The amino acid sequences around the cross-linked cysteine and histidine residues were determined and showed a high degree of homology with those around the active-site cysteine and histidine residues of papain and ficin. PMID:5420046

  17. A time series based sequence prediction algorithm to detect activities of daily living in smart home.

    PubMed

    Marufuzzaman, M; Reaz, M B I; Ali, M A M; Rahman, L F

    2015-01-01

    The goal of smart homes is to create an intelligent environment that adapts to the inhabitants' needs and assists people who need special care and safety in their daily lives. This can be achieved by collecting ADL (activities of daily living) data and analyzing it within existing computing elements. In this research, a very recent algorithm named sequence prediction via enhanced episode discovery (SPEED) is modified to include a time component in order to improve accuracy. The modified SPEED, or M-SPEED, is a sequence prediction algorithm that uses the time duration of appliances' ON-OFF states to decide the next state. M-SPEED discovered periodic episodes of inhabitant behavior, trained on the learned episodes, and made decisions based on the obtained knowledge. The results showed that M-SPEED achieves 96.8% prediction accuracy, which is better than other time prediction algorithms such as PUBS, ALZ with temporal rules, and the previous SPEED. Since human behavior shows natural temporal patterns, duration times can be used to predict future events more accurately. This inhabitant activity prediction system will certainly improve smart homes by ensuring safety and better care for elderly and handicapped people.

  18. PAIRpred: partner-specific prediction of interacting residues from sequence and structure.

    PubMed

    Minhas, Fayyaz ul Amir Afsar; Geiss, Brian J; Ben-Hur, Asa

    2014-07-01

    We present a novel partner-specific protein-protein interaction site prediction method called PAIRpred. Unlike most existing machine learning binding site prediction methods, PAIRpred uses information from both proteins in a protein complex to predict pairs of interacting residues from the two proteins. PAIRpred captures sequence and structure information about residue pairs through pairwise kernels that are used for training a support vector machine classifier. As a result, PAIRpred presents a more detailed model of protein binding, and offers state of the art accuracy in predicting binding sites at the protein level as well as inter-protein residue contacts at the complex level. We demonstrate PAIRpred's performance on Docking Benchmark 4.0 and recent CAPRI targets. We present a detailed performance analysis outlining the contribution of different sequence and structure features, together with a comparison to a variety of existing interface prediction techniques. We have also studied the impact of binding-associated conformational change on prediction accuracy and found PAIRpred to be more robust to such structural changes than existing schemes. As an illustration of the potential applications of PAIRpred, we provide a case study in which PAIRpred is used to analyze the nature and specificity of the interface in the interaction of human ISG15 protein with NS1 protein from influenza A virus. Python code for PAIRpred is available at http://combi.cs.colostate.edu/supplements/pairpred/. © 2013 Wiley Periodicals, Inc.

  19. Sequence-based prediction of protein protein interaction using a deep-learning algorithm.

    PubMed

    Sun, Tanlin; Zhou, Bo; Lai, Luhua; Pei, Jianfeng

    2017-05-25

    Protein-protein interactions (PPIs) are critical for many biological processes. It is therefore important to develop accurate high-throughput methods for identifying PPI to better understand protein function, disease occurrence, and therapy design. Though various computational methods for predicting PPI have been developed, their robustness for prediction with external datasets is unknown. Deep-learning algorithms have achieved successful results in diverse areas, but their effectiveness for PPI prediction has not been tested. We used a stacked autoencoder, a type of deep-learning algorithm, to study the sequence-based PPI prediction. The best model achieved an average accuracy of 97.19% with 10-fold cross-validation. The prediction accuracies for various external datasets ranged from 87.99% to 99.21%, which are superior to those achieved with previous methods. To our knowledge, this research is the first to apply a deep-learning algorithm to sequence-based PPI prediction, and the results demonstrate its potential in this field.

  20. Amino acid sequences of two nonspecific lipid-transfer proteins from germinated castor bean.

    PubMed

    Takishima, K; Watanabe, S; Yamada, M; Suga, T; Mamiya, G

    1988-11-01

    The amino acid sequences of two nonspecific lipid-transfer proteins (nsLTP), B and C, from germinated castor bean seeds have been determined. Both proteins consist of 92 residues, as for the nsLTP previously reported, and their calculated Mr values are 9847 and 9593 for nsLTP-B and nsLTP-C, respectively. The sequences of nsLTP-B and nsLTP-C, compared to the known sequence of nsLTP-A from the same source, are 68% and 35% similar, respectively. No variation was found at the positions of the cysteine residues, indicating that they might be involved in disulfide bridges.

  1. Predicting the Viscosity of Low VOC Vinyl Ester and Fatty Acid-Based Resins

    DTIC Science & Technology

    2005-12-01

    The sample was titrated with the perchloric acid / peracetic acid solution (Aldrich) until the indicator, 0.1% crystal violet in acetic acid (Aldrich...Predicting the Viscosity of Low VOC Vinyl Ester and Fatty Acid-Based Resins by John J. La Scala, Amutha Jeyarajasingam, Cherise Winston...Aberdeen Proving Ground, MD 21005-5069 ARL-TR-3681 December 2005 Predicting the Viscosity of Low VOC Vinyl Ester and Fatty Acid-Based

  2. Plasma long-chain free fatty acids predict mammalian longevity.

    PubMed

    Jové, Mariona; Naudí, Alba; Aledo, Juan Carlos; Cabré, Rosanna; Ayala, Victoria; Portero-Otin, Manuel; Barja, Gustavo; Pamplona, Reinald

    2013-11-28

    Membrane lipid composition is an important correlate of the rate of aging of animals and, therefore, the determination of their longevity. In the present work, the use of high-throughput technologies allowed us to determine the plasma lipidomic profile of 11 mammalian species ranging in maximum longevity from 3.5 to 120 years. The non-targeted approach revealed a species-specific lipidomic profile that accurately predicts animal longevity. The regression analysis between lipid species and longevity demonstrated that the longer the longevity of a species, the lower are its plasma long-chain free fatty acid (LC-FFA) concentrations, peroxidizability index, and content of lipid peroxidation-derived products. The inverse association between longevity and LC-FFA persisted after correction for body mass and phylogenetic interdependence. These results indicate that the lipidomic signature is an optimized feature associated with animal longevity, with LC-FFA emerging as a potential biomarker of longevity.

  3. A classification of glycosyl hydrolases based on amino acid sequence similarities.

    PubMed Central

    Henrissat, B

    1991-01-01

    The amino acid sequences of 301 glycosyl hydrolases and related enzymes have been compared. A total of 291 sequences corresponding to 39 EC entries could be classified into 35 families. Only ten sequences (less than 5% of the sample) could not be assigned to any family. With the sequences available for this analysis, 18 families were found to be monospecific (containing only one EC number) and 17 were found to be polyspecific (containing at least two EC numbers). Implications on the folding characteristics and mechanism of action of these enzymes and on the evolution of carbohydrate metabolism are discussed. With the steady increase in sequence and structural data, it is suggested that the enzyme classification system should perhaps be revised. PMID:1747104

  4. In silico comparative analysis of DNA and amino acid sequences for prion protein gene.

    PubMed

    Kim, Y; Lee, J; Lee, C

    2008-01-01

    Genetic variability might contribute to species specificity of prion diseases in various organisms. In this study, structures of the prion protein gene (PRNP) and its amino acids were compared among species for which sequence data were available. Comparisons of PRNP DNA sequences among 12 species including human, chimpanzee, monkey, bovine, ovine, dog, mouse, rat, wallaby, opossum, chicken and zebrafish allowed us to identify candidate regulatory regions in intron 1 and the 3'-untranslated region (UTR) in addition to the coding region. Highly conserved putative binding sites for transcription factors, such as heat shock factor 2 (HSF2) and myocyte enhancer factor 2 (MEF2), were discovered in intron 1. In the 3'-UTR, the functional sequence (ATTAAA) for nucleus-specific polyadenylation was found in all the analysed species. The functional sequence (TTTTTAT) for maturation-specific polyadenylation was observed identically only in ovine, with one or two nucleotide mismatches in the other species. A comparison of the amino acid sequences in 53 species revealed a large sequence identity. In particular, the octapeptide repeat region was observed in all the species except frog and zebrafish. Functional changes and susceptibility to prion diseases with various isoforms of prion protein could be caused by numeric variability and conformational changes discovered in the repeat sequences.
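
    The search for polyadenylation signals described above (exact ATTAAA matches, and TTTTTAT with one or two mismatches) can be illustrated with a small sketch. This is an assumption about how such a scan might be done, not the authors' procedure; the function name `find_motif` and the toy sequence are made up for illustration.

```python
def find_motif(sequence, motif, max_mismatches=0):
    """Return (start, mismatches) for every window within max_mismatches of motif."""
    sequence, motif = sequence.upper(), motif.upper()
    hits = []
    for start in range(len(sequence) - len(motif) + 1):
        window = sequence[start:start + len(motif)]
        # Hamming distance between the window and the motif
        mismatches = sum(a != b for a, b in zip(window, motif))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits

# Toy 3'-UTR fragment, invented for illustration (not a real PRNP sequence)
utr = "GGCATTAAACCGTTTTAATGG"
exact_hits = find_motif(utr, "ATTAAA")
near_hits = find_motif(utr, "TTTTTAT", max_mismatches=1)
```

    Positions are 0-based. Raising `max_mismatches` to 2 would mirror the "one or two nucleotide mismatches" tolerance reported for the non-ovine species.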

  5. Complete amino acid sequence of the N-terminal extension of calf skin type III procollagen.

    PubMed Central

    Brandt, A; Glanville, R W; Hörlein, D; Bruckner, P; Timpl, R; Fietzek, P P; Kühn, K

    1984-01-01

    The N-terminal extension peptide of type III procollagen, isolated from foetal-calf skin, contains 130 amino acid residues. To determine its amino acid sequence, the peptide was reduced and carboxymethylated or aminoethylated and fragmented with trypsin, Staphylococcus aureus V8 proteinase and bacterial collagenase. Pyroglutamate aminopeptidase was used to deblock the N-terminal collagenase fragment to enable amino acid sequencing. The type III collagen extension peptide is homologous to that of the alpha 1 chain of type I procollagen with respect to a three-domain structure. The N-terminal 79 amino acids, which contain ten of the 12 cysteine residues, form a compact globular domain. The next 39 amino acids are in a collagenous triplet sequence (Gly-Xaa-Yaa)n with a high hydroxyproline content. Finally, another short non-collagenous domain of 12 amino acids ends at the cleavage site for procollagen aminopeptidase, which cleaves a proline-glutamine bond. In contrast with type I procollagen, the type III procollagen extension peptides contain interchain disulphide bridges located at the C-terminus of the triple-helical domain. PMID:6331392

  6. Detection of multiple, novel reverse transcriptase coding sequences in human nucleic acids: relation to primate retroviruses

    SciTech Connect

    Shih, A.; Misra, R.; Rush, M.G.

    1989-01-01

    A variety of chemically synthesized oligonucleotides designed on the basis of amino acid and/or nucleotide sequence data were used to detect a large number of novel reverse transcriptase coding sequences in human and mouse DNAs. Procedures involving Southern blotting, library screening, and the polymerase chain reaction were all used to detect such sequences; the polymerase chain reaction was the most rapid and productive approach. In the polymerase chain reaction, oligonucleotide mixtures based on consensus sequence homologies to reverse transcriptase coding sequences and unique oligonucleotides containing perfect homology to the coding sequences of human T-cell leukemia virus types I and II were both effective in amplifying reverse transcriptase-related DNA. It is shown that human DNA contains a wide spectrum of retrovirus-related reverse transcriptase coding sequences, including some that are clearly related to human T-cell leukemia virus types I and II, some that are related to the L-1 family of long interspersed nucleotide sequences, and others that are related to previously described human endogenous proviral DNAs. In addition, human T-cell leukemia virus type I-related sequences appear to be transcribed in both normal human T cells and in a cell line derived from a human teratocarcinoma.

  7. Predicting the tolerated sequences for proteins and protein interfaces using RosettaBackrub flexible backbone design.

    PubMed

    Smith, Colin A; Kortemme, Tanja

    2011-01-01

    Predicting the set of sequences that are tolerated by a protein or protein interface, while maintaining a desired function, is useful for characterizing protein interaction specificity and for computationally designing sequence libraries to engineer proteins with new functions. Here we provide a general method, a detailed set of protocols, and several benchmarks and analyses for estimating tolerated sequences using flexible backbone protein design implemented in the Rosetta molecular modeling software suite. The input to the method is at least one experimentally determined three-dimensional protein structure or high-quality model. The starting structure(s) are expanded or refined into a conformational ensemble using Monte Carlo simulations consisting of backrub backbone and side chain moves in Rosetta. The method then uses a combination of simulated annealing and genetic algorithm optimization methods to enrich for low-energy sequences for the individual members of the ensemble. To emphasize certain functional requirements (e.g. forming a binding interface), interactions between and within parts of the structure (e.g. domains) can be reweighted in the scoring function. Results from each backbone structure are merged together to create a single estimate for the tolerated sequence space. We provide an extensive description of the protocol and its parameters, all source code, example analysis scripts and three tests applying this method to finding sequences predicted to stabilize proteins or protein interfaces. The generality of this method makes many other applications possible, for example stabilizing interactions with small molecules, DNA, or RNA. Through the use of within-domain reweighting and/or multistate design, it may also be possible to use this method to find sequences that stabilize particular protein conformations or binding interactions over others.

  8. Prenatal Features Predictive of Robin Sequence Identified by Fetal Magnetic Resonance Imaging.

    PubMed

    Rogers-Vizena, Carolyn R; Mulliken, John B; Daniels, Kimberly M; Estroff, Judy A

    2016-06-01

    Prenatal magnetic resonance imaging is increasingly used to detect congenital anomalies. The purpose of this study was to determine whether prenatal magnetic resonance imaging accurately characterizes features predictive of postnatal Robin sequence so that possible airway compromise and feeding difficulty at birth can be anticipated. The authors retrospectively identified pregnant women who underwent fetal magnetic resonance imaging between 2002 and 2014 and were found to be carrying a fetus with micrognathia. Micrognathia was subjectively categorized as minor, moderate, or severe. Pregnancy outcome was determined as follows: intrauterine fetal demise, elective termination, early neonatal death, or viable infant. Postnatal findings of micrognathia, Robin sequence, and associated anomalies were compared to prenatal findings. Micrognathia was identified in 123 fetuses. Fifty-two pregnancies (42.3 percent) produced a viable infant. The remainder resulted in termination in the fetal period or death shortly after birth resulting from unrelated causes. For infants who lived, prenatal micrognathia was categorized as minor (55.1 percent), moderate (30.6 percent), or severe (14.3 percent). Forty-two percent of neonates with minor prenatal micrognathia had postnatal micrognathia; however, only 11.1 percent had Robin sequence. All neonates with moderate fetal micrognathia had postnatal micrognathia, and the majority had Robin sequence (86.7 percent). All newborns with severe micrognathia had Robin sequence and all prenatally diagnosed with glossoptosis had Robin sequence. Prenatal findings of moderate or severe micrognathia or glossoptosis are predictive of postnatal Robin sequence, thus expediting appropriate perinatal management of airway and feeding problems. Diagnostic, IV.

  9. The Intolerance of Regulatory Sequence to Genetic Variation Predicts Gene Dosage Sensitivity.

    PubMed

    Petrovski, Slavé; Gussow, Ayal B; Wang, Quanli; Halvorsen, Matt; Han, Yujun; Weir, William H; Allen, Andrew S; Goldstein, David B

    2015-09-01

    Noncoding sequence contains pathogenic mutations. Yet, compared with mutations in protein-coding sequence, pathogenic regulatory mutations are notoriously difficult to recognize. Most fundamentally, we are not yet adept at recognizing the sequence stretches in the human genome that are most important in regulating the expression of genes. For this reason, it is difficult to apply to the regulatory regions the same kinds of analytical paradigms that are being successfully applied to identify mutations among protein-coding regions that influence risk. To determine whether dosage sensitive genes have distinct patterns among their noncoding sequence, we present two primary approaches that focus solely on a gene's proximal noncoding regulatory sequence. The first approach is a regulatory sequence analogue of the recently introduced residual variation intolerance score (RVIS), termed noncoding RVIS, or ncRVIS. The ncRVIS compares observed and predicted levels of standing variation in the regulatory sequence of human genes. The second approach, termed ncGERP, reflects the phylogenetic conservation of a gene's regulatory sequence using GERP++. We assess how well these two approaches correlate with four gene lists that use different ways to identify genes known or likely to cause disease through changes in expression: 1) genes that are known to cause disease through haploinsufficiency, 2) genes curated as dosage sensitive in ClinGen's Genome Dosage Map, 3) genes judged likely to be under purifying selection for mutations that change expression levels because they are statistically depleted of loss-of-function variants in the general population, and 4) genes judged unlikely to cause disease based on the presence of copy number variants in the general population. We find that both noncoding scores are highly predictive of dosage sensitivity using any of these criteria. In a similar way to ncGERP, we assess two ensemble-based predictors of regional noncoding importance, nc

  10. The Intolerance of Regulatory Sequence to Genetic Variation Predicts Gene Dosage Sensitivity

    PubMed Central

    Wang, Quanli; Halvorsen, Matt; Han, Yujun; Weir, William H.; Allen, Andrew S.; Goldstein, David B.

    2015-01-01

    Noncoding sequence contains pathogenic mutations. Yet, compared with mutations in protein-coding sequence, pathogenic regulatory mutations are notoriously difficult to recognize. Most fundamentally, we are not yet adept at recognizing the sequence stretches in the human genome that are most important in regulating the expression of genes. For this reason, it is difficult to apply to the regulatory regions the same kinds of analytical paradigms that are being successfully applied to identify mutations among protein-coding regions that influence risk. To determine whether dosage sensitive genes have distinct patterns among their noncoding sequence, we present two primary approaches that focus solely on a gene’s proximal noncoding regulatory sequence. The first approach is a regulatory sequence analogue of the recently introduced residual variation intolerance score (RVIS), termed noncoding RVIS, or ncRVIS. The ncRVIS compares observed and predicted levels of standing variation in the regulatory sequence of human genes. The second approach, termed ncGERP, reflects the phylogenetic conservation of a gene’s regulatory sequence using GERP++. We assess how well these two approaches correlate with four gene lists that use different ways to identify genes known or likely to cause disease through changes in expression: 1) genes that are known to cause disease through haploinsufficiency, 2) genes curated as dosage sensitive in ClinGen’s Genome Dosage Map, 3) genes judged likely to be under purifying selection for mutations that change expression levels because they are statistically depleted of loss-of-function variants in the general population, and 4) genes judged unlikely to cause disease based on the presence of copy number variants in the general population. We find that both noncoding scores are highly predictive of dosage sensitivity using any of these criteria. In a similar way to ncGERP, we assess two ensemble-based predictors of regional noncoding importance
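
    The idea of comparing observed and predicted levels of standing variation can be pictured as a regression residual. The sketch below is an illustrative assumption only: it regresses a count of common variants on the total variant count per gene and reports the raw residual, whereas the abstract does not specify the covariates or any standardization used for ncRVIS. All names and numbers are made up.

```python
def residual_scores(total_variants, common_variants):
    """Least-squares residuals of common_variants regressed on total_variants."""
    n = len(total_variants)
    mean_x = sum(total_variants) / n
    mean_y = sum(common_variants) / n
    sxx = sum((x - mean_x) ** 2 for x in total_variants)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(total_variants, common_variants))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # A negative residual means fewer common variants than predicted,
    # i.e. a region that looks more intolerant to variation.
    return [y - (intercept + slope * x)
            for x, y in zip(total_variants, common_variants)]

# Invented counts for four hypothetical genes
total = [10, 20, 30, 40]
common = [2, 5, 5, 8]
scores = residual_scores(total, common)
```

    Ranking genes by such a residual puts those with less variation than expected at the low end, which is the intuition behind an intolerance score.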

  11. 37 CFR 1.824 - Form and format for nucleotide and/or amino acid sequence submissions in computer readable form.

    Code of Federal Regulations, 2013 CFR

    2013-07-01

    ... nucleotide and/or amino acid sequence submissions in computer readable form. 1.824 Section 1.824 Patents... And/or Amino Acid Sequences § 1.824 Form and format for nucleotide and/or amino acid sequence... readable form may be created by any means, such as word processors, nucleotide/amino acid sequence editors...

  12. 37 CFR 1.824 - Form and format for nucleotide and/or amino acid sequence submissions in computer readable form.

    Code of Federal Regulations, 2012 CFR

    2012-07-01

    ... nucleotide and/or amino acid sequence submissions in computer readable form. 1.824 Section 1.824 Patents... And/or Amino Acid Sequences § 1.824 Form and format for nucleotide and/or amino acid sequence... readable form may be created by any means, such as word processors, nucleotide/amino acid sequence editors...

  13. 37 CFR 1.824 - Form and format for nucleotide and/or amino acid sequence submissions in computer readable form.

    Code of Federal Regulations, 2014 CFR

    2014-07-01

    ... nucleotide and/or amino acid sequence submissions in computer readable form. 1.824 Section 1.824 Patents... And/or Amino Acid Sequences § 1.824 Form and format for nucleotide and/or amino acid sequence... readable form may be created by any means, such as word processors, nucleotide/amino acid sequence editors...

  14. 37 CFR 1.824 - Form and format for nucleotide and/or amino acid sequence submissions in computer readable form.

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... nucleotide and/or amino acid sequence submissions in computer readable form. 1.824 Section 1.824 Patents... And/or Amino Acid Sequences § 1.824 Form and format for nucleotide and/or amino acid sequence... readable form may be created by any means, such as word processors, nucleotide/amino acid sequence editors...

  15. 37 CFR 1.824 - Form and format for nucleotide and/or amino acid sequence submissions in computer readable form.

    Code of Federal Regulations, 2011 CFR

    2011-07-01

    ... nucleotide and/or amino acid sequence submissions in computer readable form. 1.824 Section 1.824 Patents... And/or Amino Acid Sequences § 1.824 Form and format for nucleotide and/or amino acid sequence... readable form may be created by any means, such as word processors, nucleotide/amino acid sequence editors...

  16. Noise occlusion in discrete tone sequences as a tool towards auditory predictive processing?

    PubMed

    Bendixen, Alexandra; Duwe, Susann; Reiche, Martin

    2015-11-11

    The notion of predictive coding is a common feature of many theories of auditory information processing. Experimental demonstrations of predictive auditory processing often rest on omitting predictable input in order to uncover the prediction made by the brain. Findings show that auditory cortical activity elicited by the omission of a predictable tone resembles the activity elicited by the actual tone. Here we attempted to extend this approach towards using noises instead of omissions in order to capture a more prevalent case of degraded sensory input. By applying a subtraction approach to remove ERP effects of the noise itself, auditory cortical activity elicited "behind" the noise was uncovered. We hypothesized that ERPs elicited behind noise stimuli covering predictable tones should be more similar to ERPs elicited by the actual tones than when the same comparison is made for unpredictable tones. ERP results during passive listening partly confirm this hypothesis, but also point towards some methodological caveats in this particular approach towards studying neural correlates of predictive auditory processing due to contributions from predictability-unrelated factors. A follow-up active listening condition indicated that participants were not more likely to perceive the tone sequence as continuous when a predictable tone was covered with noise than when this pertained to an unpredictable tone. Overall, the noise-based paradigm in its present form was not shown to be successful in revealing predictive processing in perceptual judgments or early neural correlates of sound processing. We discuss these findings in the contexts of predictive processing and illusory auditory continuity. This article is part of a Special Issue entitled SI: Prediction and Attention.

  17. Complete amino acid sequence of branched-chain amino acid aminotransferase (transaminase B) of Salmonella typhimurium, identification of the coenzyme-binding site and sequence comparison analysis

    SciTech Connect

    Feild, M.J.

    1988-01-01

    The complete amino acid sequence of the subunit of branched-chain amino acid aminotransferase of Salmonella typhimurium was determined by automated Edman degradation of peptide fragments generated by chemical and enzymatic digestion of S-carboxymethylated and S-pyridylethylated transaminase B. Peptide fragments of transaminase B were generated by treatment of the enzyme with trypsin, Staphylococcus aureus V8 protease, endoproteinase Lys-C, and cyanogen bromide. Protocols were developed for separation of the peptide fragments by reverse-phase high performance liquid chromatography (HPLC), ion-exchange HPLC, and SDS-urea gel electrophoresis. The enzyme subunit contains 308 amino acid residues and has a molecular weight of 33,920 daltons. The coenzyme-binding site was determined by treatment of the enzyme, containing bound pyridoxal 5-phosphate, with tritiated sodium borohydride prior to trypsin digestion. Monitoring radioactivity incorporation and comparing peptide maps with an apoenzyme tryptic digest allowed identification of the pyridoxylated peptide, which was isolated by reverse-phase HPLC and sequenced. The coenzyme-binding site is a lysyl residue at position 159. Some peptides were further characterized by fast atom bombardment mass spectrometry.

  18. Transmembrane helix prediction using amino acid property features and latent semantic analysis.

    PubMed

    Ganapathiraju, Madhavi; Balakrishnan, N; Reddy, Raj; Klein-Seetharaman, Judith

    2008-01-01

    Prediction of transmembrane (TM) helices by statistical methods suffers from lack of sufficient training data. Current best methods use hundreds or even thousands of free parameters in their models which are tuned to fit the little data available for training. Further, they are often restricted to the generally accepted topology "cytoplasmic-transmembrane-extracellular" and cannot adapt to membrane proteins that do not conform to this topology. Recent crystal structures of channel proteins have revealed novel architectures showing that the above topology may not be as universal as previously believed. Thus, there is a need for methods that can better predict TM helices even in novel topologies and families. Here, we describe a new method "TMpro" to predict TM helices with high accuracy. To avoid overfitting to existing topologies, we have collapsed cytoplasmic and extracellular labels to a single state, non-TM. TMpro is a binary classifier which predicts TM or non-TM using multiple amino acid properties (charge, polarity, aromaticity, size and electronic properties) as features. The features are extracted from sequence information by applying the framework used for latent semantic analysis of text documents and are input to neural networks that learn the distinction between TM and non-TM segments. The model uses only 25 free parameters. In benchmark analysis TMpro achieves 95% segment F-score corresponding to 50% reduction in error rate compared to the best methods not requiring an evolutionary profile of a protein to be known. Performance is also improved when applied to more recent and larger high resolution datasets PDBTM and MPtopo. TMpro predictions in membrane proteins with unusual or disputed TM structure (K+ channel, aquaporin and HIV envelope glycoprotein) are discussed. 
    TMpro uses very few free parameters in modeling TM segments as opposed to the very large number of free parameters used in state-of-the-art membrane prediction methods, yet achieves very

  19. Prediction of atorvastatin plasmatic concentrations in healthy volunteers using integrated pharmacogenetics sequencing.

    PubMed

    Cruz-Correa, Omar Fernando; León-Cachón, Rafael Baltazar Reyes; Barrera-Saldaña, Hugo Alberto; Soberón, Xavier

    2017-01-01

    The aim was to use variants found by next-generation sequencing to predict atorvastatin plasmatic concentration profiles (AUC) in healthy volunteers. A total of 60 healthy Mexican volunteers were enrolled in this study. We used variants with a predicted functional effect across 20 genes involved in atorvastatin metabolism to construct a regression model, using a support vector approach with a radial basis function kernel, to predict AUC, and then refined the model to explain a greater proportion of the variance. The final support vector regression model using 60 variants (including six novel variants) explained 94.52% of the variance in atorvastatin AUC. An integrated analysis of several genes known to intervene in the different steps of metabolism is required to predict atorvastatin's AUC.

  20. Structure-templated predictions of novel protein interactions from sequence information.

    PubMed

    Betel, Doron; Breitkreuz, Kevin E; Isserlin, Ruth; Dewar-Darch, Danielle; Tyers, Mike; Hogue, Christopher W V

    2007-09-01

    The multitude of functions performed in the cell are largely controlled by a set of carefully orchestrated protein interactions often facilitated by specific binding of conserved domains in the interacting proteins. Interacting domains commonly exhibit distinct binding specificity to short and conserved recognition peptides called binding profiles. Although many conserved domains are known in nature, only a few have well-characterized binding profiles. Here, we describe a novel predictive method known as domain-motif interactions from structural topology (D-MIST) for elucidating the binding profiles of interacting domains. A set of domains and their corresponding binding profiles were derived from extant protein structures and protein interaction data and then used to predict novel protein interactions in yeast. A number of the predicted interactions were verified experimentally, including new interactions of the mitotic exit network, RNA polymerases, nucleotide metabolism enzymes, and the chaperone complex. These results demonstrate that new protein interactions can be predicted exclusively from sequence information.

  1. Structure-Templated Predictions of Novel Protein Interactions from Sequence Information

    PubMed Central

    Betel, Doron; Breitkreuz, Kevin E; Isserlin, Ruth; Dewar-Darch, Danielle; Tyers, Mike; Hogue, Christopher W. V

    2007-01-01

    The multitude of functions performed in the cell are largely controlled by a set of carefully orchestrated protein interactions often facilitated by specific binding of conserved domains in the interacting proteins. Interacting domains commonly exhibit distinct binding specificity to short and conserved recognition peptides called binding profiles. Although many conserved domains are known in nature, only a few have well-characterized binding profiles. Here, we describe a novel predictive method known as domain–motif interactions from structural topology (D-MIST) for elucidating the binding profiles of interacting domains. A set of domains and their corresponding binding profiles were derived from extant protein structures and protein interaction data and then used to predict novel protein interactions in yeast. A number of the predicted interactions were verified experimentally, including new interactions of the mitotic exit network, RNA polymerases, nucleotide metabolism enzymes, and the chaperone complex. These results demonstrate that new protein interactions can be predicted exclusively from sequence information. PMID:17892321

  2. The Motif Tool Assessment Platform (MTAP) for sequence-based transcription factor binding site prediction tools.

    PubMed

    Quest, Daniel; Ali, Hesham

    2010-01-01

    Predicting transcription factor binding sites (TFBS) from sequence is one of the most challenging problems in computational biology. The development of (semi-)automated computer-assisted prediction methods is needed to find TFBS over an entire genome, which is a first step in reconstructing mechanisms that control gene activity. Bioinformatics journals continue to publish diverse methods for predicting TFBS on a monthly basis. To help practitioners decide which method to use for a particular TFBS, we provide a platform to assess the quality and applicability of the available methods. Assessment tools allow researchers to determine how methods can be expected to perform on specific organisms or on specific transcription factor families. This chapter introduces the TFBS detection problem and reviews current strategies for evaluating algorithm effectiveness. In this chapter, a novel and robust assessment tool, the Motif Tool Assessment Platform (MTAP), is introduced and discussed.

  3. Identification of individuals by trait prediction using whole-genome sequencing data.

    PubMed

    Lippert, Christoph; Sabatini, Riccardo; Maher, M Cyrus; Kang, Eun Yong; Lee, Seunghak; Arikan, Okan; Harley, Alena; Bernal, Axel; Garst, Peter; Lavrenko, Victor; Yocum, Ken; Wong, Theodore; Zhu, Mingfu; Yang, Wen-Yun; Chang, Chris; Lu, Tim; Lee, Charlie W H; Hicks, Barry; Ramakrishnan, Smriti; Tang, Haibao; Xie, Chao; Piper, Jason; Brewerton, Suzanne; Turpaz, Yaron; Telenti, Amalio; Roby, Rhonda K; Och, Franz J; Venter, J Craig

    2017-09-05

    Prediction of human physical traits and demographic information from genomic data challenges privacy and data deidentification in personalized medicine. To explore the current capabilities of phenotype-based genomic identification, we applied whole-genome sequencing, detailed phenotyping, and statistical modeling to predict biometric traits in a cohort of 1,061 participants of diverse ancestry. Individually, for a large fraction of the traits, their predictive accuracy beyond ancestry and demographic information is limited. However, we have developed a maximum entropy algorithm that integrates multiple predictions to determine which genomic samples and phenotype measurements originate from the same person. Using this algorithm, we have reidentified an average of >8 of 10 held-out individuals in an ethnically mixed cohort and an average of 5 of either 10 African Americans or 10 Europeans. This work challenges current conceptions of personal privacy and may have far-reaching ethical and legal implications.

  4. The primary structure of E. coli RNA polymerase. Nucleotide sequence of the rpoC gene and amino acid sequence of the beta'-subunit.

    PubMed

    Ovchinnikov YuA; Monastyrskaya, G S; Gubanov, V V; Guryev, S O; Salomatina, I S; Shuvaeva, T M; Lipkin, V M; Sverdlov, E D

    1982-07-10

    The primary structure of the E. coli rpoC gene (5321 base pairs) coding for the beta'-subunit of RNA polymerase, as well as that of its adjacent segment, has been determined. The structure analysis of the peptides obtained by cleavage of the protein with cyanogen bromide and trypsin has confirmed the amino acid sequence of the beta'-subunit deduced from the nucleotide sequence analysis. The beta'-subunit of E. coli RNA polymerase contains 1407 amino acid residues. Its translation is initiated by codon GUG and terminated by codon TAA. The sequence following the terminating codon was found to be strikingly homologous to known sequences of rho-independent terminators.

  5. Sequence variation divides Equine rhinitis B virus into three distinct phylogenetic groups that correlate with serotype and acid stability.

    PubMed

    Black, Wesley D; Hartley, Carol A; Ficorilli, Nino P; Studdert, Michael J

    2005-08-01

    Equine rhinitis B virus (ERBV), genus Erbovirus, family Picornaviridae, occurs as two serotypes, ERBV1 and ERBV2, and the few isolates previously tested were acid labile. Of 24 ERBV1 isolates tested in the studies reported here, 19 were acid labile and five were acid stable. The two available ERBV2 isolates, as expected, were acid labile. Nucleotide sequences of the P1 region encoding the capsid proteins VP1, VP2, VP3 and VP4 were determined for five acid-labile and three acid-stable ERBV1 isolates and one acid-labile ERBV2 isolate. The sequences were aligned with the published sequences of the prototype acid-labile ERBV1.1436/71 and the prototype ERBV2.313/75. The three acid-stable ERBV1 were closely related in a phylogenetic group that was distinct from the group of six acid-labile ERBV1, which were also closely related to each other. The two acid-labile ERBV2 formed a third distinct group. One acid-labile ERBV1 had a chimeric acid-labile/acid-stable ERBV1 P1 sequence, presumably because of a recombination event within VP2 and this was supported by SimPlot analysis. ERBV1 rabbit antiserum neutralized acid-stable and acid-labile ERBV1 isolates similarly. Accordingly, three distinct phylogenetic groups of erboviruses exist that are consistent with serotype and acid stability phenotypes.

  6. Functional Divergence in the Genus Oenococcus as Predicted by Genome Sequencing of the Newly-Described Species, Oenococcus kitaharae

    PubMed Central

    Borneman, Anthony R.; McCarthy, Jane M.; Chambers, Paul J.; Bartowsky, Eveline J.

    2012-01-01

    Oenococcus kitaharae is only the second member of the genus Oenococcus to be identified and is the closest relative of the industrially important wine bacterium Oenococcus oeni. To provide insight into this new species, the genome of the type strain of O. kitaharae, DSM 17330, was sequenced. Comparison of the sequenced genomes of both species shows that the genome of O. kitaharae DSM 17330 contains many genes with predicted functions in cellular defence (bacteriocins, antimicrobials, restriction-modification systems and a CRISPR locus) that are lacking in O. oeni. The two genomes also appear to differentially encode several metabolic pathways associated with amino acid biosynthesis and carbohydrate utilization, which have direct phenotypic consequences. This would indicate that the two species have evolved different survival techniques to suit their particular environmental niches. O. oeni has adapted to survive in the harsh, but predictable, environment of wine that provides very few competitive species. However, O. kitaharae appears to have adapted to a growth environment in which biological competition provides a significant selective pressure by accumulating biological defence molecules, such as bacteriocins and restriction-modification systems, throughout its genome. PMID:22235313

  7. Analyses of mitochondrial amino acid sequence datasets support the proposal that specimens of Hypodontus macropi from three species of macropodid hosts represent distinct species

    PubMed Central

    2013-01-01

    Background Hypodontus macropi is a common intestinal nematode of a range of kangaroos and wallabies (macropodid marsupials). Based on previous multilocus enzyme electrophoresis (MEE) and nuclear ribosomal DNA sequence data sets, H. macropi has been proposed to be a complex of species. To test this proposal using independent molecular data, we sequenced the whole mitochondrial (mt) genomes of individuals of H. macropi from three different species of hosts (Macropus robustus robustus, Thylogale billardierii and Macropus [Wallabia] bicolor) as well as that of Macropicola ocydromi (a related nematode), and undertook a comparative analysis of the amino acid sequence datasets derived from these genomes. Results The mt genomes sequenced by next-generation (454) technology from H. macropi from the three host species varied from 13,634 bp to 13,699 bp in size. Pairwise comparisons of the amino acid sequences predicted from these three mt genomes revealed differences of 5.8% to 18%. Phylogenetic analysis of the amino acid sequence data sets using Bayesian Inference (BI) showed that H. macropi from the three different host species formed distinct, well-supported clades. In addition, sliding window analysis of the mt genomes defined variable regions for future population genetic studies of H. macropi in different macropodid hosts and geographical regions around Australia. Conclusions The present analyses of inferred mt protein sequence datasets clearly supported the hypothesis that H. macropi from M. robustus robustus, M. bicolor and T. billardierii represent distinct species. PMID:24261823
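
    The pairwise comparison described in this record can be illustrated with a minimal sketch. It assumes the sequences are already aligned to equal length and that gap positions are skipped; the function names and the toy sequences are hypothetical and are not data from the study.

```python
from itertools import combinations


def percent_difference(seq_a: str, seq_b: str) -> float:
    """Percentage of compared alignment columns at which two sequences differ.

    Columns where either sequence has a gap ('-') are skipped (an assumption).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * differing / compared


def pairwise_differences(aligned):
    """Map each pair of sequence names to its percent difference."""
    return {
        (name_1, name_2): percent_difference(aligned[name_1], aligned[name_2])
        for name_1, name_2 in combinations(aligned, 2)
    }


# Toy aligned fragments, invented for illustration only.
toy_alignment = {
    "host_A": "MKLVFTAGLL",
    "host_B": "MKLVFSAGLL",
    "host_C": "MRLVYSAG-L",
}
differences = pairwise_differences(toy_alignment)
for pair, value in differences.items():
    print(pair, round(value, 1))
```

    Skipping gapped columns is only one convention; counting a gap as a difference would give larger values for the same alignment.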

  8. The amino acid sequence of cytochromes c-551 from three species of Pseudomonas

    PubMed Central

    Ambler, R. P.; Wynn, Margaret

    1973-01-01

    The amino acid sequences of the cytochromes c-551 from three species of Pseudomonas have been determined. Each resembles the protein from Pseudomonas strain P6009 (now known to be Pseudomonas aeruginosa, not Pseudomonas fluorescens) in containing 82 amino acids in a single peptide chain, with a haem group covalently attached to cysteine residues 12 and 15. In all four sequences 43 residues are identical. Although by bacteriological criteria the organisms are closely related, the differences between pairs of sequences range from 22% to 39%. These values should be compared with the differences in the sequence of mitochondrial cytochrome c between mammals and amphibians (about 18%) or between mammals and insects (about 33%). Detailed evidence for the amino acid sequences of the proteins has been deposited as Supplementary Publication SUP 50015 at the National Lending Library for Science and Technology, Boston Spa, Yorks. LS23 7BQ, U.K., from whom copies can be obtained on the terms indicated in Biochem. J. (1973), 131, 5. PMID:4352718

  9. Draft Genome Sequence of Sorghum Grain Mold Fungus Epicoccum sorghinum, a Producer of Tenuazonic Acid

    PubMed Central

    Oliveira, Rodrigo C.; Davenport, Karen W.; Hovde, Blake; Silva, Danielle; Chain, Patrick S. G.; Correa, Benedito

    2017-01-01

    ABSTRACT The facultative plant pathogen Epicoccum sorghinum is associated with grain mold of sorghum and produces the mycotoxin tenuazonic acid. This fungus can have serious economic impact on sorghum production. Here, we report the draft genome sequence of E. sorghinum (USPMTOX48). PMID:28126937

  10. Snake venom. The amino acid sequence of protein A from Dendroaspis polylepis polylepis (black mamba) venom.

    PubMed

    Joubert, F J; Strydom, D J

    1980-12-01

    Protein A from Dendroaspis polylepis polylepis venom comprises 81 amino acids, including ten half-cystine residues. The complete primary structures of protein A and its variant A' were elucidated. The sequences of proteins A and A', which differ in a single position, show no homology with various neurotoxins and non-neurotoxic proteins and represent a new type of elapid venom protein.

  11. Nucleotide sequence and spatial expression pattern of a drought- and abscisic Acid-induced gene of tomato.

    PubMed

    Plant, A L; Cohen, A; Moses, M S; Bray, E A

    1991-11-01

    The nucleotide sequence of le16, a tomato (Lycopersicon esculentum Mill.) gene induced by drought stress and regulated by abscisic acid specifically in aerial vegetative tissue, is presented. The single open reading frame contained within the gene has the capacity to encode a polypeptide of 12.7 kilodaltons and is interrupted by a small intron. The predicted polypeptide is rich in leucine, glycine, and alanine and has an isoelectric point of 8.7. The amino terminus is hydrophobic and characteristic of signal sequences that target polypeptides for export from the cytoplasm. There is homology (47.2% identity) between the amino terminus of the LE 16 polypeptide and the corresponding amino terminal domain of the maize phospholipid transfer protein. le16 was expressed in drought-stressed leaf, petiole, and stem tissue and to a much lower extent in the pericarp of mature green tomato fruit and developing seeds. No expression was detected in the pericarp of red fruit or in drought-stressed roots. Expression of le16 was also induced in leaf tissue by a variety of other abiotic stresses including polyethylene glycol-mediated water deficit, salinity, cold stress, and heat stress. None of these stresses or direct applications of abscisic acid induced the expression of le16 in the roots of the same plants. The unique expression characteristics of this gene indicate that novel regulatory mechanisms, in addition to endogenous abscisic acid, are involved in controlling gene expression.

  12. [Prediction of the GVHD after allo-HSCT by sequence similar matching method].

    PubMed

    Zhao, Dan-Dan; Liu, Zhou-Yang; Cao, Yong-Bin; Jiang, Shuang; DA, Wan-Ming; Wu, Xiao-Xiong

    2010-06-01

    This study aimed to investigate the role of the sequence similar matching (SSM) method in predicting GVHD after HLA-unmatched allogeneic hematopoietic stem cell transplantation (allo-HSCT). Data from 23 patients undergoing HLA-unmatched allo-HSCT were analyzed by the SSM method. The results showed that the incidences of acute and severe GVHD were significantly lower in allo-HSCT cases with a total SSM value below 55. In conclusion, the SSM method can be used to predict GVHD in HLA-unmatched allogeneic hematopoietic stem cell transplantation.

  13. Cloning, sequencing, and heterologous expression of an Erwinia cypripedii 314B lactonase specific for L-alpha-hydroxyglutaric acid gamma-lactone.

    PubMed

    Mochizuki, Kazuya

    2006-08-01

    The gene for a lactonase that stereospecifically hydrolyzes (S)-5-oxo-2-tetrahydrofurancarboxylic acid to L-alpha-hydroxyglutaric acid was isolated from Erwinia cypripedii 314B. Determination of the nucleotide sequence showed that the gene consists of a single open reading frame of 1,152 bp that encodes a 383-amino-acid protein. Comparison of the sequence of the predicted protein to that of the enzyme purified from E. cypripedii 314B revealed an N-terminal signal sequence of 19 amino acids. The gene for the mature enzyme was inserted into a pET vector and overexpressed in Escherichia coli. Active recombinant enzyme accumulated in the cells to approximately 30% of the total protein, and the enzyme was purified to homogeneity. The physical and catalytic properties of the recombinant enzyme were indistinguishable from those of the protein purified from E. cypripedii 314B. The deduced amino acid sequence displayed approximately 35% similarity with a putative 3-carboxymuconate cyclase, but exhibited no such activity. The enzyme also showed approximately 35% similarity with 6-phosphogluconolactonase. However, the activity of the enzyme toward 6-phosphogluconolactone was less than 2% of that toward (S)-5-oxo-2-tetrahydrofurancarboxylic acid, demonstrating a novel specificity for this lactonase.
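
    The length arithmetic in this record can be checked with a short sketch. It assumes that the 1,152-bp open reading frame count includes the stop codon (the abstract does not say so explicitly). The function name is hypothetical, and the mature-enzyme length is derived here by subtracting the 19-residue signal sequence rather than quoted from the source.

```python
def protein_length_from_orf(orf_bp: int, includes_stop: bool = True) -> int:
    """Number of amino acids encoded by an open reading frame of orf_bp base pairs."""
    if orf_bp <= 0 or orf_bp % 3 != 0:
        raise ValueError("ORF length must be a positive multiple of 3")
    codons = orf_bp // 3
    # Assumption: the final codon is a stop codon and encodes no residue.
    return codons - 1 if includes_stop else codons


precursor_length = protein_length_from_orf(1152)  # 384 codons, one of them a stop
mature_length = precursor_length - 19             # after removing the signal sequence
print(precursor_length, mature_length)
```

    Under that assumption the result agrees with the 383-amino-acid protein reported in the abstract.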

  14. Automated methods of predicting the function of biological sequences using GO and BLAST

    PubMed Central

    Jones, Craig E; Baumann, Ute; Brown, Alfred L

    2005-01-01

    Background With the exponential increase in genomic sequence data there is a need to develop automated approaches to deducing the biological functions of novel sequences with high accuracy. Our aim is to demonstrate how accuracy benchmarking can be used in a decision-making process evaluating competing designs of biological function predictors. We utilise the Gene Ontology, GO, a directed acyclic graph of functional terms, to annotate sequences with functional information describing their biological context. Initially we examine the effect on accuracy scores of increasing the allowed distance between predicted and a test set of curator assigned terms. Next we evaluate several annotator methods using accuracy benchmarking. Given an unannotated sequence we use the Basic Local Alignment Search Tool, BLAST, to find similar sequences that have already been assigned GO terms by curators. A number of methods were developed that utilise terms associated with the best five matching sequences. These methods were compared against a benchmark method of simply using terms associated with the best BLAST-matched sequence (best BLAST approach). Results The precision and recall of estimates increases rapidly as the amount of distance permitted between a predicted term and a correct term assignment increases. Accuracy benchmarking allows a comparison of annotation methods. A covering graph approach performs poorly, except where the term assignment rate is high. A term distance concordance approach has a similar accuracy to the best BLAST approach, demonstrating lower precision but higher recall. However, a discriminant function method has higher precision and recall than the best BLAST approach and other methods shown here. Conclusion Allowing term predictions to be counted correct if closely related to a correct term decreases the reliability of the accuracy score. As such we recommend using accuracy measures that require exact matching of predicted terms with curator assigned

  15. Automated methods of predicting the function of biological sequences using GO and BLAST.

    PubMed

    Jones, Craig E; Baumann, Ute; Brown, Alfred L

    2005-11-15

    With the exponential increase in genomic sequence data there is a need to develop automated approaches to deducing the biological functions of novel sequences with high accuracy. Our aim is to demonstrate how accuracy benchmarking can be used in a decision-making process evaluating competing designs of biological function predictors. We utilise the Gene Ontology, GO, a directed acyclic graph of functional terms, to annotate sequences with functional information describing their biological context. Initially we examine the effect on accuracy scores of increasing the allowed distance between predicted and a test set of curator assigned terms. Next we evaluate several annotator methods using accuracy benchmarking. Given an unannotated sequence we use the Basic Local Alignment Search Tool, BLAST, to find similar sequences that have already been assigned GO terms by curators. A number of methods were developed that utilise terms associated with the best five matching sequences. These methods were compared against a benchmark method of simply using terms associated with the best BLAST-matched sequence (best BLAST approach). The precision and recall of estimates increases rapidly as the amount of distance permitted between a predicted term and a correct term assignment increases. Accuracy benchmarking allows a comparison of annotation methods. A covering graph approach performs poorly, except where the term assignment rate is high. A term distance concordance approach has a similar accuracy to the best BLAST approach, demonstrating lower precision but higher recall. However, a discriminant function method has higher precision and recall than the best BLAST approach and other methods shown here. Allowing term predictions to be counted correct if closely related to a correct term decreases the reliability of the accuracy score. As such we recommend using accuracy measures that require exact matching of predicted terms with curator assigned terms. Furthermore, we conclude
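
    The exact-match accuracy measure recommended in this abstract can be sketched as follows. This is a minimal illustration, assuming precision and recall are computed per sequence over sets of GO term identifiers with no credit for closely related terms; the function name and the example term sets are hypothetical and not taken from the paper.

```python
def exact_match_precision_recall(predicted, curated):
    """Precision and recall when a predicted term counts only if it exactly
    matches a curator-assigned term."""
    true_positives = len(predicted & curated)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(curated) if curated else 0.0
    return precision, recall


# Illustrative term sets for a single sequence (not data from the study).
predicted_terms = {"GO:0003677", "GO:0005524", "GO:0016301"}
curated_terms = {"GO:0003677", "GO:0006355"}

precision, recall = exact_match_precision_recall(predicted_terms, curated_terms)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

    Allowing a predicted term to count when it lies within some graph distance of a curated term would raise both numbers, which is the effect the abstract cautions against.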

  16. Support vector machines for prediction of protein signal sequences and their cleavage sites.

    PubMed

    Cai, Yu-Dong; Lin, Shuo-liang; Chou, Kuo-Chen

    2003-01-01

    Given a nascent protein sequence, how can one predict its signal peptide or "Zipcode" sequence? This is an important problem for scientists to use signal peptides as a vehicle to find new drugs or to reprogram cells for gene therapy (see, e.g. K.C. Chou, Current Protein and Peptide Science 2002;3:615-22). In this paper, support vector machines (SVMs), a new machine learning method, are applied to this problem. The overall rate of correct prediction for 1939 secretory proteins and 1440 nonsecretory proteins was over 91%. It has not escaped our attention that the new method may also serve as a useful tool for further investigating many unclear details regarding the molecular mechanism of the ZIP code protein-sorting system in cells. Copyright 2002 Elsevier Science Inc.

  17. Linguistic and spatial skills predict early arithmetic development via counting sequence knowledge.

    PubMed

    Zhang, Xiao; Koponen, Tuire; Räsänen, Pekka; Aunola, Kaisa; Lerkkanen, Marja-Kristiina; Nurmi, Jari-Erik

    2014-01-01

    Utilizing a longitudinal sample of Finnish children (ages 6-10), two studies examined how early linguistic (spoken vs. written) and spatial skills predict later development of arithmetic, and whether counting sequence knowledge mediates these associations. In Study 1 (N = 1,880), letter knowledge and spatial visualization, measured in kindergarten, predicted the level of arithmetic in first grade, and later growth through third grade. Study 2 (n = 378) further showed that these associations were mediated by counting sequence knowledge measured in first grade. These studies add to the literature by demonstrating the importance of written language for arithmetic development. The findings are consistent with the hypothesis that linguistic and spatial skills can improve arithmetic development by enhancing children's number-related knowledge.

  18. Fast and Accurate Accessible Surface Area Prediction Without a Sequence Profile.

    PubMed

    Faraggi, Eshel; Kouza, Maksim; Zhou, Yaoqi; Kloczkowski, Andrzej

    2017-01-01

    A fast accessible surface area (ASA) predictor is presented. In this new approach no residue mutation profiles generated by multiple sequence alignments are used as inputs. Instead, we use only single sequence information and global features such as single-residue and two-residue compositions of the chain. The resulting predictor is both far more efficient than sequence-alignment-based predictors and of comparable accuracy to them. Introduction of the global inputs significantly helps achieve this comparable accuracy. The predictor, termed ASAquick, is found to perform similarly well for so-called easy and hard cases, indicating generalizability and possible usability for de novo protein structure prediction. The source code and Linux executables for ASAquick are available from Research and Information Systems at http://mamiris.com and from the Battelle Center for Mathematical Medicine at http://mathmed.org .

  19. Structured prediction models for RNN based sequence labeling in clinical text

    PubMed Central

    Jagannatha, Abhyuday N; Yu, Hong

    2016-01-01

    Sequence labeling is a widely used method for named entity recognition and information extraction from unstructured natural language data. In the clinical domain, one major application of sequence labeling involves extraction of medical entities such as medication, indication, and side-effects from Electronic Health Record narratives. Sequence labeling in this domain presents its own set of challenges and objectives. In this work we experimented with various CRF based structured learning models with Recurrent Neural Networks. We extend the previously studied LSTM-CRF models with explicit modeling of pairwise potentials. We also propose an approximate version of skip-chain CRF inference with RNN potentials. We use these methodologies for structured prediction in order to improve the exact phrase detection of various medical entities. PMID:28004040

  20. CATH: an expanded resource to predict protein function through structure and sequence

    PubMed Central

    Dawson, Natalie L.; Lewis, Tony E.; Das, Sayoni; Lees, Jonathan G.; Lee, David; Ashford, Paul; Orengo, Christine A.; Sillitoe, Ian

    2017-01-01

    The latest version of the CATH-Gene3D protein structure classification database has recently been released (version 4.1, http://www.cathdb.info). The resource comprises over 300 000 domain structures and over 53 million protein domains classified into 2737 homologous superfamilies, doubling the number of predicted protein domains in the previous version. The daily-updated CATH-B, which contains our very latest domain assignment data, provides putative classifications for over 100 000 additional protein domains. This article describes developments to the CATH-Gene3D resource over the last two years since the publication in 2015, including: significant increases to our structural and sequence coverage; expansion of the functional families in CATH; building a support vector machine (SVM) to automatically assign domains to superfamilies; improved search facilities to return alignments of query sequences against multiple sequence alignments; the redesign of the web pages and download site. PMID:27899584

  1. Resetting the Site: Redirecting Integration of an Insertion Sequence in a Predictable Way

    PubMed Central

    Guynet, Catherine; Achard, Adeline; Hoang, Bao Ton; Barabas, Orsolya; Hickman, Alison Burgess; Dyda, Frederick; Chandler, Michael

    2013-01-01

    Target site choice is a complex and poorly understood aspect of DNA transposition despite its importance in rational transposon-mediated gene delivery. Though most transposons choose target sites essentially randomly or with some slight sequence or structural preferences, insertion sequence IS608 from Helicobacter pylori, which transposes using single-stranded DNA, always inserts just 3′ of a TTAC tetranucleotide. Our results from studies on the IS608 transposition mechanism demonstrated that the transposase recognizes its target site by co-opting an internal segment of transposon DNA and utilizes it for specific recognition of the target sites through base-pairing. This suggested a way to redirect IS608 transposition to novel target sites. As we demonstrate here, we can now direct insertions in a predictable way into a variety of different chosen target sequences, both in vitro and in vivo. PMID:19524540

  2. Amino acid sequence of myoglobin from white-tailed deer (Odocoileus virginianus).

    PubMed

    Joseph, Poulson; Suman, Surendranath P; Li, Shuting; Fontaine, Michele; Steinke, Laurey

    2012-10-01

    Our objective was to determine the primary structure of white-tailed deer myoglobin (Mb). White-tailed deer Mb was isolated from cardiac muscles employing ammonium sulfate precipitation and gel-filtration chromatography. The amino acid sequence was determined by Edman degradation. Sequence analyses of intact Mb as well as tryptic- and cyanogen bromide-peptides yielded the complete primary structure of white-tailed deer Mb, which shared 100% similarity with red deer Mb. White-tailed deer Mb consists of 153 amino acid residues and shares more than 96% sequence similarity with myoglobins from meat-producing ruminants, such as cattle, buffalo, sheep, and goat. Similar to sheep and goat myoglobins, white-tailed deer Mb contains 12 histidine residues. Proximal (position 93) and distal (position 64) histidine residues responsible for maintaining the stability of heme are conserved in white-tailed deer Mb.

  3. Amino acid sequences of heterotrophic and photosynthetic ferredoxins from the tomato plant (Lycopersicon esculentum Mill.).

    PubMed

    Kamide, K; Sakai, H; Aoki, K; Sanada, Y; Wada, K; Green, L S; Yee, B C; Buchanan, B B

    1995-11-01

    Several forms (isoproteins) of ferredoxin in roots, leaves, and green and red pericarps in tomato plants (Lycopersicon esculentum Mill.) were earlier identified on the basis of N-terminal amino acid sequence and chromatographic behavior (Green et al. 1991). In the present study, a large-scale preparation made possible determination of the full-length amino acid sequence of the two ferredoxins from leaves. The ferredoxins characteristic of fruit and root were sequenced from the amino terminus to the 30th residue or beyond. The leaf ferredoxins were confirmed to be expressed in pericarp of both green and red fruit. The ferredoxins characteristic of fruit and root appeared to be restricted to those tissues. The results extend earlier findings in demonstrating that ferredoxin occurs in the major organs of the tomato plant where it appears to function irrespective of photosynthetic competence.

  4. Complete complementary DNA-derived amino acid sequence of canine cardiac phospholamban.

    PubMed Central

    Fujii, J; Ueno, A; Kitano, K; Tanaka, S; Kadoma, M; Tada, M

    1987-01-01

    Complementary DNA (cDNA) clones specific for phospholamban of sarcoplasmic reticulum membranes have been isolated from a canine cardiac cDNA library. The amino acid sequence deduced from the cDNA sequence indicates that phospholamban consists of 52 amino acid residues and lacks an amino-terminal signal sequence. The protein has an inferred mol wt 6,080 that is in agreement with its apparent monomeric mol wt 6,000, estimated previously by sodium dodecyl sulfate-polyacrylamide gel electrophoresis. Phospholamban contains two distinct domains, a hydrophilic region at the amino terminus (domain I) and a hydrophobic region at the carboxy terminus (domain II). We propose that domain I is localized at the cytoplasmic surface and offers phosphorylatable sites whereas domain II is anchored into the sarcoplasmic reticulum membrane. PMID:3793929

  5. Nucleotide sequence and the encoded amino acids of human apolipoprotein A-I mRNA.

    PubMed Central

    Law, S W; Brewer, H B

    1984-01-01

    The cDNA clones encoding the precursor form of human liver apolipoprotein A-I (apoA-I), preproapoA-I, have been isolated from a cDNA library. A 17-base synthetic oligonucleotide based on residues 108-113 of apoA-I and a 26-base primer-extended, dideoxynucleotide-terminated cDNA were used as hybridization probes to select for recombinant plasmids bearing the apoA-I sequence. The complete nucleic acid sequence of human liver preproapoA-I has been determined by analysis of the cloned cDNA. The sequence is composed of 801 nucleotides encoding 267 amino acid residues. PreproapoA-I contains an 18-amino-acid prepeptide and a 6-amino-acid propeptide connected to the amino terminus of the 243-amino acid mature apoA-I. Southern blotting analysis of chromosomal DNA obtained from peripheral blood indicated the apoA-I gene is contained in a 2.1-kilobase-pair Pst I fragment and there is no gross difference in structural organization between the normal apoA-I gene and the Tangier disease apoA-I gene. PMID:6198645

  6. Predicting the DNA sequence dependence of nanopore ion current using atomic-resolution Brownian dynamics

    PubMed Central

    Comer, Jeffrey; Aksimentiev, Aleksei

    2012-01-01

    It has become possible to distinguish DNA molecules of different nucleotide sequences by measuring ion current passing through a narrow pore containing DNA. To assist experimentalists in interpreting the results of such measurements and to improve the DNA sequence detection method, we have developed a computational approach that has both the atomic-scale accuracy and the computational efficiency required to predict DNA sequence-specific differences in the nanopore ion current. In our Brownian dynamics method, the interaction between the ions and DNA is described by three-dimensional potential of mean force maps determined to a 0.03 nm resolution from all-atom molecular dynamics simulations. While this atomic-resolution Brownian dynamics method produces results with orders of magnitude less computational effort than all-atom molecular dynamics requires, we show here that the ion distributions and ion currents predicted by the two methods agree. Finally, using our Brownian dynamics method, we find that a small change in the sequence of DNA within a pore can cause a large change in the ion current, and validate this result with all-atom molecular dynamics. PMID:22606364

  7. Mathematical Characterization of Protein Sequences Using Patterns as Chemical Group Combinations of Amino Acids.

    PubMed

    Das, Jayanta Kumar; Das, Provas; Ray, Korak Kumar; Choudhury, Pabitra Pal; Jana, Siddhartha Sankar

    2016-01-01

    Comparison of amino acid sequence similarity is the fundamental concept behind protein phylogenetic tree formation. By virtue of this method, we can explain the evolutionary relationships, but further explanations are not possible unless sequences are studied through the chemical nature of individual amino acids. Here we develop a new methodology to characterize protein sequences on the basis of the chemical nature of the amino acids. We design various algorithms for studying the variation of chemical group transitions and various chemical group combinations as patterns in the protein sequences. The amino acid sequences of the conventional myosin II head domain from 14 family members are taken to illustrate this new approach. We find two blocks of maximum length 6 aa, 'FPKATD' and 'Y/FTNEKL', without repetition of the same chemical nature, and one block of maximum length 20 aa with repetition of chemical nature, which are common among all 14 members. We also check commonality with another motor protein sub-family, kinesin KIF1A. Based on our analysis we find a common block of length 8 aa in both myosin II and KIF1A. This motif is located in the neck linker region, which could be responsible for the generation of mechanical force, enabling us to find the unique blocks which remain chemically conserved across the family. We also validate our methodology with different protein families such as MYOI, Myosin light chain kinase (MLCK) and Rho-associated protein kinase (ROCK), Na+/K+-ATPase and Ca2+-ATPase. Altogether, our studies provide a new methodology for investigating the conserved amino acids' pattern in different proteins.

  8. dnaMATE: a consensus melting temperature prediction server for short DNA sequences.

    PubMed

    Panjkovich, Alejandro; Norambuena, Tomás; Melo, Francisco

    2005-07-01

    An accurate and robust large-scale melting temperature prediction server for short DNA sequences is dispatched. The server calculates a consensus melting temperature value using the nearest-neighbor model based on three independent thermodynamic data tables. The consensus method gives an accurate prediction of melting temperature, as it has been recently demonstrated in a benchmark performed using all available experimental data for DNA sequences within the length range of 16-30 nt. This constitutes the first web server that has been implemented to perform a large-scale calculation of melting temperatures in real time (up to 5000 DNA sequences can be submitted in a single run). The expected accuracy of calculations carried out by this server in the range of 50-600 mM monovalent salt concentration is that 89% of the melting temperature predictions will have an error or deviation of <5 degrees C from experimental data. The server can be freely accessed at http://dna.bio.puc.cl/tm.html. The standalone executable versions of this software for LINUX, Macintosh and Windows platforms are also freely available at the same web site. Detailed further information supporting this server is available at the same web site referenced above.

  9. dnaMATE: a consensus melting temperature prediction server for short DNA sequences

    PubMed Central

    Panjkovich, Alejandro; Norambuena, Tomás; Melo, Francisco

    2005-01-01

    An accurate and robust large-scale melting temperature prediction server for short DNA sequences is dispatched. The server calculates a consensus melting temperature value using the nearest-neighbor model based on three independent thermodynamic data tables. The consensus method gives an accurate prediction of melting temperature, as it has been recently demonstrated in a benchmark performed using all available experimental data for DNA sequences within the length range of 16–30 nt. This constitutes the first web server that has been implemented to perform a large-scale calculation of melting temperatures in real time (up to 5000 DNA sequences can be submitted in a single run). The expected accuracy of calculations carried out by this server in the range of 50–600 mM monovalent salt concentration is that 89% of the melting temperature predictions will have an error or deviation of <5°C from experimental data. The server can be freely accessed at . The standalone executable versions of this software for LINUX, Macintosh and Windows platforms are also freely available at the same web site. Detailed further information supporting this server is available at the same web site referenced above. PMID:15980538

  10. Predicting protein ligand binding sites by combining evolutionary sequence conservation and 3D structure.

    PubMed

    Capra, John A; Laskowski, Roman A; Thornton, Janet M; Singh, Mona; Funkhouser, Thomas A

    2009-12-01

    Identifying a protein's functional sites is an important step towards characterizing its molecular function. Numerous structure- and sequence-based methods have been developed for this problem. Here we introduce ConCavity, a small molecule binding site prediction algorithm that integrates evolutionary sequence conservation estimates with structure-based methods for identifying protein surface cavities. In large-scale testing on a diverse set of single- and multi-chain protein structures, we show that ConCavity substantially outperforms existing methods for identifying both 3D ligand binding pockets and individual ligand binding residues. As part of our testing, we perform one of the first direct comparisons of conservation-based and structure-based methods. We find that the two approaches provide largely complementary information, which can be combined to improve upon either approach alone. We also demonstrate that ConCavity has state-of-the-art performance in predicting catalytic sites and drug binding pockets. Overall, the algorithms and analysis presented here significantly improve our ability to identify ligand binding sites and further advance our understanding of the relationship between evolutionary sequence conservation and structural and functional attributes of proteins. Data, source code, and prediction visualizations are available on the ConCavity web site (http://compbio.cs.princeton.edu/concavity/).

  11. Genome-Wide Prediction of DNA Methylation Using DNA Composition and Sequence Complexity in Human

    PubMed Central

    Wu, Chengchao; Yao, Shixin; Li, Xinghao; Chen, Chujia; Hu, Xuehai

    2017-01-01

    DNA methylation plays a significant role in transcriptional regulation by repressing activity. Change of the DNA methylation level is an important factor affecting the expression of target genes and downstream phenotypes. Because current experimental technologies can only assay a small proportion of CpG sites in the human genome, reliable computational models for predicting genome-wide DNA methylation are urgently needed. Here, we propose a novel algorithm that accurately extracts seven sequence complexity features, and we develop a support-vector-machine-based prediction model that integrates them with previously reported DNA composition features (trinucleotide frequency and GC content, 65 features), using the methylation profiles of human embryonic stem cells. The prediction results from 22 human chromosomes with windows of varying size showed that the 600-bp window achieved the best average accuracy of 94.7%. Moreover, comparisons with two existing methods further showed the superiority of our model, and cross-species predictions on mouse data demonstrated that our model generalizes to some extent. Finally, a statistical test of the experimental data and the predicted data on functional regions annotated by ChromHMM found that six out of 10 regions were consistent, which implies reliable prediction of unassayed CpG sites. Accordingly, we believe that our novel model will be useful and reliable in predicting DNA methylation. PMID:28212312
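
    The 65 DNA composition features mentioned in this record (64 trinucleotide frequencies plus GC content) can be computed along the following lines. This is a sketch under assumptions rather than the authors' code: trinucleotides are counted with an overlapping one-base stride, positions containing ambiguous bases such as N are skipped, and the function name composition_features is hypothetical.

```python
from itertools import product

# All 64 trinucleotides in a fixed order, so feature positions are stable.
TRINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=3)]


def composition_features(window):
    """Return 65 features for one sequence window:
    64 trinucleotide frequencies followed by the GC content."""
    seq = window.upper()
    counts = dict.fromkeys(TRINUCLEOTIDES, 0)
    total = 0
    # Overlapping trinucleotides (stride 1); this stride is an assumption.
    for i in range(len(seq) - 2):
        tri = seq[i:i + 3]
        if tri in counts:  # skip trinucleotides containing N or other symbols
            counts[tri] += 1
            total += 1
    freqs = [counts[t] / total if total else 0.0 for t in TRINUCLEOTIDES]
    # GC content over unambiguous bases only.
    acgt = sum(seq.count(base) for base in "ACGT")
    gc = (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0
    return freqs + [gc]
```

    A feature vector of this kind, computed for each 600-bp window, could then be concatenated with the seven sequence complexity features and passed to a support vector machine.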

  12. Blind prediction of deleterious amino acid variations with SNPs&GO.

    PubMed

    Capriotti, Emidio; Martelli, Pier Luigi; Fariselli, Piero; Casadio, Rita

    2017-09-01

    SNPs&GO is a machine learning method for predicting the association of single amino acid variations (SAVs) to disease, considering protein functional annotation. The method is a binary classifier that implements a support vector machine algorithm to discriminate between disease-related and neutral SAVs. SNPs&GO combines information from protein sequence with functional annotation encoded by gene ontology (GO) terms. Tested in sequence mode on more than 38,000 SAVs from the SwissVar dataset, our method reached 81% overall accuracy and an area under the receiver operating characteristic curve of 0.88 with low false-positive rate. In almost all the editions of the Critical Assessment of Genome Interpretation (CAGI) experiments, SNPs&GO ranked among the most accurate algorithms for predicting the effect of SAVs. In this paper, we summarize the best results obtained by SNPs&GO on disease-related variations of four CAGI challenges relative to the following genes: CHEK2 (CAGI 2010), RAD50 (CAGI 2011), p16-INK (CAGI 2013), and NAGLU (CAGI 2016). Result evaluation provides insights about the accuracy of our algorithm and the relevance of GO terms in annotating the effect of the variants. It also helps to define good practices for the detection of deleterious SAVs. © 2017 Wiley Periodicals, Inc.

  13. Software scripts for quality checking of high-throughput nucleic acid sequencers.

    PubMed

    Lazo, G R; Tong, J; Miller, R; Hsia, C; Rausch, C; Kang, Y; Anderson, O D

    2001-06-01

    We have developed a graphical interface that allows the researcher to view and assess the quality of sequencing results, using a series of program scripts developed to process data generated by automated sequencers. The scripts are written in the Perl programming language and are executable under the cgi-bin directory of a Web server environment. The scripts direct nucleic acid sequencing trace file data output from automated sequencers to be analyzed by the phred molecular biology program, and the results are displayed as graphical hypertext mark-up language (HTML) pages. The scripts are mainly designed to handle 96-well microtiter dish samples, but they can also read data from 384-well microtiter dishes, 96 samples at a time. The scripts may be customized for different laboratory environments and computer configurations. Web links to the sources and discussion page are provided.

  14. Further characterization and amino acid sequence of m-type thioredoxins from spinach chloroplasts.

    PubMed

    Maeda, K; Tsugita, A; Dalzoppo, D; Vilbois, F; Schürmann, P

    1986-01-02

    The complete primary structure of m-type thioredoxin from spinach chloroplasts has been determined by conventional sequencing including fragmentation, Edman degradation and carboxypeptidase digestion. As already reported [Tsugita, A., Maeda, K. & Schürmann, P. (1983) Biochem. Biophys. Res. Commun. 115, 1-7], these thioredoxins contain the same active-site sequence as thioredoxins from other sources. Based on the amino acid sequence, thioredoxin mc contains 103 residues, has a relative molecular mass of 11425 and a molar absorption coefficient at 280 nm of 19 300 M⁻¹ cm⁻¹. The spinach thioredoxin mc has an overall homology of 44% with the thioredoxin from Escherichia coli, mainly due to differences in the N-terminal and C-terminal regions.

  15. Quantitative analysis and prediction of G-quadruplex forming sequences in double-stranded DNA

    PubMed Central

    Kim, Minji; Kreig, Alex; Lee, Chun-Ying; Rube, H. Tomas; Calvert, Jacob; Song, Jun S.; Myong, Sua

    2016-01-01

    G-quadruplex (GQ) is a four-stranded DNA structure that can be formed in guanine-rich sequences. GQ structures have been proposed to regulate diverse biological processes including transcription, replication, translation and telomere maintenance. Recent studies have demonstrated the existence of GQ DNA in live mammalian cells and a significant number of potential GQ forming sequences in the human genome. We present a systematic and quantitative analysis of GQ folding propensity on a large set of 438 GQ forming sequences in double-stranded DNA by integrating fluorescence measurement, single-molecule imaging and computational modeling. We find that short minimum loop length and the thymine base are two main factors that lead to high GQ folding propensity. Linear and Gaussian process regression models further validate that the GQ folding potential can be predicted with high accuracy based on the loop length distribution and the nucleotide content of the loop sequences. Our study provides important new parameters that can inform the evaluation and classification of putative GQ sequences in the human genome. PMID:27095201
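
    The two predictors highlighted in this record, minimum loop length and thymine content of the loops, can be extracted from a putative GQ sequence roughly as follows. This is an illustrative sketch only: it assumes that G-tracts are runs of at least three guanines and that the loops are the segments between the first four tracts, a common convention that the abstract itself does not state. The function name loop_features and its output keys are hypothetical, and the regression models from the study are not reproduced.

```python
import re


def loop_features(seq, min_tract=3):
    """Loop-based features of a putative G-quadruplex sequence.

    Assumption: a G-tract is a run of >= min_tract guanines, and the
    three loops lie between the first four G-tracts.
    Returns None if fewer than four G-tracts are found.
    """
    seq = seq.upper()
    tracts = list(re.finditer("G{%d,}" % min_tract, seq))
    if len(tracts) < 4:
        return None
    tracts = tracts[:4]
    # Each loop is the segment between consecutive G-tracts. Loops are never
    # empty, because a greedy G-run always ends at a non-G base.
    loops = [seq[a.end():b.start()] for a, b in zip(tracts, tracts[1:])]
    loop_bases = "".join(loops)
    return {
        "loops": loops,
        "loop_lengths": [len(loop) for loop in loops],
        "min_loop_length": min(len(loop) for loop in loops),
        "thymine_fraction": loop_bases.count("T") / len(loop_bases),
    }
```

    For the human telomeric repeat GGGTTAGGGTTAGGGTTAGGG, this yields three TTA loops, a minimum loop length of 3 and a thymine fraction of 2/3; features of this kind could serve as inputs to a linear or Gaussian process regression.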

  16. Prediction of Protein Pairs Sharing Common Active Ligands Using Protein Sequence, Structure, and Ligand Similarity.

    PubMed

    Chen, Yu-Chen; Tolbert, Robert; Aronov, Alex M; McGaughey, Georgia; Walters, W Patrick; Meireles, Lidio

    2016-09-26

    We benchmarked the ability of comparative computational approaches to correctly discriminate protein pairs sharing a common active ligand (positive protein pairs) from protein pairs with no common active ligands (negative protein pairs). Since the target and the off-targets of a drug share at least a common ligand, i.e., the drug itself, the prediction of positive protein pairs may help identify off-targets. We evaluated representative prot