Science.gov

Sample records for acid sequence predicts

  1. Predicting intrinsic disorder from amino acid sequence.

    PubMed

    Obradovic, Zoran; Peng, Kang; Vucetic, Slobodan; Radivojac, Predrag; Brown, Celeste J; Dunker, A Keith

    2003-01-01

    Blind predictions of intrinsic order and disorder were made on 42 proteins subsequently revealed to contain 9,044 ordered residues, 284 disordered residues in 26 segments of length 30 residues or less, and 281 disordered residues in 2 disordered segments of length greater than 30 residues. The accuracies of the six predictors used in this experiment ranged from 77% to 91% for the ordered regions and from 56% to 78% for the disordered segments. The average of the order and disorder predictions ranged from 73% to 77%. The prediction of disorder in the shorter segments was poor, from 25% to 66% correct, while the prediction of disorder in the longer segments was better, from 75% to 95% correct. Four of the predictors were composed of ensembles of neural networks. This enabled them to deal more efficiently with the large asymmetry in the training data through diversified sampling from the significantly larger ordered set and achieve better accuracy on ordered and long disordered regions. The exclusive use of long disordered regions for predictor training likely contributed to the disparity of the predictions on long versus short disordered regions, while averaging the output values over 61-residue windows to eliminate short predictions of order or disorder probably contributed to the even greater disparity for three of the predictors. This experiment supports the predictability of intrinsic disorder from amino acid sequence. PMID:14579347

  2. Predicting protein disorder by analyzing amino acid sequence

    PubMed Central

    Yang, Jack Y; Yang, Mary Qu

    2008-01-01

    Background: Many protein regions and some entire proteins have no definite tertiary structure, presenting instead as dynamic, disordered ensembles under different physicochemical circumstances. These proteins and regions are known as Intrinsically Unstructured Proteins (IUP). IUP have been associated with a wide range of protein functions, along with roles in diseases characterized by protein misfolding and aggregation. Results: Identifying IUP is an important task in structural and functional genomics. We extract useful features from sequences and develop machine learning algorithms for the above task. We compare our IUP predictor with PONDRs (mainly neural-network-based predictors), disEMBL (also based on neural networks) and Globplot (based on disorder propensity). Conclusion: We find that augmenting features derived from physicochemical properties of amino acids (such as hydrophobicity, complexity etc.) and using an ensemble method proved beneficial. The IUP predictor is a viable alternative software tool for identifying IUP protein regions and proteins. PMID:18831799

  3. Protein location prediction using atomic composition and global features of the amino acid sequence

    SciTech Connect

    Cherian, Betsy Sheena; Nair, Achuthsankar S.

    2010-01-22

    The subcellular location of a protein is valuable information for determining its function, screening for drug candidates, vaccine design, annotation of gene products and selecting relevant proteins for further studies. Computational prediction of subcellular localization deals with predicting the location of a protein from its amino acid sequence. For a computational localization prediction method to be more accurate, it should exploit all possible relevant biological features that contribute to the subcellular localization. In this work, we extracted the biological features from the full-length protein sequence to incorporate more biological information. A new biological feature, the distribution of atomic composition, is used effectively together with multiple physicochemical properties, amino acid composition, three-part amino acid composition, and sequence similarity for predicting the subcellular location of the protein. Support Vector Machines are designed for four modules and prediction is made by a weighted voting system. Our system makes predictions with accuracies of 100%, 82.47% and 88.81% for the self-consistency test, jackknife test and independent data test, respectively. Our results provide evidence that prediction based on the biological features derived from the full-length amino acid sequence gives better accuracy than prediction based on features derived from the N-terminal alone. Considering the features as a distribution within the entire sequence will bring out the underlying property distribution in greater detail to enhance the prediction accuracy.

  4. Fast computational methods for predicting protein structure from primary amino acid sequence

    DOEpatents

    Agarwal, Pratul Kumar

    2011-07-19

    The present invention provides a method utilizing primary amino acid sequence of a protein, energy minimization, molecular dynamics and protein vibrational modes to predict three-dimensional structure of a protein. The present invention also determines possible intermediates in the protein folding pathway. The present invention has important applications to the design of novel drugs as well as protein engineering. The present invention predicts the three-dimensional structure of a protein independent of size of the protein, overcoming a significant limitation in the prior art.

  5. Swfoldrate: predicting protein folding rates from amino acid sequence with sliding window method.

    PubMed

    Cheng, Xiang; Xiao, Xuan; Wu, Zhi-cheng; Wang, Pu; Lin, Wei-zhong

    2013-01-01

    Protein folding is the process by which a protein proceeds from its denatured state to its specific biologically active conformation. Understanding the relationship between sequences and the folding rates of proteins remains an important challenge. Most previous methods of predicting protein folding rate require the tertiary structure of a protein as an input. In this study, the long-range and short-range contacts in proteins were used to derive an extended version of the pseudo amino acid composition based on the sliding window method. This method is capable of predicting protein folding rates from the amino acid sequence alone, without the aid of any structural class information. We systematically studied the contributions of individual features to folding rate prediction. The optimal feature selection procedure was adopted by combining forward feature selection and sequential backward selection. Using the jackknife cross validation test, the method was demonstrated on a large dataset. The predictor was built on multitudinous physicochemical and statistical features of proteins using a nonlinear support vector machine (SVM) regression model, and it obtained excellent agreement between predicted and experimentally observed folding rates of proteins. The correlation coefficient is 0.9313 and the standard error is 2.2692. The prediction server is freely available at http://www.jci-bioinfo.cn/swfrate/input.jsp. PMID:22933332

  6. Nucleotide and predicted amino acid sequences of cloned human and mouse preprocathepsin B cDNAs.

    PubMed Central

    Chan, S J; San Segundo, B; McCormick, M B; Steiner, D F

    1986-01-01

    Cathepsin B is a lysosomal thiol proteinase that may have additional extralysosomal functions. To further our investigations on the structure, mode of biosynthesis, and intracellular sorting of this enzyme, we have determined the complete coding sequences for human and mouse preprocathepsin B by using cDNA clones isolated from human hepatoma and kidney phage libraries. The nucleotide sequences predict that the primary structure of preprocathepsin B contains 339 amino acids organized as follows: a 17-residue NH2-terminal prepeptide sequence followed by a 62-residue propeptide region, 254 residues in mature (single chain) cathepsin B, and a 6-residue extension at the COOH terminus. A comparison of procathepsin B sequences from three species (human, mouse, and rat) reveals that the homology between the propeptides is relatively conserved with a minimum of 68% sequence identity. In particular, two conserved sequences in the propeptide that may be functionally significant include a potential glycosylation site and the presence of a single cysteine at position 59. Comparative analysis of the three sequences also suggests that processing of procathepsin B is a multistep process, during which enzymatically active intermediate forms may be generated. The availability of the cDNA clones will facilitate the identification of possible active or inactive intermediate processive forms as well as studies on the transcriptional regulation of the cathepsin B gene. PMID:3463996
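
    The segment layout given in this abstract can be checked with a few lines of arithmetic. The sketch below uses only the lengths and order stated in the abstract; the 1-based numbering from the first residue of the prepeptide and the helper name are illustrative assumptions.

```python
# Segment lengths as stated in the abstract, in N- to C-terminal order.
SEGMENTS = [
    ("prepeptide", 17),
    ("propeptide", 62),
    ("mature cathepsin B", 254),
    ("C-terminal extension", 6),
]


def segment_ranges(segments):
    """Return (name, first_residue, last_residue) tuples using 1-based numbering."""
    ranges = []
    start = 1
    for name, length in segments:
        end = start + length - 1
        ranges.append((name, start, end))
        start = end + 1
    return ranges


# The four segments should add up to the stated 339 residues.
total_length = sum(length for _, length in SEGMENTS)
```

    The lengths sum to 17 + 62 + 254 + 6 = 339, matching the stated size of preprocathepsin B.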

  7. ENTPRISE: An Algorithm for Predicting Human Disease-Associated Amino Acid Substitutions from Sequence Entropy and Predicted Protein Structures

    PubMed Central

    Zhou, Hongyi; Gao, Mu; Skolnick, Jeffrey

    2016-01-01

    The advance of next-generation sequencing technologies has made exome sequencing rapid and relatively inexpensive. A major application of exome sequencing is the identification of genetic variations likely to cause Mendelian diseases. This requires processing large amounts of sequence information and therefore computational approaches that can accurately and efficiently identify the subset of disease-associated variations are needed. The accuracy and high false positive rates of existing computational tools leave much room for improvement. Here, we develop a boosted tree regression machine-learning approach to predict human disease-associated amino acid variations by utilizing a comprehensive combination of protein sequence and structure features. On comparing our method, ENTPRISE, to the state-of-the-art methods SIFT, PolyPhen-2, MUTATIONASSESSOR, MUTATIONTASTER, FATHMM, ENTPRISE exhibits significant improvement. In particular, on a testing dataset consisting of only proteins with balanced disease-associated and neutral variations defined as having the ratio of neutral/disease-associated variations between 0.3 and 3, the Matthews Correlation Coefficient by ENTPRISE is 0.493 as compared to 0.432 by PPH2-HumVar, 0.406 by SIFT, 0.403 by MUTATIONASSESSOR, 0.402 by PPH2-HumDiv, 0.305 by MUTATIONTASTER, and 0.181 by FATHMM. ENTPRISE is then applied to nucleic acid binding proteins in the human proteome. Disease-associated predictions are shown to be highly correlated with the number of protein-protein interactions. Both these predictions and the ENTPRISE server are freely available for academic users as a web service at http://cssb.biology.gatech.edu/entprise/. PMID:26982818
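
    For readers unfamiliar with the Matthews correlation coefficient used to compare the tools in this abstract, the standard definition from a binary confusion matrix can be sketched as follows. The function name and the zero-denominator convention are illustrative assumptions; this is not code from ENTPRISE.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention (an assumption here): report 0 when a row or
        # column of the confusion matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom
```

    The value ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction), so a score such as 0.493 versus 0.181 reflects a sizeable gap on the same test set.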

  8. Prediction of Residue Status to Be Protected or Not Protected From Hydrogen Exchange Using Amino Acid Sequence Only.

    PubMed

    Dovidchenko, Nikita V; Galzitskaya, Oxana V

    2008-01-01

    We have outlined here some structural aspects of local flexibility. Important functional properties are related to flexible segments. We try to predict regions that have been shown to exhibit the highest probability of being folded in the equilibrium intermediate or native state and will be protected from hydrogen exchange using amino acid sequence only. Our approach FoldUnfold for the prediction of unstructured regions has been applied to seven different proteins. For 80% of the residues considered in this paper we can predict correctly their status: will they be protected or not from hydrogen exchange. An additional goal of our study is to assess whether properties inferred using the bioinformatics approach are easily applicable to predict behavior of proteins in solution. PMID:18949078

  9. Prediction of Residue Status to Be Protected or Not Protected From Hydrogen Exchange Using Amino Acid Sequence Only

    PubMed Central

    Dovidchenko, Nikita V; Galzitskaya, Oxana V

    2008-01-01

    We have outlined here some structural aspects of local flexibility. Important functional properties are related to flexible segments. We try to predict regions that have been shown to exhibit the highest probability of being folded in the equilibrium intermediate or native state and will be protected from hydrogen exchange using amino acid sequence only. Our approach FoldUnfold for the prediction of unstructured regions has been applied to seven different proteins. For 80% of the residues considered in this paper we can predict correctly their status: will they be protected or not from hydrogen exchange. An additional goal of our study is to assess whether properties inferred using the bioinformatics approach are easily applicable to predict behavior of proteins in solution. PMID:18949078

  10. Using Chou's pseudo amino acid composition to predict protein quaternary structure: a sequence-segmented PseAAC approach.

    PubMed

    Zhang, Shao-Wu; Chen, Wei; Yang, Feng; Pan, Quan

    2008-10-01

    In the protein universe, many proteins are composed of two or more polypeptide chains, generally referred to as subunits, which associate through noncovalent interactions and, occasionally, disulfide bonds to form protein quaternary structures. It has long been known that the functions of proteins are closely related to their quaternary structures; some examples include enzymes, hemoglobin, DNA polymerase, and ion channels. However, it is extremely labor-expensive and even impossible to quickly determine the structures of hundreds of thousands of protein sequences solely from experiments. Since the number of protein sequences entering databanks is increasing rapidly, it is highly desirable to develop computational methods for classifying the quaternary structures of proteins from their primary sequences. Since the concept of Chou's pseudo amino acid composition (PseAAC) was introduced, a variety of approaches, such as residue conservation scores, von Neumann entropy, multiscale energy, autocorrelation function, moment descriptors, and cellular automata, have been utilized to formulate the PseAAC for predicting different attributes of proteins. Here, in a different approach, a sequence-segmented PseAAC is introduced to represent protein samples. Meanwhile, multiclass SVM classifier modules were adopted to classify protein quaternary structures. As a demonstration, the dataset constructed by Chou and Cai [(2003) Proteins 53:282-289] was adopted as a benchmark dataset. The overall jackknife success rates thus obtained were 88.2-89.1%, indicating that the new approach is quite promising for predicting protein quaternary structure. PMID:18427713
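
    The idea of representing a protein by segment-wise composition can be sketched in simplified form. The example below computes plain amino acid composition for each of several equal-length segments and concatenates the results; it deliberately omits the sequence-order correlation terms that a full PseAAC includes, and the segment count, splitting rule, and function names are illustrative assumptions rather than the authors' formulation.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def composition(segment):
    """Fraction of each standard amino acid in a segment (20 values)."""
    n = max(len(segment), 1)  # avoid division by zero for empty segments
    return [segment.count(a) / n for a in AMINO_ACIDS]


def segmented_composition(sequence, n_segments=3):
    """Concatenate the composition vectors of n_segments near-equal pieces."""
    n = len(sequence)
    features = []
    for i in range(n_segments):
        start = i * n // n_segments
        end = (i + 1) * n // n_segments
        features.extend(composition(sequence[start:end]))
    return features
```

    A feature vector of this kind (20 values per segment) could then be passed to a multiclass classifier such as an SVM; the point of segmenting is that two proteins with the same overall composition but different residue distributions along the chain receive different vectors.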

  11. Prediction of posttranslational modification sites from amino acid sequences with kernel methods.

    PubMed

    Xu, Yan; Wang, Xiaobo; Wang, Yongcui; Tian, Yingjie; Shao, Xiaojian; Wu, Ling-Yun; Deng, Naiyang

    2014-03-01

    Post-translational modification (PTM) is the chemical modification of a protein after its translation and one of the later steps in protein biosynthesis for many proteins. It plays an important role in modifying the end product of gene expression and contributes to biological processes and diseased conditions. However, the experimental methods for identifying PTM sites are both costly and time-consuming. Hence computational methods are highly desired. In this work, a novel encoding method PSPM (position-specific propensity matrices) is developed. Then a support vector machine (SVM) with the kernel matrix computed by PSPM is applied to predict the PTM sites. The experimental results indicate that the performance of the new method is better than or comparable to that of the existing methods. Therefore, the new method is a useful computational resource for the identification of PTM sites. A unified standalone software PTMPred is developed. It can be used to predict all types of PTM sites if the user provides the training datasets. The software can be freely downloaded from http://www.aporc.org/doc/wiki/PTMPred. PMID:24291233

  12. Hybridization properties of long nucleic acid probes for detection of variable target sequences, and development of a hybridization prediction algorithm

    PubMed Central

    Öhrmalm, Christina; Jobs, Magnus; Eriksson, Ronnie; Golbob, Sultan; Elfaitouri, Amal; Benachenhou, Farid; Strømme, Maria; Blomberg, Jonas

    2010-01-01

    One of the main problems in nucleic acid-based techniques for detection of infectious agents, such as influenza viruses, is that of nucleic acid sequence variation. DNA probes, 70-nt long, some including the nucleotide analog deoxyribose-Inosine (dInosine), were analyzed for hybridization tolerance to different amounts and distributions of mismatching bases, e.g. synonymous mutations, in target DNA. Microsphere-linked 70-mer probes were hybridized in 3M TMAC buffer to biotinylated single-stranded (ss) DNA for subsequent analysis in a Luminex® system. When mismatches interrupted contiguous matching stretches of 6 nt or longer, it had a strong impact on hybridization. Contiguous matching stretches are more important than the same number of matching nucleotides separated by mismatches into several regions. dInosine, but not 5-nitroindole, substitutions at mismatching positions stabilized hybridization remarkably well, comparable to N (4-fold) wobbles in the same positions. In contrast to shorter probes, 70-nt probes with judiciously placed dInosine substitutions and/or wobble positions were remarkably mismatch tolerant, with preserved specificity. An algorithm, NucZip, was constructed to model the nucleation and zipping phases of hybridization, integrating both local and distant binding contributions. It predicted hybridization more exactly than previous algorithms, and has the potential to guide the design of variation-tolerant yet specific probes. PMID:20864443

  13. Immunoreactivity of polyclonal antibodies generated against the carboxy terminus of the predicted amino acid sequence of the Huntington disease gene

    SciTech Connect

    Alkatib, G.; Graham, R.; Pelmear-Telenius, A.

    1994-09-01

    A cDNA fragment spanning the 3′-end of the Huntington disease gene (from 8052 to 9252) was cloned into a prokaryotic expression vector containing the E. coli lac promoter and a portion of the coding sequence for β-galactosidase. The truncated β-galactosidase gene was cleaved with BamHI and fused in frame to the BamHI fragment of the Huntington disease gene 3′-end. Expression analysis of proteins made in E. coli revealed that 20-30% of the total cellular protein was represented by the β-galactosidase-huntingtin fusion protein. The identity of the Huntington disease protein amino acid sequences was confirmed by protein sequence analysis. Affinity chromatography was used to purify large quantities of the fusion protein from bacterial cell lysates. Affinity-purified proteins were used to immunize New Zealand white rabbits for antibody production. The generated polyclonal antibodies were used to immunoprecipitate the Huntington disease gene product expressed in a neuroblastoma cell line. In this cell line the antibodies precipitated two protein bands with apparent gel migrations of 200 and 150 kd, which together correspond to the calculated molecular weight of the Huntington disease gene product (350 kd). Immunoblotting experiments revealed the presence of a large precursor protein in the range of 350-750 kd, which is in agreement with the predicted molecular weight of the protein without post-translational modifications. These results indicate that the huntingtin protein is cleaved into two subunits in this neuroblastoma cell line and imply that cleavage of a large precursor protein may contribute to its biological activity. Experiments are ongoing to determine the precursor-product relationship, to examine the synthesis of the huntingtin protein in freshly isolated rat brains, and to determine the cellular and subcellular distribution of the gene product.

  14. Predicting Secretory Proteins of Malaria Parasite by Incorporating Sequence Evolution Information into Pseudo Amino Acid Composition via Grey System Model

    PubMed Central

    Lin, Wei-Zhong; Fang, Jian-An; Xiao, Xuan; Chou, Kuo-Chen

    2012-01-01

    Malaria has become a cause of poverty and a major hindrance to economic development. The culprit of the disease is the parasite, which secretes an array of proteins within the host erythrocyte to facilitate its own survival. Accordingly, the secretory proteins of the malaria parasite have become a logical target for drug design against malaria. Unfortunately, with the increasing resistance to the drugs thus developed, the situation has become more complicated. To cope with the drug resistance problem, one strategy is to identify in a timely manner the proteins secreted by the malaria parasite, which can serve as potential drug targets. However, it is both expensive and time-consuming to identify the secretory proteins of the malaria parasite by experiments alone. To expedite the process of developing effective drugs against malaria, a computational predictor called “iSMP-Grey” was developed that can be used to identify the secretory proteins of the malaria parasite based on the protein sequence information alone. During the prediction process a protein sample was formulated with a 60D (dimensional) feature vector formed by incorporating the sequence evolution information into the general form of PseAAC (pseudo amino acid composition) via a grey system model, which is particularly useful for solving complicated problems that lack sufficient information or need to process uncertain information. It was observed by the jackknife test that iSMP-Grey achieved an overall success rate of 94.8%, remarkably higher than those of the existing predictors in this area. As a user-friendly web-server, iSMP-Grey is freely accessible to the public at http://www.jci-bioinfo.cn/iSMP-Grey. Moreover, for the convenience of most experimental scientists, a step-by-step guide is provided on how to use the web-server to get the desired results without the need to follow the complicated mathematical equations involved in this paper. PMID:23189138

  15. Nucleotide and predicted amino acid sequence of a cDNA clone encoding part of human transketolase.

    PubMed

    Abedinia, M; Layfield, R; Jones, S M; Nixon, P F; Mattick, J S

    1992-03-31

    Transketolase is a key enzyme in the pentose-phosphate pathway which has been implicated in the latent human genetic disease, Wernicke-Korsakoff syndrome. Here we report the cloning and partial characterisation of the coding sequences encoding human transketolase from a human brain cDNA library. The library was screened with oligonucleotide probes based on the amino acid sequence of proteolytic fragments of the purified protein. Northern blots showed that the transketolase mRNA is approximately 2.2 kb, close to the minimum expected, of which approximately 60% was represented in the largest cDNA clone. Sequence analysis of the transketolase coding sequences reveals a number of homologies with related enzymes from other species. PMID:1567394

  16. A novel T-cell-defined HLA-DR polymorphism not predicted from the linear amino acid sequence.

    PubMed

    Termijtelen, A; van den Elsen, P; Koning, F; de Koster, S; Schroeijers, W; Vanderkerckhove, B

    1989-09-01

    Recent investigations have shown that alloreactive T cells are capable of responding to structures defined by specific linear amino acid sequences on class II molecules. In the present study we show that also a polymorphism can be recognized that is not defined by such linear amino acid sequences. Two human T-cell clones, sensitized to DRw13 haplotypes, are described. The description of clone c50 serves to exemplify the first model. This DRB1-specific clone responds to stimulator cells that carry DR molecules, different in their DRB1 first and second hypervariable regions (HV1 and HV2) but identical in their HV3 regions (i.e., DRw13,Dw18; DRw13,Dw19; DR4,Dw10; and DRw11,LDVII). The second clone, c1443, behaves nonconventionally. It responds to DRw13,Dw18; DRw13,Dw19; and DR4,Dw4 stimulator cells, although no specific amino acid sequence is shared between these specificities. The latter pattern of reactivity suggests the existence of a novel polymorphism recognized by alloreactive T cells. This particular polymorphism may also be biologically significant. PMID:2476425

  17. Amino acid sequence of the ligand-binding domain of the aryl hydrocarbon receptor 1 predicts sensitivity of wild birds to effects of dioxin-like compounds.

    PubMed

    Farmahin, Reza; Manning, Gillian E; Crump, Doug; Wu, Dongmei; Mundy, Lukas J; Jones, Stephanie P; Hahn, Mark E; Karchner, Sibel I; Giesy, John P; Bursian, Steven J; Zwiernik, Matthew J; Fredricks, Timothy B; Kennedy, Sean W

    2013-01-01

    The sensitivity of avian species to the toxic effects of dioxin-like compounds (DLCs) varies up to 1000-fold among species, and this variability has been associated with interspecies differences in aryl hydrocarbon receptor 1 ligand-binding domain (AHR1 LBD) sequence. We previously showed that LD50 values, based on in ovo exposures to DLCs, were significantly correlated with in vitro EC50 values obtained with a luciferase reporter gene (LRG) assay that measures AHR1-mediated induction of cytochrome P4501A in COS-7 cells transfected with avian AHR1 constructs. Those findings suggest that the AHR1 LBD sequence and the LRG assay can be used to predict avian species sensitivity to DLCs. In the present study, the AHR1 LBD sequences of 86 avian species were studied, and differences at amino acid sites 256, 257, 297, 324, 337, and 380 were identified. Site-directed mutagenesis, the LRG assay, and homology modeling highlighted the importance of each amino acid site in AHR1 sensitivity to 2,3,7,8-tetrachlorodibenzo-p-dioxin and other DLCs. The results of the study revealed that (1) only amino acids at sites 324 and 380 affect the sensitivity of AHR1 expression constructs of the 86 avian species to DLCs and (2) in vitro luciferase activity of AHR1 constructs containing only the LBD of the species of interest is significantly correlated (r² = 0.93, p < 0.0001) with in ovo toxicity data for those species. These results indicate promise for the use of AHR1 LBD amino acid sequences independently, or combined with the LRG assay, to predict avian species sensitivity to DLCs. PMID:22923492

  18. Prediction of DNA-binding residues in proteins from amino acid sequences using a random forest model with a hybrid feature

    PubMed Central

    Wu, Jiansheng; Liu, Hongde; Duan, Xueye; Ding, Yan; Wu, Hongtao; Bai, Yunfei; Sun, Xiao

    2009-01-01

    Motivation: In this work, we aim to develop a computational approach for predicting DNA-binding sites in proteins from amino acid sequences. To avoid overfitting with this method, all available DNA-binding proteins from the Protein Data Bank (PDB) are used to construct the models. The random forest (RF) algorithm is used because it is fast and has robust performance for different parameter values. A novel hybrid feature is presented which incorporates evolutionary information of the amino acid sequence, secondary structure (SS) information and orthogonal binary vector (OBV) information which reflects the characteristics of 20 kinds of amino acids for two physical–chemical properties (dipoles and volumes of the side chains). The numbers of binding and non-binding residues in proteins are highly unbalanced, so a novel scheme is proposed to deal with the problem of imbalanced datasets by downsizing the majority class. Results: The results show that the RF model achieves 91.41% overall accuracy with a Matthews correlation coefficient of 0.70 and an area under the receiver operating characteristic curve (AUC) of 0.913. To our knowledge, the RF method using the hybrid feature is currently the computationally optimal approach for predicting DNA-binding sites in proteins from amino acid sequences without using three-dimensional (3D) structural information. We have demonstrated that the prediction results are useful for understanding protein–DNA interactions. Availability: DBindR web server implementation is freely available at http://www.cbi.seu.edu.cn/DBindR/DBindR.htm. Contact: xsun@seu.edu.cn Supplementary information: Supplementary data are available at Bioinformatics online. PMID:19008251
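
    The general idea of downsizing the majority class can be sketched with plain random undersampling. The abstract does not describe the authors' scheme in detail, so the code below is a generic stand-in under that assumption; the function name, the seed parameter, and the choice of uniform random sampling are all illustrative.

```python
import random


def undersample_majority(samples, labels, seed=0):
    """Randomly shrink every class to the size of the smallest class.

    Returns a shuffled list of (sample, label) pairs. Uniform random
    sampling is an assumption, not the scheme used in the paper.
    """
    rng = random.Random(seed)
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)

    target = min(len(group) for group in by_label.values())
    balanced = []
    for label, group in by_label.items():
        # Keep the smallest class whole; sample the others down to its size.
        kept = group if len(group) == target else rng.sample(group, target)
        balanced.extend((sample, label) for sample in kept)

    rng.shuffle(balanced)
    return balanced
```

    With, for example, 10 binding and 90 non-binding residues, this returns 20 training pairs with equal class counts, so a classifier trained on them is not rewarded for predicting "non-binding" everywhere.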

  19. Composition for nucleic acid sequencing

    DOEpatents

    Korlach, Jonas; Webb, Watt W.; Levene, Michael; Turner, Stephen; Craighead, Harold G.; Foquet, Mathieu

    2008-08-26

    The present invention is directed to a method of sequencing a target nucleic acid molecule having a plurality of bases. In its principle, the temporal order of base additions during the polymerization reaction is measured on a molecule of nucleic acid, i.e. the activity of a nucleic acid polymerizing enzyme on the template nucleic acid molecule to be sequenced is followed in real time. The sequence is deduced by identifying which base is being incorporated into the growing complementary strand of the target nucleic acid by the catalytic activity of the nucleic acid polymerizing enzyme at each step in the sequence of base additions. A polymerase on the target nucleic acid molecule complex is provided in a position suitable to move along the target nucleic acid molecule and extend the oligonucleotide primer at an active site. A plurality of labelled types of nucleotide analogs are provided proximate to the active site, with each distinguishable type of nucleotide analog being complementary to a different nucleotide in the target nucleic acid sequence. The growing nucleic acid strand is extended by using the polymerase to add a nucleotide analog to the nucleic acid strand at the active site, where the nucleotide analog being added is complementary to the nucleotide of the target nucleic acid at the active site. The nucleotide analog added to the oligonucleotide primer as a result of the polymerizing step is identified. The steps of providing labelled nucleotide analogs, polymerizing the growing nucleic acid strand, and identifying the added nucleotide analog are repeated so that the nucleic acid strand is further extended and the sequence of the target nucleic acid is determined.

  20. Identification of G and P genotype-specific motifs in the predicted VP7 and VP4 amino acid sequences.

    PubMed

    Ma, Yongping

    2015-12-01

    Equine rotavirus (ERV) strain L338 (G13P[18]) has a unique G and P genotype. However, the evolutionary relationship of L338 with other ERVs is still unknown. Here whole genome analysis of the L338 ERV strain was independently performed. Its genotype constellations were determined as G13-P[18]-I6-R9-C9-M6-A6-N9-T12-E14-H11, confirming previous genotype assignments. The L338 strain only shared the P[18] and I6 genotypes with other ERVs. The nucleotide sequences of the other 9 RNA segments were different from those of cognate genes of all other group A rotavirus (RVA) strains including ERVs and formed unique phylogenetic lineages. The L338 evolutionary footprints were tentatively identified in both VP7 and VP4 amino acid sequences: two regions were found in VP7 and twelve in VP4. The conserved regions shared between L338 and other group A rotavirus strains (RVAs) indicated that L338 was more closely related genomically to animal and human RVAs other than ERVs, suggesting that L338 may not be an endogenous equine RV but may have emerged as an interspecies reassortant with other RVA strains. Furthermore, genotype-specific motifs of all 27 G and 37 P types were identified in regions 7-1a (aa 91-100) of VP7 and regions 8-1 (aa 146-151) and 8-3 (aa 113-118 and 125-135) of VP4 (VP8*). PMID:26321159

  1. Characterization of cDNA clones for human myeloperoxidase: predicted amino acid sequence and evidence for multiple mRNA species.

    PubMed Central

    Johnson, K R; Nauseef, W M; Care, A; Wheelock, M J; Shane, S; Hudson, S; Koeffler, H P; Selsted, M; Miller, C; Rovera, G

    1987-01-01

    Myeloperoxidase is a component of the microbicidal network of polymorphonuclear leukocytes. The enzyme is a tetramer consisting of two heavy and two light subunits. A large proportion of humans demonstrate genetic deficiencies in the production of myeloperoxidase. As a first step in analyzing these deficiencies in more detail, we have isolated cDNA clones for myeloperoxidase from an expression library of the HL-60 human promyelocytic leukemia cell line. Two overlapping plasmids (pMP02 and pMP062) were identified as myeloperoxidase cDNA clones based on the detection with myeloperoxidase antiserum of a 70 kDa protein expressed in pMP02-containing bacteria and a 75 kDa polypeptide produced by hybridization selection and translation using pMP062 and HL-60 RNA. Formal identification of the clones was made by matching the predicted amino acid sequences with the amino terminal sequences of the heavy and light subunits. Both subunits are encoded by one mRNA in the following order: pre-pro-sequences--light subunit--heavy subunit. The molecular weight of the predicted primary translation product is 83.7 kDa. Northern blots reveal two size classes of hybridizing RNAs (approximately 3.0-3.3 and 3.5-4.0 kilobases) whose expression is restricted to cells of the granulocytic lineage and parallels the changes in enzymatic activity observed during differentiation. PMID:3031585

  2. Alloantibody Responses After Renal Transplant Failure Can Be Better Predicted by Donor-Recipient HLA Amino Acid Sequence and Physicochemical Disparities Than Conventional HLA Matching.

    PubMed

    Kosmoliaptsis, V; Mallon, D H; Chen, Y; Bolton, E M; Bradley, J A; Taylor, C J

    2016-07-01

    We have assessed whether HLA immunogenicity as defined by differences in donor-recipient HLA amino-acid sequence (amino-acid mismatch score, AMS; and eplet mismatch score, EpMS) and physicochemical properties (electrostatic mismatch score, EMS) enables prediction of allosensitization to HLA, and also prediction of the risk of an individual donor-recipient HLA mismatch to induce donor-specific antibody (DSA). HLA antibody screening was undertaken using single-antigen beads in 131 kidney transplant recipients returning to the transplant waiting list following first graft failure. The effect of AMS, EpMS, and EMS on the development of allosensitization (calculated reaction frequency [cRF]) and DSA was determined. Multivariate analyses, adjusting for time on the waiting list, maintenance on immunosuppression after transplant failure, and graft nephrectomy, showed that AMS (odds ratio [OR]: 1.44 per 10 units, 95% CI: 1.02-2.10, p = 0.04) and EMS (OR: 1.27 per 10 units, 95% CI: 1.02-1.62, p = 0.04) were independently associated with the risk of developing sensitization to HLA (cRF > 15%). AMS, EpMS, and EMS were independently associated with the development of HLA-DR and HLA-DQ DSA, but only EMS correlated with the risk of HLA-A and -B DSA development. Differences in donor-recipient HLA amino-acid sequence and physicochemical properties enable better assessment of the risk of HLA-specific sensitization than conventional HLA matching. PMID:26755448
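
    The odds ratios above are quoted per 10 units of mismatch score. As a minimal sketch, and assuming the usual log-linear form of a logistic model (the abstract does not state the model form), an odds ratio can be rescaled to another increment by raising it to the ratio of the two increments. The function name below is hypothetical and chosen only for illustration.

```python
def rescale_odds_ratio(or_per_step, step, units):
    """Rescale an odds ratio reported per `step` units to `units` units.

    Assumes the log-odds are linear in the score, so the odds ratio
    scales as a power of the increment ratio.
    """
    return or_per_step ** (units / step)


# AMS odds ratio from the abstract: 1.44 per 10 units.
or_ams_per_unit = rescale_odds_ratio(1.44, 10, 1)
or_ams_per_20 = rescale_odds_ratio(1.44, 10, 20)
print(f"AMS OR per 1 unit: {or_ams_per_unit:.4f}")
print(f"AMS OR per 20 units: {or_ams_per_20:.4f}")
```

    The confidence-interval bounds could be rescaled the same way under the same assumption.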

  3. High speed nucleic acid sequencing

    DOEpatents

    Korlach, Jonas; Webb, Watt W.; Levene, Michael; Turner, Stephen; Craighead, Harold G.; Foquet, Mathieu

    2011-05-17

    The present invention is directed to a method of sequencing a target nucleic acid molecule having a plurality of bases. In its principle, the temporal order of base additions during the polymerization reaction is measured on a molecule of nucleic acid. Each type of labeled nucleotide comprises an acceptor fluorophore attached to a phosphate portion of the nucleotide such that the fluorophore is removed upon incorporation into a growing strand. Fluorescent signal is emitted via fluorescent resonance energy transfer between the donor fluorophore and the acceptor fluorophore as each nucleotide is incorporated into the growing strand. The sequence is deduced by identifying which base is being incorporated into the growing strand.

  4. Amino acid sequence of the AhR1 ligand-binding domain predicts avian sensitivity to dioxin like compounds: in vivo verification in European starlings.

    PubMed

    Eng, Margaret L; Elliott, John E; Jones, Stephanie P; Williams, Tony D; Drouillard, Ken G; Kennedy, Sean W

    2014-12-01

    Research has demonstrated that the sensitivity of avian species to the embryotoxic effects of dioxin-like compounds can be predicted by the amino acid identities at two key sites within the ligand-binding domain of the aryl hydrocarbon receptor 1 (AhR1). The domestic chicken (Gallus gallus domesticus) has been established as a highly sensitive species to the toxic effects of dioxin-like compounds. Results from genotyping and in vitro assays predict that the European starling (Sturnus vulgaris) is also highly sensitive to dioxin-like compound toxicity. The objective of the present study was to test that prediction in vivo. To do this, we used egg injections in field-nesting starlings with 3,3',4,4',5-pentachlorobiphenyl (PCB-126), a dioxin-like polychlorinated biphenyl. Eggs were dosed with either the vehicle control or 1 of 5 doses (1.4, 7.1, 15.9, 32.1, and 52.9 ng PCB-126/g egg). A dose-dependent increase in embryo mortality occurred, and the median lethal dose (LD50; 95% confidence interval [CI]) was 5.61 (2.33-9.08) ng/g. Hepatic CYP1A4/5 messenger RNA (mRNA) expression in hatchlings also increased in a dose-dependent manner, with CYP1A4 being more induced than CYP1A5. No effect of dose on morphological measures was seen, and we did not observe any overt malformations. These results indicate that, other than the chicken, the European starling is the most sensitive species to the effects of PCB-126 on avian embryo mortality reported to date, which supports the prediction of relative sensitivity to dioxin-like compounds based on amino acid sequence of the AhR1. PMID:25209921

  5. Predicting the molecular complexity of sequencing libraries.

    PubMed

    Daley, Timothy; Smith, Andrew D

    2013-04-01

    Predicting the molecular complexity of a genomic sequencing library is a critical but difficult problem in modern sequencing applications. Methods to determine how deeply to sequence to achieve complete coverage or to predict the benefits of additional sequencing are lacking. We introduce an empirical Bayesian method to accurately characterize the molecular complexity of a DNA sample for almost any sequencing application on the basis of limited preliminary sequencing. PMID:23435259

  6. Discrete sequence prediction and its applications

    NASA Technical Reports Server (NTRS)

    Laird, Philip

    1992-01-01

    Learning from experience to predict sequences of discrete symbols is a fundamental problem in machine learning with many applications. We apply sequence prediction using a simple and practical sequence-prediction algorithm, called TDAG. The TDAG algorithm is first tested by comparing its performance with some common data compression algorithms. Then it is adapted to the detailed requirements of dynamic program optimization, with excellent results.

  7. Chip-based sequencing nucleic acids

    DOEpatents

    Beer, Neil Reginald

    2014-08-26

    A system for fast DNA sequencing by amplification of genetic material within microreactors, denaturing, demulsifying, and then sequencing the material, while retaining it in a PCR/sequencing zone by a magnetic field. One embodiment includes sequencing nucleic acids on a microchip that includes a microchannel flow channel in the microchip. The nucleic acids are isolated and hybridized to magnetic nanoparticles or to magnetic polystyrene-coated beads. Microreactor droplets are formed in the microchannel flow channel. The microreactor droplets containing the nucleic acids and the magnetic nanoparticles are retained in a magnetic trap in the microchannel flow channel and sequenced.

  8. Distinguishing Proteins From Arbitrary Amino Acid Sequences

    PubMed Central

    Yau, Stephen S.-T.; Mao, Wei-Guang; Benson, Max; He, Rong Lucy

    2015-01-01

    What kinds of amino acid sequences could possibly be protein sequences? From all existing databases that we can find, known proteins are only a small fraction of all possible combinations of amino acids. Beginning with Sanger's first detailed determination of a protein sequence in 1952, previous studies have focused on describing the structure of existing protein sequences in order to construct the protein universe. No one, however, has developed a criterion for determining whether an arbitrary amino acid sequence can be a protein. Here we show that when the collection of arbitrary amino acid sequences is viewed in an appropriate geometric context, the protein sequences cluster together. This leads to a new computational test, described here, that has proved to be remarkably accurate at determining whether an arbitrary amino acid sequence can be a protein. Even more, if the results of this test indicate that the sequence can be a protein, and it is indeed a protein sequence, then its identity as a protein sequence is uniquely defined. We anticipate our computational test will be useful for those who are attempting to complete the job of discovering all proteins, or constructing the protein universe. PMID:25609314

  9. Method for sequencing nucleic acid molecules

    DOEpatents

    Korlach, Jonas; Webb, Watt W.; Levene, Michael; Turner, Stephen; Craighead, Harold G.; Foquet, Mathieu

    2006-05-30

    The present invention is directed to a method of sequencing a target nucleic acid molecule having a plurality of bases. In its principle, the temporal order of base additions during the polymerization reaction is measured on a molecule of nucleic acid, i.e. the activity of a nucleic acid polymerizing enzyme on the template nucleic acid molecule to be sequenced is followed in real time. The sequence is deduced by identifying which base is being incorporated into the growing complementary strand of the target nucleic acid by the catalytic activity of the nucleic acid polymerizing enzyme at each step in the sequence of base additions. A polymerase on the target nucleic acid molecule complex is provided in a position suitable to move along the target nucleic acid molecule and extend the oligonucleotide primer at an active site. A plurality of labelled types of nucleotide analogs are provided proximate to the active site, with each distinguishable type of nucleotide analog being complementary to a different nucleotide in the target nucleic acid sequence. The growing nucleic acid strand is extended by using the polymerase to add a nucleotide analog to the nucleic acid strand at the active site, where the nucleotide analog being added is complementary to the nucleotide of the target nucleic acid at the active site. The nucleotide analog added to the oligonucleotide primer as a result of the polymerizing step is identified. The steps of providing labelled nucleotide analogs, polymerizing the growing nucleic acid strand, and identifying the added nucleotide analog are repeated so that the nucleic acid strand is further extended and the sequence of the target nucleic acid is determined.

  10. Method for sequencing nucleic acid molecules

    DOEpatents

    Korlach, Jonas; Webb, Watt W.; Levene, Michael; Turner, Stephen; Craighead, Harold G.; Foquet, Mathieu

    2006-06-06

    The present invention is directed to a method of sequencing a target nucleic acid molecule having a plurality of bases. In its principle, the temporal order of base additions during the polymerization reaction is measured on a molecule of nucleic acid, i.e. the activity of a nucleic acid polymerizing enzyme on the template nucleic acid molecule to be sequenced is followed in real time. The sequence is deduced by identifying which base is being incorporated into the growing complementary strand of the target nucleic acid by the catalytic activity of the nucleic acid polymerizing enzyme at each step in the sequence of base additions. A polymerase on the target nucleic acid molecule complex is provided in a position suitable to move along the target nucleic acid molecule and extend the oligonucleotide primer at an active site. A plurality of labelled types of nucleotide analogs are provided proximate to the active site, with each distinguishable type of nucleotide analog being complementary to a different nucleotide in the target nucleic acid sequence. The growing nucleic acid strand is extended by using the polymerase to add a nucleotide analog to the nucleic acid strand at the active site, where the nucleotide analog being added is complementary to the nucleotide of the target nucleic acid at the active site. The nucleotide analog added to the oligonucleotide primer as a result of the polymerizing step is identified. The steps of providing labelled nucleotide analogs, polymerizing the growing nucleic acid strand, and identifying the added nucleotide analog are repeated so that the nucleic acid strand is further extended and the sequence of the target nucleic acid is determined.

  11. Protein structure prediction from sequence variation

    PubMed Central

    Marks, Debora S; Hopf, Thomas A; Sander, Chris

    2015-01-01

    Genomic sequences contain rich evolutionary information about functional constraints on macromolecules such as proteins. This information can be efficiently mined to detect evolutionary couplings between residues in proteins and address the long-standing challenge to compute protein three-dimensional structures from amino acid sequences. Substantial progress has recently been made on this problem owing to the explosive growth in available sequences and the application of global statistical methods. In addition to three-dimensional structure, the improved understanding of covariation may help identify functional residues involved in ligand binding, protein-complex formation and conformational changes. We expect computation of covariation patterns to complement experimental structural biology in elucidating the full spectrum of protein structures, their functional interactions and evolutionary dynamics. PMID:23138306

  12. Lossless Video Sequence Compression Using Adaptive Prediction

    NASA Technical Reports Server (NTRS)

    Li, Ying; Sayood, Khalid

    2007-01-01

    We present an adaptive lossless video compression algorithm based on predictive coding. The proposed algorithm exploits temporal, spatial, and spectral redundancies in a backward adaptive fashion with extremely low side information. The computational complexity is further reduced by using a caching strategy. We also study the relationship between the operational domain for the coder (wavelet or spatial) and the amount of temporal and spatial redundancy in the sequence being encoded. Experimental results show that the proposed scheme provides significant improvements in compression efficiencies.

  13. Analysis and Annotation of Nucleic Acid Sequence

    SciTech Connect

    States, David J.

    2004-07-28

    The aims of this project were to develop improved methods for computational genome annotation and to apply these methods to improve the annotation of genomic sequence data with a specific focus on human genome sequencing. The project resulted in a substantial body of published work. Notable contributions of this project were the identification of basecalling and lane tracking as error processes in genome sequencing and contributions to improved methods for these steps in genome sequencing. This technology improved the accuracy and throughput of genome sequence analysis. Probabilistic methods for physical map construction were developed. Improved methods for sequence alignment, alternative splicing analysis, promoter identification and NF kappa B response gene prediction were also developed.

  14. Analysis and Annotation of Nucleic Acid Sequence

    SciTech Connect

    David J. States

    1998-08-01

    The aims of this project were to develop improved methods for computational genome annotation and to apply these methods to improve the annotation of genomic sequence data with a specific focus on human genome sequencing. The project resulted in a substantial body of published work. Notable contributions of this project were the identification of basecalling and lane tracking as error processes in genome sequencing and contributions to improved methods for these steps in genome sequencing. This technology improved the accuracy and throughput of genome sequence analysis. Probabilistic methods for physical map construction were developed. Improved methods for sequence alignment, alternative splicing analysis, promoter identification and NF kappa B response gene prediction were also developed.

  15. Sequence polymorphism of the predicted human metapneumovirus G glycoprotein.

    PubMed

    Peret, Teresa C T; Abed, Yacine; Anderson, Larry J; Erdman, Dean D; Boivin, Guy

    2004-03-01

    The putative G glycoprotein genes of 25 human metapneumovirus (hMPV) field isolates obtained during five consecutive epidemic seasons (1997 to 2002) were sequenced. Sequence alignments identified two major genetic groups, designated groups 1 and 2, and two minor genetic clusters within each major group, designated subgroups A and B. Extensive nucleotide and deduced amino acid sequence variability was observed, consisting of high rates of nucleotide substitutions, use of alternative transcription-termination codons and insertions that retained the reading frame. Deduced amino acid sequences showed the greatest variability, with most differences located in the extracellular domain of the protein: nucleotide and amino acid sequence identities for the entire open reading frame ranged from 52 to 58% and 31 to 35%, respectively, between the two major groups. Like the closely related avian pneumovirus and human and bovine respiratory syncytial viruses, the predicted G protein of hMPV shared the basic features of a type II mucin-like glycosylated protein. However, differences from these related viruses were also observed, e.g. lack of conserved cysteine clusters as seen in human respiratory syncytial virus and avian pneumovirus. The displacement of genetic groups of hMPV observed during the study period suggests that potential antigenic differences in the G glycoprotein, which have evolved in response to immune-mediated pressure, may influence the circulation patterns of hMPV strains. PMID:14993653
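
    Sequence identities of the kind quoted above are computed from aligned sequences. The following is a minimal sketch, not the authors' procedure: it assumes two pre-aligned strings of equal length with "-" as the gap character, and it counts every column in which at least one sequence has a residue (one of several conventions for the denominator). The function name is hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity between two pre-aligned sequences.

    Columns where both sequences have a gap are ignored; a column with a
    gap in only one sequence counts as a mismatch (an assumed convention).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap and y == gap:
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Toy aligned fragments (hypothetical, for illustration only).
print(percent_identity("ACGT", "ACGA"))
print(percent_identity("MK-LV", "MKTLV"))
```

    The same function works for nucleotide and amino acid alignments, since it only compares characters column by column.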

  16. Predicting protein-protein interactions based only on sequences information.

    PubMed

    Shen, Juwen; Zhang, Jian; Luo, Xiaomin; Zhu, Weiliang; Yu, Kunqian; Chen, Kaixian; Li, Yixue; Jiang, Hualiang

    2007-03-13

    Protein-protein interactions (PPIs) are central to most biological processes. Although efforts have been devoted to the development of methodology for predicting PPIs and protein interaction networks, the application of most existing methods is limited because they need information about protein homology or the interaction marks of the protein partners. In the present work, we propose a method for PPI prediction using only the information of protein sequences. This method was developed based on a learning algorithm, the support vector machine, combined with a kernel function and a conjoint triad feature for describing amino acids. More than 16,000 diverse PPI pairs were used to construct the universal model. The prediction ability of our approach is better than that of other sequence-based PPI prediction methods because it is able to predict PPI networks. Different types of PPI networks have been effectively mapped with our method, suggesting that, even with only sequence information, this method could be applied to the exploration of networks for any newly discovered protein with unknown biological relativity. In addition, supplementary experimental information can enhance the prediction ability of the method. PMID:17360525
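
    The conjoint triad feature mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the seven residue classes are the grouping commonly cited for this feature (the abstract does not list them), the min-max scaling is one plausible normalisation, and all function names are hypothetical. Each protein becomes a 343-element vector (7 x 7 x 7 class triads), and a protein pair is represented by concatenating the two vectors.

```python
# Seven residue classes (assumed grouping; not given in the abstract).
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
RESIDUE_TO_CLASS = {aa: i for i, group in enumerate(CLASSES) for aa in group}


def conjoint_triad(seq):
    """Return a 343-element vector of scaled class-triad frequencies."""
    counts = [0] * (7 ** 3)
    classes = [RESIDUE_TO_CLASS.get(aa) for aa in seq.upper()]
    # Slide a window of three consecutive residues along the sequence.
    for a, b, c in zip(classes, classes[1:], classes[2:]):
        if a is None or b is None or c is None:
            continue  # skip triads containing non-standard residues
        counts[a * 49 + b * 7 + c] += 1
    lo, hi = min(counts), max(counts)
    if hi == lo:
        return [0.0] * len(counts)
    return [(n - lo) / (hi - lo) for n in counts]


def pair_features(seq_a, seq_b):
    """Concatenate the two triad vectors to describe a protein pair."""
    return conjoint_triad(seq_a) + conjoint_triad(seq_b)


features = pair_features("MKRDEAGVILFP", "CYMTSHNQWRK")  # toy sequences
print(len(features))
```

    A vector of this kind could then be passed to any support vector machine implementation; the kernel used in the original work is not specified in the abstract and is not reproduced here.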

  17. Phenolic acid esterases, coding sequences and methods

    DOEpatents

    Blum, David L.; Kataeva, Irina; Li, Xin-Liang; Ljungdahl, Lars G.

    2002-01-01

    Described herein are four phenolic acid esterases, three of which correspond to domains of previously unknown function within bacterial xylanases, from XynY and XynZ of Clostridium thermocellum and from a xylanase of Ruminococcus. The fourth specifically exemplified xylanase is a protein encoded within the genome of Orpinomyces PC-2. The amino acids of these polypeptides and nucleotide sequences encoding them are provided. Recombinant host cells, expression vectors and methods for the recombinant production of phenolic acid esterases are also provided.

  18. The predictive capacity of personal genome sequencing.

    PubMed

    Roberts, Nicholas J; Vogelstein, Joshua T; Parmigiani, Giovanni; Kinzler, Kenneth W; Vogelstein, Bert; Velculescu, Victor E

    2012-05-01

    New DNA sequencing methods will soon make it possible to identify all germline variants in any individual at a reasonable cost. However, the ability of whole-genome sequencing to predict predisposition to common diseases in the general population is unknown. To estimate this predictive capacity, we use the concept of a "genometype." A specific genometype represents the genomes in the population conferring a specific level of genetic risk for a specified disease. Using this concept, we estimated the maximum capacity of whole-genome sequencing to identify individuals at clinically significant risk for 24 different diseases. Our estimates were derived from the analysis of large numbers of monozygotic twin pairs; twins of a pair share the same genometype and therefore identical genetic risk factors. Our analyses indicate that (i) for 23 of the 24 diseases, most of the individuals will receive negative test results; (ii) these negative test results will, in general, not be very informative, because the risk of developing 19 of the 24 diseases in those who test negative will still be, at minimum, 50 to 80% of that in the general population; and (iii) on the positive side, in the best-case scenario, more than 90% of tested individuals might be alerted to a clinically significant predisposition to at least one disease. These results have important implications for the valuation of genetic testing by industry, health insurance companies, public policy-makers, and consumers. PMID:22472521

  19. Amino-Acid Sequence of Porcine Pepsin

    PubMed Central

    Tang, J.; Sepulveda, P.; Marciniszyn, J.; Chen, K. C. S.; Huang, W-Y.; Tao, N.; Liu, D.; Lanier, J. P.

    1973-01-01

    As the culmination of several years of experiments, we propose a complete amino-acid sequence for porcine pepsin, an enzyme containing 327 amino-acid residues in a single polypeptide chain. In the sequence determination, the enzyme was treated with cyanogen bromide. Five resulting fragments were purified. The amino-acid sequence of four of the fragments accounted for 290 residues. Because the structure of a 37-residue carboxyl-terminal fragment was already known, it was not studied. The alignment of these fragments was determined from the sequence of methionyl-peptides we had previously reported. We also discovered the locations of active-site aspartyl residues, as well as the pairing of the three disulfide bridges. A minor component of commercial crystalline pepsin was found to contain two extra amino-acid residues, Ala-Leu-, at the amino-terminus of the molecule. This minor component was apparently derived from a different site of cleavage during the activation of porcine pepsinogen. PMID:4587252
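
    The fragmentation step above can be illustrated in silico. Cyanogen bromide cleaves a polypeptide on the carboxyl side of methionine residues (standard chemistry, not spelled out in the abstract), so a chain with four methionines yields five fragments. The sketch below uses a short hypothetical peptide, not the pepsin sequence, and the function name is illustrative only.

```python
def cnbr_fragments(seq):
    """Split a protein sequence after every methionine (M)."""
    fragments = []
    start = 0
    for i, aa in enumerate(seq):
        if aa == "M":
            fragments.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        fragments.append(seq[start:])  # carboxyl-terminal fragment
    return fragments


toy = "GAMKLLMSTMPPQMAL"  # hypothetical peptide with four methionines
frags = cnbr_fragments(toy)
print(frags)
print(sum(len(f) for f in frags) == len(toy))  # fragments account for every residue
```

    The same bookkeeping applies to the figures in the abstract: the four sequenced fragments (290 residues) plus the known 37-residue carboxyl-terminal fragment account for all 327 residues.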

  20. Method for identifying and quantifying nucleic acid sequence aberrations

    DOEpatents

    Lucas, Joe N.; Straume, Tore; Bogen, Kenneth T.

    1998-01-01

    A method for detecting nucleic acid sequence aberrations by detecting nucleic acid sequences having both a first and a second nucleic acid sequence type, the presence of the first and second sequence type on the same nucleic acid sequence indicating the presence of a nucleic acid sequence aberration. The method uses a first hybridization probe which includes a nucleic acid sequence that is complementary to a first sequence type and a first complexing agent capable of attaching to a second complexing agent and a second hybridization probe which includes a nucleic acid sequence that selectively hybridizes to the second nucleic acid sequence type over the first sequence type and includes a detectable marker for detecting the second hybridization probe.

  1. Method for identifying and quantifying nucleic acid sequence aberrations

    DOEpatents

    Lucas, J.N.; Straume, T.; Bogen, K.T.

    1998-07-21

    A method is disclosed for detecting nucleic acid sequence aberrations by detecting nucleic acid sequences having both a first and a second nucleic acid sequence type, the presence of the first and second sequence type on the same nucleic acid sequence indicating the presence of a nucleic acid sequence aberration. The method uses a first hybridization probe which includes a nucleic acid sequence that is complementary to a first sequence type and a first complexing agent capable of attaching to a second complexing agent and a second hybridization probe which includes a nucleic acid sequence that selectively hybridizes to the second nucleic acid sequence type over the first sequence type and includes a detectable marker for detecting the second hybridization probe. 11 figs.

  2. The DynaMine webserver: predicting protein dynamics from sequence.

    PubMed

    Cilia, Elisa; Pancsa, Rita; Tompa, Peter; Lenaerts, Tom; Vranken, Wim F

    2014-07-01

    Protein dynamics are important for understanding protein function. Unfortunately, accurate protein dynamics information is difficult to obtain: here we present the DynaMine webserver, which provides predictions for the fast backbone movements of proteins directly from their amino-acid sequence. DynaMine rapidly produces a profile describing the statistical potential for such movements at residue-level resolution. The predicted values have meaning on an absolute scale and go beyond the traditional binary classification of residues as ordered or disordered, thus allowing for direct dynamics comparisons between protein regions. Through this webserver, we provide molecular biologists with an efficient and easy-to-use tool for predicting the dynamical characteristics of any protein of interest, even in the absence of experimental observations. The prediction results are visualized and can be directly downloaded. The DynaMine webserver, including instructive examples describing the meaning of the profiles, is available at http://dynamine.ibsquare.be. PMID:24728994

  3. Better prediction of functional effects for sequence variants

    PubMed Central

    2015-01-01

    Elucidating the effects of naturally occurring genetic variation is one of the major challenges for personalized health and personalized medicine. Here, we introduce SNAP2, a novel neural-network-based classifier that improves over the state-of-the-art in distinguishing between effect and neutral variants. Our method's improved performance results from screening many potentially relevant protein features and from refining our development data sets. Cross-validated on >100k experimentally annotated variants, SNAP2 significantly outperformed other methods, attaining a two-state accuracy (effect/neutral) of 83%. SNAP2 also outperformed combinations of other methods. Performance increased for human variants but much more so for other organisms. Our method's carefully calibrated reliability index informs selection of variants for experimental follow-up, with the most strongly predicted half of all effect variants predicted at over 96% accuracy. As expected, the evolutionary information from automatically generated multiple sequence alignments gave the strongest signal for the prediction. However, we also optimized our new method to perform surprisingly well even without alignments. This feature reduces prediction runtime by over two orders of magnitude, enables cross-genome comparisons, and renders our new method the best solution for the 10-20% of sequence orphans. SNAP2 is available at: https://rostlab.org/services/snap2web. Definitions used: Delta, input feature that results from computing the difference between feature scores for the native amino acid and feature scores for the variant amino acid; nsSNP, non-synonymous SNP; PMD, Protein Mutant Database; SNAP, Screening for non-acceptable polymorphisms; SNP, single nucleotide polymorphism; variant, any amino-acid-changing sequence variant. PMID:26110438

  4. Methods for analyzing nucleic acid sequences

    DOEpatents

    Korlach, Jonas; Webb, Watt W.; Levene, Michael; Turner, Stephen; Craighead, Harold G.; Foquet, Mathieu

    2011-05-17

    The present invention is directed to a method of sequencing a target nucleic acid. The method provides a complex comprising a polymerase enzyme, a target nucleic acid molecule, and a primer, wherein the complex is immobilized on a support. A fluorescent label is attached to a terminal phosphate group of the nucleotide or nucleotide analog. The growing nucleic acid strand is extended by using the polymerase to add a nucleotide analog to the nucleic acid strand. The nucleotide analog added to the oligonucleotide primer as a result of the polymerizing step is identified. The time duration of the signal from labeled nucleotides or nucleotide analogs that become incorporated is distinguished from freely diffusing labels by a longer retention in the observation volume for the nucleotides or nucleotide analogs that become incorporated than for the freely diffusing labels.

  5. Why do Sequence Signatures Predict Enzyme Mechanism? Homology versus Chemistry

    PubMed Central

    Beattie, Kirsten E.; De Ferrari, Luna; Mitchell, John B. O.

    2015-01-01

    First, we identify InterPro sequence signatures representing evolutionary relatedness and, second, signatures identifying specific chemical machinery. Thus, we predict the chemical mechanisms of enzyme-catalyzed reactions from catalytic and non-catalytic subsets of InterPro signatures. We first scanned our 249 sequences using InterProScan and then used the MACiE database to identify those amino acid residues that are important for catalysis. The sequences were mutated in silico to replace these catalytic residues with glycine and then again scanned using InterProScan. Those signature matches from the original scan that disappeared on mutation were called catalytic. Mechanism was predicted using all signatures, only the 78 “catalytic” signatures, or only the 519 “non-catalytic” signatures. The non-catalytic signatures gave indistinguishable results from those for the whole feature set, with precision of 0.991 and sensitivity of 0.970. The catalytic signatures alone gave less impressive predictivity, with precision and sensitivity of 0.791 and 0.735, respectively. These results show that our successful prediction of enzyme mechanism is mostly by homology rather than by identifying catalytic machinery. PMID:26740739
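
    The classification procedure described above (scan, mutate catalytic residues to glycine, rescan, compare) can be sketched as follows. This is an illustration only: it does not call InterProScan or MACiE, the signature identifiers and residue positions are made-up placeholders, and the function names are hypothetical. The scan results are assumed to be available as sets of signature identifiers.

```python
def mutate_to_glycine(seq, catalytic_positions):
    """Replace the residues at the given 1-based positions with glycine."""
    residues = list(seq)
    for pos in catalytic_positions:
        residues[pos - 1] = "G"
    return "".join(residues)


def split_signatures(hits_wild_type, hits_mutant):
    """Label signatures lost on mutation as catalytic, the rest as non-catalytic."""
    wild, mutant = set(hits_wild_type), set(hits_mutant)
    catalytic = wild - mutant       # matches that disappeared after mutation
    non_catalytic = wild & mutant   # matches that survived the mutation
    return catalytic, non_catalytic


# Hypothetical toy inputs standing in for real scan output.
mutant_seq = mutate_to_glycine("MKHDSE", [3, 4])
catalytic, non_catalytic = split_signatures(
    {"SIG_A", "SIG_B", "SIG_C"},  # placeholder hits for the original sequence
    {"SIG_A", "SIG_C"},           # placeholder hits for the mutated sequence
)
print(mutant_seq, sorted(catalytic), sorted(non_catalytic))
```

    In a real pipeline the two hit sets would come from running the signature scan on the original and the mutated sequence; that step is deliberately left out of this sketch.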

  6. Carbonate sequence stratigraphy, diagenesis, and porosity prediction

    SciTech Connect

    Tucker, M.E.

    1993-09-01

    Considering carbonate rocks in the context of changes of relative sea level and accommodation space enables a degree of prediction of sediment body geometry and stacking patterns and of the course of early diagenesis and evolution of porosity. During a major sea level fall and in a humid climate, the sediments of the previous highstand systems tracts (HST) and transgressive systems tracts (TST) are subjected to meteoric leaching and cementation, and karstification from the sequence boundary. Both porosity occlusion and enhancement may occur. In an arid climate, reflux dolomitization is likely to be important. TST facies are typified by marine cementation followed by burial in marine pore fluids where no significant diagenetic reactions take place until compaction begins or meteoric flushing occurs. TST facies have major reservoir potential, commonly retaining significant primary porosity into the deep burial realm. If dolomitization by circulating seawater is an important process, then it is most likely to occur during the TST, when the relative sea level rise pushes marine groundwaters through the sediments. Very porous rocks can be produced in this way if there is concomitant aragonite dissolution. During the HST, sediments may be subjected to marine cementation, but this would soon be followed by meteoric diagenesis in a humid climate or by evaporative dolomitization if the climate is arid. Many carbonate platforms consist of numerous parasequences and their diagenesis depends on their position within the sequence. Those parasequences deposited during the third-order sea level fall generally show the effects of surface-related diagenesis (supratidal dolomitization or karstification) to a much greater degree than those deposited during the third-order sea level rise. Relative sea level changes have varied through time and these have had a strong influence on the nature of sequences and parasequences, as well as on their diagenesis.

  7. Prediction of neddylation sites from protein sequences and sequence-derived properties

    PubMed Central

    2015-01-01

    Background Neddylation is a reversible post-translational modification that plays a vital role in maintaining cellular machinery. It has been shown to affect localization, binding partners and structure of target proteins. Disruption of protein neddylation has been observed in various diseases such as Alzheimer's and cancer. Therefore, understanding the neddylation mechanism and determining neddylation targets is potentially of great importance for understanding cellular processes. This study is the first attempt to predict neddylated sites from protein sequences by using several sequence and sequence-based structural features. Results We have developed a neddylation site prediction method using a support vector machine based on various sequence properties, position-specific scoring matrices, and disorder. Using 21 amino acid long lysine-centred windows, our model was able to predict neddylation sites successfully, with an average 5-fold stratified cross validation performance of 0.91, 0.91, 0.75, 0.44, 0.95 for accuracy, specificity, sensitivity, Matthews correlation coefficient and area under curve, respectively. Independent test set results validated the robustness of the reported method. Additionally, we observed that neddylation sites are commonly flexible and that positively charged amino acids are significantly present in neddylation sites. Conclusions In this study, a neddylation site prediction method was developed for the first time in the literature. Common characteristics of neddylation sites and their discriminative properties were explored for further in silico studies on neddylation. Lastly, an up-to-date neddylation dataset was provided for researchers working on post-translational modifications in the accompanying supplementary material of this article. PMID:26679222
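
    The windowing step and one of the reported metrics can be sketched roughly as follows. This is a minimal illustration, not the study's code: the function names, the padding character `X` for windows that run past the termini, and the toy inputs are all assumptions.

```python
import math

def lysine_windows(seq, width=21, pad="X"):
    """Return (position, window) pairs for every lysine (K) in seq.

    Each window is `width` residues long with the lysine in the centre.
    Windows near the termini are padded with `pad` (an assumption).
    """
    half = width // 2
    padded = pad * half + seq + pad * half
    return [(i, padded[i:i + width]) for i, aa in enumerate(seq) if aa == "K"]

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Toy usage: every window has its lysine at index 10 (the centre of 21).
windows = lysine_windows("MKAAK")
```

    Each window would then be converted to numeric features before being passed to a classifier; that step is omitted here.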

  8. Plus ça change – evolutionary sequence divergence predicts protein subcellular localization signals

    PubMed Central

    2014-01-01

    Background Protein subcellular localization is a central problem in understanding cell biology and has been the focus of intense research. In order to predict localization from amino acid sequence a myriad of features have been tried: including amino acid composition, sequence similarity, the presence of certain motifs or domains, and many others. Surprisingly, sequence conservation of sorting motifs has not yet been employed, despite its extensive use for tasks such as the prediction of transcription factor binding sites. Results Here, we flip the problem around, and present a proof of concept for the idea that the lack of sequence conservation can be a novel feature for localization prediction. We show that for yeast, mammal and plant datasets, evolutionary sequence divergence alone has significant power to identify sequences with N-terminal sorting sequences. Moreover sequence divergence is nearly as effective when computed on automatically defined ortholog sets as on hand curated ones. Unfortunately, sequence divergence did not necessarily increase classification performance when combined with some traditional sequence features such as amino acid composition. However a post-hoc analysis of the proteins in which sequence divergence changes the prediction yielded some proteins with atypical (i.e. not MPP-cleaved) matrix targeting signals as well as a few misannotations. Conclusion We report the results of the first quantitative study of the effectiveness of evolutionary sequence divergence as a feature for protein subcellular localization prediction. We show that divergence is indeed useful for prediction, but it is not trivial to improve overall accuracy simply by adding this feature to classical sequence features. Nevertheless we argue that sequence divergence is a promising feature and show anecdotal examples in which it succeeds where other features fail. PMID:24438075

  9. Predictive uncertainty in auditory sequence processing

    PubMed Central

    Hansen, Niels Chr.; Pearce, Marcus T.

    2014-01-01

    Previous studies of auditory expectation have focused on the expectedness perceived by listeners retrospectively in response to events. In contrast, this research examines predictive uncertainty—a property of listeners' prospective state of expectation prior to the onset of an event. We examine the information-theoretic concept of Shannon entropy as a model of predictive uncertainty in music cognition. This is motivated by the Statistical Learning Hypothesis, which proposes that schematic expectations reflect probabilistic relationships between sensory events learned implicitly through exposure. Using probability estimates from an unsupervised, variable-order Markov model, 12 melodic contexts high in entropy and 12 melodic contexts low in entropy were selected from two musical repertoires differing in structural complexity (simple and complex). Musicians and non-musicians listened to the stimuli and provided explicit judgments of perceived uncertainty (explicit uncertainty). We also examined an indirect measure of uncertainty computed as the entropy of expectedness distributions obtained using a classical probe-tone paradigm where listeners rated the perceived expectedness of the final note in a melodic sequence (inferred uncertainty). Finally, we simulate listeners' perception of expectedness and uncertainty using computational models of auditory expectation. A detailed model comparison indicates which model parameters maximize fit to the data and how they compare to existing models in the literature. The results show that listeners experience greater uncertainty in high-entropy musical contexts than low-entropy contexts. This effect is particularly apparent for inferred uncertainty and is stronger in musicians than non-musicians. Consistent with the Statistical Learning Hypothesis, the results suggest that increased domain-relevant training is associated with an increasingly accurate cognitive model of probabilistic structure in music. PMID:25295018
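
    The entropy measure named above can be illustrated with a short sketch. The two distributions are made-up examples of next-note probabilities, not values from the study, and the function name is an assumption.

```python
import math

def shannon_entropy(probs):
    """Entropy in bits of a discrete distribution over possible next events."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

flat = [0.25, 0.25, 0.25, 0.25]    # every continuation equally likely
peaked = [0.97, 0.01, 0.01, 0.01]  # one continuation strongly expected

high_uncertainty = shannon_entropy(flat)
low_uncertainty = shannon_entropy(peaked)
```

    A flat distribution gives the maximum entropy for its number of outcomes, which corresponds to the high-uncertainty contexts described in the abstract; a peaked one gives a value close to zero.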

  10. Predictive uncertainty in auditory sequence processing.

    PubMed

    Hansen, Niels Chr; Pearce, Marcus T

    2014-01-01

    Previous studies of auditory expectation have focused on the expectedness perceived by listeners retrospectively in response to events. In contrast, this research examines predictive uncertainty, a property of listeners' prospective state of expectation prior to the onset of an event. We examine the information-theoretic concept of Shannon entropy as a model of predictive uncertainty in music cognition. This is motivated by the Statistical Learning Hypothesis, which proposes that schematic expectations reflect probabilistic relationships between sensory events learned implicitly through exposure. Using probability estimates from an unsupervised, variable-order Markov model, 12 melodic contexts high in entropy and 12 melodic contexts low in entropy were selected from two musical repertoires differing in structural complexity (simple and complex). Musicians and non-musicians listened to the stimuli and provided explicit judgments of perceived uncertainty (explicit uncertainty). We also examined an indirect measure of uncertainty computed as the entropy of expectedness distributions obtained using a classical probe-tone paradigm where listeners rated the perceived expectedness of the final note in a melodic sequence (inferred uncertainty). Finally, we simulate listeners' perception of expectedness and uncertainty using computational models of auditory expectation. A detailed model comparison indicates which model parameters maximize fit to the data and how they compare to existing models in the literature. The results show that listeners experience greater uncertainty in high-entropy musical contexts than low-entropy contexts. This effect is particularly apparent for inferred uncertainty and is stronger in musicians than non-musicians. Consistent with the Statistical Learning Hypothesis, the results suggest that increased domain-relevant training is associated with an increasingly accurate cognitive model of probabilistic structure in music. PMID:25295018

  11. Detection of nucleic acid sequences by invader-directed cleavage

    DOEpatents

    Brow, Mary Ann D.; Hall, Jeff Steven Grotelueschen; Lyamichev, Victor; Olive, David Michael; Prudent, James Robert

    1999-01-01

    The present invention relates to means for the detection and characterization of nucleic acid sequences, as well as variations in nucleic acid sequences. The present invention also relates to methods for forming a nucleic acid cleavage structure on a target sequence and cleaving the nucleic acid cleavage structure in a site-specific manner. The 5' nuclease activity of a variety of enzymes is used to cleave the target-dependent cleavage structure, thereby indicating the presence of specific nucleic acid sequences or specific variations thereof. The present invention further relates to methods and devices for the separation of nucleic acid molecules based on charge.

  12. Structural gene and complete amino acid sequence of Pseudomonas aeruginosa IFO 3455 elastase.

    PubMed Central

    Fukushima, J; Yamamoto, S; Morihara, K; Atsumi, Y; Takeuchi, H; Kawamoto, S; Okuda, K

    1989-01-01

    The DNA encoding the elastase of Pseudomonas aeruginosa IFO 3455 was cloned, and its complete nucleotide sequence was determined. When the cloned gene was ligated to pUC18, the Escherichia coli expression vector, bacteria carrying the gene exhibited high levels of both elastase activity and elastase antigens. The amino acid sequence, deduced from the nucleotide sequence, revealed that the mature elastase consisted of 301 amino acids with a relative molecular mass of 32,926 daltons. The amino acid composition predicted from the DNA sequence was quite similar to the chemically determined composition of purified elastase reported previously. We also observed nucleotide sequence encoding a signal peptide and "pro" sequence consisting of 197 amino acids upstream from the mature elastase protein gene. The amino acid sequence analysis revealed that both the N-terminal sequence of the purified elastase and the N-terminal side sequences of the C-terminal tryptic peptide as well as the internal lysyl peptide fragment were completely identical to the deduced amino acid sequences. The pattern of identity of amino acid sequences was quite evident in the regions that include structurally and functionally important residues of Bacillus subtilis thermolysin. PMID:2493453

  13. Characterization and amino acid sequence of a fatty acid-binding protein from human heart.

    PubMed

    Offner, G D; Brecher, P; Sawlivich, W B; Costello, C E; Troxler, R F

    1988-05-15

    The complete amino acid sequence of a fatty acid-binding protein from human heart was determined by automated Edman degradation of CNBr, BNPS-skatole [3'-bromo-3-methyl-2-(2-nitrobenzenesulphenyl)indolenine], hydroxylamine, Staphylococcus aureus V8 proteinase, tryptic and chymotryptic peptides, and by digestion of the protein with carboxypeptidase A. The sequence of the blocked N-terminal tryptic peptide from citraconylated protein was determined by collisionally induced decomposition mass spectrometry. The protein contains 132 amino acid residues, is enriched with respect to threonine and lysine, lacks cysteine, has an acetylated valine residue at the N-terminus, and has an Mr of 14768 and an isoelectric point of 5.25. This protein contains two short internal repeated sequences from residues 48-54 and from residues 114-119 located within regions of predicted beta-structure and decreasing hydrophobicity. These short repeats are contained within two longer repeated regions from residues 48-60 and residues 114-125, which display 62% sequence similarity. These regions could accommodate the charged and uncharged moieties of long-chain fatty acids and may represent fatty acid-binding domains consistent with the finding that human heart fatty acid-binding protein binds 2 mol of oleate or palmitate/mol of protein. Detailed evidence for the amino acid sequences of the peptides has been deposited as Supplementary Publication SUP 50143 (23 pages) at the British Library Lending Division, Boston Spa, Yorkshire LS23 7BQ, U.K., from whom copies may be obtained as indicated in Biochem. J. (1988) 249, 5. PMID:3421901

  14. Computational methods in sequence and structure prediction

    NASA Astrophysics Data System (ADS)

    Lang, Caiyi

    This dissertation is organized into two parts. In the first part, we will discuss three computational methods for cis-regulatory element recognition in three different gene regulatory networks as the following: (a) Using a comprehensive "Phylogenetic Footprinting Comparison" method, we will investigate the promoter sequence structures of three enzymes (PAL, CHS and DFR) that catalyze sequential steps in the pathway from phenylalanine to anthocyanins in plants. Our result shows there exists a putative cis-regulatory element "AC(C/G)TAC(C)" upstream of these enzyme genes. We propose this cis-regulatory element to be responsible for the genetic regulation of these three enzymes; this element might also be the binding site for the MYB class transcription factor PAP1. (b) We will investigate the role of the Arabidopsis gene glutamate receptor 1.1 (AtGLR1.1) in C and N metabolism by utilizing the microarray data we obtained from AtGLR1.1 deficient lines (antiAtGLR1.1). We focus our investigation on the putatively co-regulated transcript profile of 876 genes we have collected in antiAtGLR1.1 lines. By (a) scanning the occurrence of several groups of known abscisic acid (ABA) related cis-regulatory elements in the upstream regions of 876 Arabidopsis genes; and (b) exhaustive scanning of all possible 6-10 bps motif occurrence in the upstream regions of the same set of genes, we are able to make a quantitative estimate of the enrichment level of each of the cis-regulatory element candidates. We finally conclude that one specific cis-regulatory element group, called "ABRE" elements, is statistically highly enriched within the 876-gene group as compared to its occurrence within the genome. (c) We will introduce a new general purpose algorithm, called "fuzzy REDUCE1", which we have developed recently for automated cis-regulatory element identification. In the second part, we will discuss our newly devised protein design framework. With this framework we have developed

  15. Hybridization and sequencing of nucleic acids using base pair mismatches

    DOEpatents

    Fodor, Stephen P. A.; Lipshutz, Robert J.; Huang, Xiaohua

    2001-01-01

    Devices and techniques for hybridization of nucleic acids and for determining the sequence of nucleic acids. Arrays of nucleic acids are formed by techniques, preferably high resolution, light-directed techniques. Positions of hybridization of a target nucleic acid are determined by, e.g., epifluorescence microscopy. Devices and techniques are proposed to determine the sequence of a target nucleic acid more efficiently and more quickly through such synthesis and detection techniques.

  16. Gene and translation initiation site prediction in metagenomic sequences

    SciTech Connect

    Hyatt, Philip Douglas; LoCascio, Philip F; Hauser, Loren John; Uberbacher, Edward C

    2012-01-01

    Gene prediction in metagenomic sequences remains a difficult problem. Current sequencing technologies do not achieve sufficient coverage to assemble the individual genomes in a typical sample; consequently, sequencing runs produce a large number of short sequences whose exact origin is unknown. Since these sequences are usually smaller than the average length of a gene, algorithms must make predictions based on very little data. We present MetaProdigal, a metagenomic version of the gene prediction program Prodigal, that can identify genes in short, anonymous coding sequences with a high degree of accuracy. The novel value of the method consists of enhanced translation initiation site identification, ability to identify sequences that use alternate genetic codes and confidence values for each gene call. We compare the results of MetaProdigal with other methods and conclude with a discussion of future improvements.

  17. 77 FR 65537 - Requirements for Patent Applications Containing Nucleotide Sequence and/or Amino Acid Sequence...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2012-10-29

    ... Amino Acid Sequence Disclosures ACTION: Proposed collection; comment request. SUMMARY: The United States....'' SUPPLEMENTARY INFORMATION: I. Abstract Patent applications that contain nucleotide and/or amino acid sequence disclosures must include a copy of the sequence listing in accordance with the requirements in 37 CFR...

  18. Methods and compositions for efficient nucleic acid sequencing

    DOEpatents

    Drmanac, Radoje

    2002-01-01

    Disclosed are novel methods and compositions for rapid and highly efficient nucleic acid sequencing based upon hybridization with two sets of small oligonucleotide probes of known sequences. Extremely large nucleic acid molecules, including chromosomes and non-amplified RNA, may be sequenced without prior cloning or subcloning steps. The methods of the invention also solve various current problems associated with sequencing technology such as, for example, high noise to signal ratios and difficult discrimination, attaching many nucleic acid fragments to a surface, preparing many, longer or more complex probes and labelling more species.

  19. Methods and compositions for efficient nucleic acid sequencing

    DOEpatents

    Drmanac, Radoje

    2006-07-04

    Disclosed are novel methods and compositions for rapid and highly efficient nucleic acid sequencing based upon hybridization with two sets of small oligonucleotide probes of known sequences. Extremely large nucleic acid molecules, including chromosomes and non-amplified RNA, may be sequenced without prior cloning or subcloning steps. The methods of the invention also solve various current problems associated with sequencing technology such as, for example, high noise to signal ratios and difficult discrimination, attaching many nucleic acid fragments to a surface, preparing many, longer or more complex probes and labelling more species.

  20. Kit for detecting nucleic acid sequences using competitive hybridization probes

    DOEpatents

    Lucas, Joe N.; Straume, Tore; Bogen, Kenneth T.

    2001-01-01

    A kit is provided for detecting a target nucleic acid sequence in a sample, the kit comprising: a first hybridization probe which includes a nucleic acid sequence that is sufficiently complementary to selectively hybridize to a first portion of the target sequence, the first hybridization probe including a first complexing agent for forming a binding pair with a second complexing agent; and a second hybridization probe which includes a nucleic acid sequence that is sufficiently complementary to selectively hybridize to a second portion of the target sequence to which the first hybridization probe does not selectively hybridize, the second hybridization probe including a detectable marker; a third hybridization probe which includes a nucleic acid sequence that is sufficiently complementary to selectively hybridize to a first portion of the target sequence, the third hybridization probe including the same detectable marker as the second hybridization probe; and a fourth hybridization probe which includes a nucleic acid sequence that is sufficiently complementary to selectively hybridize to a second portion of the target sequence to which the third hybridization probe does not selectively hybridize, the fourth hybridization probe including the first complexing agent for forming a binding pair with the second complexing agent; wherein the first and second hybridization probes are capable of simultaneously hybridizing to the target sequence and the third and fourth hybridization probes are capable of simultaneously hybridizing to the target sequence, the detectable marker is not present on the first or fourth hybridization probes and the first, second, third, and fourth hybridization probes each include a competitive nucleic acid sequence which is sufficiently complementary to a third portion of the target sequence that the competitive sequences of the first, second, third, and fourth hybridization probes compete with each other to hybridize to the third portion of 
the

  1. Discriminative prediction of mammalian enhancers from DNA sequence

    PubMed Central

    Lee, Dongwon; Karchin, Rachel; Beer, Michael A.

    2011-01-01

    Accurately predicting regulatory sequences and enhancers in entire genomes is an important but difficult problem, especially in large vertebrate genomes. With the advent of ChIP-seq technology, experimental detection of genome-wide EP300/CREBBP bound regions provides a powerful platform to develop predictive tools for regulatory sequences and to study their sequence properties. Here, we develop a support vector machine (SVM) framework which can accurately identify EP300-bound enhancers using only genomic sequence and an unbiased set of general sequence features. Moreover, we find that the predictive sequence features identified by the SVM classifier reveal biologically relevant sequence elements enriched in the enhancers, but we also identify other features that are significantly depleted in enhancers. The predictive sequence features are evolutionarily conserved and spatially clustered, providing further support of their functional significance. Although our SVM is trained on experimental data, we also predict novel enhancers and show that these putative enhancers are significantly enriched in both ChIP-seq signal and DNase I hypersensitivity signal in the mouse brain and are located near relevant genes. Finally, we present results of comparisons between other EP300/CREBBP data sets using our SVM and uncover sequence elements enriched and/or depleted in the different classes of enhancers. Many of these sequence features play a role in specifying tissue-specific or developmental-stage-specific enhancer activity, but our results indicate that some features operate in a general or tissue-independent manner. In addition to providing a high confidence list of enhancer targets for subsequent experimental investigation, these results contribute to our understanding of the general sequence structure of vertebrate enhancers. PMID:21875935
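
    The abstract does not say which sequence features the SVM uses. As a hedged illustration of what "general sequence features" computed from genomic sequence alone might look like, the sketch below builds a fixed-length k-mer frequency vector; the choice of k-mers, the value of `k`, and the function name are assumptions, not details taken from the abstract.

```python
from collections import Counter
from itertools import product

def kmer_features(seq, k=3):
    """Return a fixed-length vector of k-mer frequencies for a DNA sequence.

    The vector has 4**k entries, ordered AAA, AAC, AAG, ... (for k=3).
    k-mers containing characters other than A, C, G, T are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    total = max(len(seq) - k + 1, 1)  # avoid division by zero on short input
    return [counts[kmer] / total for kmer in vocabulary]

features = kmer_features("ACGTACGT", k=2)
```

    Vectors of this kind, computed for bound and unbound regions, could be passed to any SVM implementation as training input.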

  2. tax and rex Sequences of bovine leukaemia virus from globally diverse isolates: rex amino acid sequence more variable than tax.

    PubMed

    McGirr, K M; Buehring, G C

    2005-02-01

    Bovine leukaemia virus (BLV) is an important agricultural problem with high costs to the dairy industry. Here, we examine the variation of the tax and rex genes of BLV. The tax and rex genes share 420 bases and have overlapping reading frames. The tax gene encodes a protein that functions as a transactivator of the BLV promoter, is required for viral replication, acts on cellular promoters, and is responsible for oncogenesis. The rex gene product facilitates the export of viral mRNAs from the nucleus and regulates transcription. We have sequenced five new isolates of the tax/rex gene. We examined the five new and three previously published tax/rex DNA and predicted amino acid sequences of BLV isolates from cattle in representative regions worldwide. The highest variation among nucleic acid sequences for tax and rex was 7% and 5%, respectively; among predicted amino acid sequences for Tax and Rex, 9% and 11%, respectively. Significantly more nucleotide changes resulted in predicted amino acid changes in the rex gene than in the tax gene (P ≤ 0.0006). This variability is higher than previously reported for any region of the viral genome. This research may also have implications for the development of Tax-based vaccines. PMID:15702995

  3. Selection of sequence variants to improve dairy cattle genomic predictions

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Genomic prediction reliabilities improved when adding selected sequence variants from run 5 of the 1,000 bull genomes project. High density (HD) imputed genotypes for 26,970 progeny tested Holstein bulls were combined with sequence variants for 444 Holstein animals. The first test included 481,904 c...

  4. Solid phase sequencing of double-stranded nucleic acids

    DOEpatents

    Fu, Dong-Jing; Cantor, Charles R.; Koster, Hubert; Smith, Cassandra L.

    2002-01-01

    This invention relates to methods for detecting and sequencing of target double-stranded nucleic acid sequences, to nucleic acid probes and arrays of probes useful in these methods, and to kits and systems which contain these probes. Useful methods involve hybridizing the nucleic acids or nucleic acids which represent complementary or homologous sequences of the target to an array of nucleic acid probes. These probes comprise a single-stranded portion, an optional double-stranded portion and a variable sequence within the single-stranded portion. The molecular weights of the hybridized nucleic acids of the set can be determined by mass spectroscopy, and the sequence of the target determined from the molecular weights of the fragments. Nucleic acids whose sequences can be determined include nucleic acids in biological samples such as patient biopsies and environmental samples. Probes may be fixed to a solid support such as a hybridization chip to facilitate automated determination of molecular weights and identification of the target sequence.

  5. Toward genomic prediction from whole-genome sequence data: impact of sequencing design on genotype imputation and accuracy of predictions

    PubMed Central

    Druet, T; Macleod, I M; Hayes, B J

    2014-01-01

    Genomic prediction from whole-genome sequence data is attractive, as the accuracy of genomic prediction is no longer bounded by extent of linkage disequilibrium between DNA markers and causal mutations affecting the trait, given the causal mutations are in the data set. A cost-effective strategy could be to sequence a small proportion of the population, and impute sequence data to the rest of the reference population. Here, we describe strategies for selecting individuals for sequencing, based on either pedigree relationships or haplotype diversity. Performance of these strategies (number of variants detected and accuracy of imputation) was evaluated in sequence data simulated through a real Belgian Blue cattle pedigree. A strategy (AHAP), which selected a subset of individuals for sequencing that maximized the number of unique haplotypes (from single-nucleotide polymorphism panel data) sequenced gave good performance across a range of variant minor allele frequencies. We then investigated the optimum number of individuals to sequence by fold coverage given a maximum total sequencing effort. At 600 total fold coverage (600×), the optimum strategy was to sequence 75 individuals at eightfold coverage. Finally, we investigated the accuracy of genomic predictions that could be achieved. The advantage of using imputed sequence data compared with dense SNP array genotypes was highly dependent on the allele frequency spectrum of the causative mutations affecting the trait. When this followed a neutral distribution, the advantage of the imputed sequence data was small; however, when the causal mutations all had low minor allele frequencies, using the sequence data improved the accuracy of genomic prediction by up to 30%. PMID:23549338
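
    The trade-off between the number of individuals and per-individual coverage under a fixed sequencing budget is simple arithmetic, sketched below. Only the 600-fold budget and the eightfold option come from the abstract; the other coverage values and the function name are illustrative assumptions.

```python
def sequencing_designs(total_fold, coverages):
    """Map each per-individual coverage to how many individuals the budget allows."""
    return {c: total_fold // c for c in coverages}

# A 600-fold budget split at several hypothetical coverage depths.
designs = sequencing_designs(600, [4, 8, 12, 24])
individuals_at_eightfold = designs[8]
```

    Which of these designs is best depends on imputation accuracy and variant detection, which the study evaluated by simulation and which this sketch does not model.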

  6. Accuracy of sequence alignment and fold assessment using reduced amino acid alphabets.

    PubMed

    Melo, Francisco; Marti-Renom, Marc A

    2006-06-01

    Reduced or simplified amino acid alphabets group the 20 naturally occurring amino acids into a smaller number of representative protein residues. To date, several reduced amino acid alphabets have been proposed, which have been derived and optimized by a variety of methods. The resulting reduced amino acid alphabets have been applied to pattern recognition, generation of consensus sequences from multiple alignments, protein folding, and protein structure prediction. In this work, amino acid substitution matrices and statistical potentials were derived based on several reduced amino acid alphabets and their performance assessed in a large benchmark for the tasks of sequence alignment and fold assessment of protein structure models, using as a reference frame the standard alphabet of 20 amino acids. The results showed that a large reduction in the total number of residue types does not necessarily translate into a significant loss of discriminative power for sequence alignment and fold assessment. Therefore, some definitions of a few residue types are able to encode most of the relevant sequence/structure information that is present in the 20 standard amino acids. Based on these results, we suggest that the use of reduced amino acid alphabets may increase the accuracy of current substitution matrices and statistical potentials for the prediction of protein structure of remote homologs. PMID:16506243
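
    To make the idea of a reduced alphabet concrete, the sketch below maps the 20 standard amino acids onto five classes. The grouping and the class labels are a hypothetical example chosen for illustration; they are not one of the alphabets evaluated in the study.

```python
# Hypothetical five-class grouping by broad physicochemical character.
GROUPS = {
    "h": "AVLIMC",  # aliphatic / hydrophobic
    "a": "FWYH",    # aromatic
    "p": "STNQ",    # polar, uncharged
    "c": "KRDE",    # charged
    "s": "GP",      # small / conformationally special
}

# Invert the grouping into a residue -> class lookup table.
REDUCED = {aa: label for label, members in GROUPS.items() for aa in members}

def reduce_sequence(seq):
    """Rewrite a protein sequence in the reduced alphabet; unknown residues become 'x'."""
    return "".join(REDUCED.get(aa, "x") for aa in seq.upper())

reduced = reduce_sequence("MKVLG")
```

    Substitution matrices or statistical potentials derived over such an alphabet have far fewer parameters than their 20-letter counterparts, which is the property the abstract examines.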

  7. Matrix genes of measles virus and canine distemper virus: cloning, nucleotide sequences, and deduced amino acid sequences.

    PubMed Central

    Bellini, W J; Englund, G; Richardson, C D; Rozenblatt, S; Lazzarini, R A

    1986-01-01

    The nucleotide sequences encoding the matrix (M) proteins of measles virus (MV) and canine distemper virus (CDV) were determined from cDNA clones containing these genes in their entirety. In both cases, single open reading frames specifying basic proteins of 335 amino acid residues were predicted from the nucleotide sequences. Both viral messages were composed of approximately 1,450 nucleotides and contained 400 nucleotides of presumptive noncoding sequences at their respective 3' ends. MV and CDV M-protein-coding regions were 67% homologous at the nucleotide level and 76% homologous at the amino acid level. Only chance homology was observed in the 400-nucleotide trailer sequences. Comparisons of the M protein sequences of MV and CDV with the sequence reported for Sendai virus (B. M. Blumberg, K. Rose, M. G. Simona, L. Roux, C. Giorgi, and D. Kolakofsky, J. Virol. 52:656-663; Y. Hidaka, T. Kanda, K. Iwasaki, A. Nomoto, T. Shioda, and H. Shibuta, Nucleic Acids Res. 12:7965-7973) indicated the greatest homology among these M proteins in the carboxyterminal third of the molecule. Secondary-structure analyses of this shared region indicated a structurally conserved, hydrophobic sequence which possibly interacted with the lipid bilayer. PMID:3754588

  8. From Artificial Amino Acids to Sequence-Defined Targeted Oligoaminoamides.

    PubMed

    Morys, Stephan; Wagner, Ernst; Lächelt, Ulrich

    2016-01-01

    Artificial oligoamino acids with appropriate protecting groups can be used for the sequential assembly of oligoaminoamides on solid-phase. With the help of these oligoamino acids multifunctional nucleic acid (NA) carriers can be designed and produced in highly defined topologies. Here we describe the synthesis of the artificial oligoamino acid Fmoc-Stp(Boc3)-OH, the subsequent assembly into sequence-defined oligomers and the formulation of tumor-targeted plasmid DNA (pDNA) polyplexes. PMID:27436323

  9. Cerebellar sequencing: a trick for predicting the future.

    PubMed

    Leggio, M; Molinari, M

    2015-02-01

    "Looking into the future" well depicts one of the most significant concepts in cognitive neuroscience: the brain is constantly predicting future events. Such directedness toward the future has been recognized to be relevant to and beneficial for many aspects of information processing in humans, such as perception, motor and cognitive control, decision-making, theory of mind, and other cognitive processes. Because one of the most adaptive characteristics of the brain is to correct errors, the ability to look into the future represents the best chance to avoid repeating errors. Within the structures that constitute the "predictive brain," the cerebellum has been proposed to have a central function, based on its ability to generate internal models. We suggested that "sequence detection" is the operational mode of the cerebellum in predictive processing. According to this hypothesis, the cerebellum detects and simulates repetitive patterns of temporally or spatially structured events and generates internal models that can be used to make predictions. Consequently, we demonstrate that the cerebellum recognizes serial events as a sequence, detects a sequence violation, and successfully reconstructs the correct sequence of events. Thus, we hypothesize that pattern detection and prediction and processing of anticipation are cerebellum-specific functions within the brain and that the sequence detection hypothesis links the multifarious impairments that are reported in patients with cerebellar damage. We propose that this cerebellar operational mode can advance our understanding of the pathophysiological mechanisms in various clinical conditions, such as schizophrenia and autism. PMID:25331541

  10. Detecting frame shifts by amino acid sequence comparison.

    PubMed

    Claverie, J M

    1993-12-20

    Various amino acid substitution scoring matrices are used in conjunction with local alignments programs to detect regions of similarity and infer potential common ancestry between proteins. The usual scoring schemes derive from the implicit hypothesis that related proteins evolve from a common ancestor by the accumulation of point mutations and that amino acids tend to be progressively substituted by others with similar properties. However, other frequent single mutation events, like nucleotide insertion or deletion and gene inversion, change the translation reading frame and cause previously encoded amino acid sequences to become unrecognizable at once. Here, I derive five new types of scoring matrix, each capable of detecting a specific frame shift (deletion, insertion and inversion in 3 frames) and use them with a regular local alignments program to detect amino acid sequences that may have derived from alternative reading frames of the same nucleotide sequence. Frame shifts are inferred from the sole comparison of the protein sequences. The five scoring matrices were used with the BLASTP program to compare all the protein sequences in the Swissprot database. Surprisingly, the searches revealed hundreds of highly significant frame shift matches, of which many are likely to represent sequencing errors. Others provide some evidence that frame shift mutations might be used in protein evolution as a way to create new amino acid sequences from pre-existing coding regions. PMID:7903399

  11. Segments of amino acid sequence similarity in beta-amylases.

    PubMed

    Friedberg, F; Rhodes, C

    1988-01-01

    In alpha-amylases from animals, plants and bacteria and in beta-amylases from plants and bacteria a number of segments exhibit amino acid sequence similarity specific to the alpha or to the beta type, respectively. In the case of the beta-amylases the similar sequence regions are extensive and they are disrupted only by short interspersed dissimilar regions. Close to the C terminus, however, no such sequence similarity exists. PMID:2464171

  12. Learned spatiotemporal sequence recognition and prediction in primary visual cortex

    PubMed Central

    Gavornik, Jeffrey P.; Bear, Mark F.

    2014-01-01

    Learning to recognize and predict temporal sequences is fundamental to sensory perception, and is impaired in several neuropsychiatric disorders, but little is known about where and how this occurs in the brain. We discovered that repeated presentations of a visual sequence over a course of days cause evoked response potentiation in mouse V1 that is highly specific for stimulus order and timing. Remarkably, after V1 is trained to recognize a sequence, cortical activity regenerates the full sequence even when individual stimulus elements are omitted. This novel neurophysiological report of sequence learning advances the understanding of how the brain makes “intelligent guesses” based on limited information to form visual percepts and suggests that it is possible to study the mechanistic basis of this high-level cognitive ability by studying low-level sensory systems. PMID:24657967

  13. Prediction of fine-tuned promoter activity from DNA sequence

    PubMed Central

    Siwo, Geoffrey; Rider, Andrew; Tan, Asako; Pinapati, Richard; Emrich, Scott; Chawla, Nitesh; Ferdig, Michael

    2016-01-01

    The quantitative prediction of transcriptional activity of genes using promoter sequence is fundamental to the engineering of biological systems for industrial purposes and understanding the natural variation in gene expression. To catalyze the development of new algorithms for this purpose, the Dialogue on Reverse Engineering Assessment and Methods (DREAM) organized a community challenge seeking predictive models of promoter activity given normalized promoter activity data for 90 ribosomal protein promoters driving expression of a fluorescent reporter gene. By developing an unbiased modeling approach that performs an iterative search for predictive DNA sequence features using the frequencies of various k-mers, inferred DNA mechanical properties and spatial positions of promoter sequences, we achieved the best performer status in this challenge. The specific predictive features used in the model included the frequency of the nucleotide G, the length of polymeric tracts of T and TA, the frequencies of 6 distinct trinucleotides and 12 tetranucleotides, and the predicted protein deformability of the DNA sequence. Our method accurately predicted the activity of 20 natural variants of ribosomal protein promoters (Spearman correlation r = 0.73) as compared to 33 laboratory-mutated variants of the promoters (r = 0.57) in a test set that was hidden from participants. Notably, our model differed substantially from the rest in 2 main ways: i) it did not explicitly utilize transcription factor binding information implying that subtle DNA sequence features are highly associated with gene expression, and ii) it was entirely based on features extracted exclusively from the 100 bp region upstream from the translational start site demonstrating that this region encodes much of the overall promoter activity. The findings from this study have important implications for the engineering of predictable gene expression systems and the evolution of gene expression in naturally occurring
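
    The sequence-derived features named in this abstract (frequency of G, length of T and TA tracts, trinucleotide frequencies) can be illustrated with a short sketch. This is not the authors' pipeline: the function names, the choice of overlapping k-mer counting, and the definition of a tract as the longest uninterrupted run are all assumptions made here for illustration.

```python
from collections import Counter


def kmer_frequencies(seq, k):
    """Relative frequency of each overlapping k-mer in seq (assumed definition)."""
    seq = seq.upper()
    total = len(seq) - k + 1
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: n / total for kmer, n in counts.items()}


def longest_tract(seq, unit):
    """Length in bases of the longest uninterrupted run of repeated `unit`."""
    seq = seq.upper()
    n = len(unit)
    best = 0
    for start in range(len(seq)):
        i = start
        # Step forward one repeat unit at a time while the run continues.
        while seq[i:i + n] == unit:
            i += n
        best = max(best, i - start)
    return best


def promoter_features(seq):
    """Hypothetical feature dictionary for one promoter sequence."""
    return {
        "freq_G": kmer_frequencies(seq, 1).get("G", 0.0),
        "max_T_tract": longest_tract(seq, "T"),
        "max_TA_tract": longest_tract(seq, "TA"),
        "trinucleotides": kmer_frequencies(seq, 3),
    }
```

    A feature table built this way, one row per promoter, could then be passed to any regression model; the abstract does not say which model the authors used.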

  14. Accurate prediction of protein–protein interactions from sequence alignments using a Bayesian method

    PubMed Central

    Burger, Lukas; van Nimwegen, Erik

    2008-01-01

    Accurate and large-scale prediction of protein–protein interactions directly from amino-acid sequences is one of the great challenges in computational biology. Here we present a new Bayesian network method that predicts interaction partners using only multiple alignments of amino-acid sequences of interacting protein domains, without tunable parameters, and without the need for any training examples. We first apply the method to bacterial two-component systems and comprehensively reconstruct two-component signaling networks across all sequenced bacteria. Comparisons of our predictions with known interactions show that our method infers interaction partners genome-wide with high accuracy. To demonstrate the general applicability of our method we show that it also accurately predicts interaction partners in a recent dataset of polyketide synthases. Analysis of the predicted genome-wide two-component signaling networks shows that cognates (interacting kinase/regulator pairs, which lie adjacent on the genome) and orphans (which lie isolated) form two relatively independent components of the signaling network in each genome. In addition, while most genes are predicted to have only a small number of interaction partners, we find that 10% of orphans form a separate class of ‘hub' nodes that distribute and integrate signals to and from up to tens of different interaction partners. PMID:18277381

  15. 37 CFR 1.821 - Nucleotide and/or amino acid sequence disclosures in patent applications.

    Code of Federal Regulations, 2013 CFR

    2013-07-01

    ... acids are not intended to be embraced by this definition. Any amino acid sequence that contains post-translationally modified amino acids may be described as the amino acid sequence that is initially translated... sequence of four or more amino acids or an unbranched sequence of ten or more nucleotides....

  16. 37 CFR 1.821 - Nucleotide and/or amino acid sequence disclosures in patent applications.

    Code of Federal Regulations, 2012 CFR

    2012-07-01

    ... acids are not intended to be embraced by this definition. Any amino acid sequence that contains post-translationally modified amino acids may be described as the amino acid sequence that is initially translated... sequence of four or more amino acids or an unbranched sequence of ten or more nucleotides....

  17. 37 CFR 1.821 - Nucleotide and/or amino acid sequence disclosures in patent applications.

    Code of Federal Regulations, 2014 CFR

    2014-07-01

    ... acids are not intended to be embraced by this definition. Any amino acid sequence that contains post-translationally modified amino acids may be described as the amino acid sequence that is initially translated... sequence of four or more amino acids or an unbranched sequence of ten or more nucleotides....

  18. Local predictability in biological sequences, algorithm and applications.

    PubMed

    Lebbe, J; Vignes, R

    1993-01-01

    The goal of this paper is to propose an algorithm based on the k nearest neighbours to compute a local predictability measure in biological sequences. Some ideas about the usefulness of this measure are discussed on the basis of preliminary experimentations. PMID:8347724

  19. Efficient prediction methods for selecting effective siRNA sequences.

    PubMed

    Takasaki, Shigeru

    2010-02-01

    Although short interfering RNA (siRNA) has been widely used for studying gene functions in mammalian cells, its gene silencing efficacy varies markedly and there are only a few consistencies among the recently reported design rules/guidelines for selecting siRNA sequences effective for mammalian genes. Another shortcoming of the previously reported methods is that they cannot estimate the probability that a candidate sequence will silence the target gene. This paper first reviewed the recently reported siRNA design guidelines and clarified the problems concerning the guidelines. It then proposed two prediction methods, a Radial Basis Function (RBF) network and decision tree learning, and their combined method for selecting effective siRNA target sequences from many possible candidate sequences. They are quite different from the previous score-based siRNA design techniques and can predict the probability that a candidate siRNA sequence will be effective. The methods imply high estimation accuracy for selecting candidate siRNA sequences. PMID:20022002

  20. Bayesian classification for promoter prediction in human DNA sequences

    NASA Astrophysics Data System (ADS)

    Bercher, J.-F.; Jardin, P.; Duriez, B.

    2006-11-01

    Many computational methods are already available for data retrieval and analysis of genomic sequences, but some functional sites are difficult to characterize. In this work, we examine the problem of promoter localization in human DNA sequences. Promoters are regulatory regions that govern the expression of genes, and their prediction is reputedly difficult, so the issue remains open. We present the Chaos Game representation (CGR) of DNA sequences, which has many interesting properties, and the notion of 'genomic signature' that proved relevant in phylogeny applications. Based on this notion, we develop a (naïve) Bayesian classifier, evaluate its performance, and show that its adaptive implementation makes it possible to reveal or assess core-promoter positions along a DNA sequence.

  1. A method to find palindromes in nucleic acid sequences.

    PubMed

    Anjana, Ramnath; Shankar, Mani; Vaishnavi, Marthandan Kirti; Sekar, Kanagaraj

    2013-01-01

    Various types of sequences in the human genome are known to play important roles in different aspects of genomic functioning. Among these sequences, palindromic nucleic acid sequences are one type that has been studied in detail and found to influence a wide variety of genomic characteristics. For a nucleotide sequence to be considered a palindrome, its complementary strand must read the same in the opposite direction. For example, both strands, i.e. the strand going from 5' to 3' and its complementary strand from 3' to 5', must be complementary. A typical nucleotide palindromic sequence would be TATA (5' to 3') and its complementary sequence from 3' to 5' would be ATAT. Thus, a new method has been developed using dynamic programming to fetch the palindromic nucleic acid sequences. The new method uses less memory and thereby it increases the overall speed and efficiency. The proposed method has been tested using the bacterial (3891 KB bases) and human chromosomal sequences (Chr-18: 74366 kb and Chr-Y: 25554 kb) and the computation time for finding the palindromic sequences is in milliseconds. PMID:23515654
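
    The palindrome definition given in this abstract (a sequence equal to its own reverse complement, such as TATA) can be checked with a few lines of code. The sketch below is not the dynamic-programming method the authors describe, whose details the abstract does not give; it swaps in a plain expand-around-centre scan, and the function names and the `min_len` parameter are assumptions for illustration.

```python
# Watson-Crick pairing for DNA; other symbols (e.g. N) never pair here.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Reverse complement of an A/C/G/T sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def is_palindrome(seq):
    """True if seq reads the same as its reverse complement."""
    seq = seq.upper()
    return seq == reverse_complement(seq)


def find_palindromes(seq, min_len=4):
    """Return (start, end_exclusive, subsequence) for each maximal palindrome.

    Reverse-complement palindromes have even length, so every candidate is
    centred between two adjacent bases and is grown outwards from there.
    """
    seq = seq.upper()
    hits = []
    for centre in range(1, len(seq)):
        left, right = centre - 1, centre
        while (left >= 0 and right < len(seq)
               and COMPLEMENT.get(seq[left]) == seq[right]):
            left -= 1
            right += 1
        start, end = left + 1, right
        if end - start >= min_len:
            hits.append((start, end, seq[start:end]))
    return hits
```

    For example, `find_palindromes("CCGAATTCAA", min_len=6)` reports the EcoRI-like site GAATTC at positions 2 to 8. The scan is quadratic in the worst case, which is acceptable for a demonstration but says nothing about the speed claims in the abstract.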

  2. BETTY: prediction of beta-strand type from sequence.

    PubMed

    Zimmermann, Olav; Wang, Longhui; Hansmann, Ulrich H E

    2007-01-01

    Most secondary structure prediction programs do not distinguish between parallel and antiparallel beta-sheets. However, such knowledge would constrain the available topologies of a protein significantly, and therefore aid existing fold recognition algorithms. For this reason, we propose a technique which, in combination with existing secondary structure programs such as PSIPRED, allows one to distinguish between parallel and antiparallel beta-sheets. We propose the use of a support vector machine (SVM) procedure, BETTY, to predict parallel and antiparallel sheets from sequence. We found that there is a strong signal difference in the sequence profiles which SVMs can efficiently extract. With strand type assignment accuracies of 90.7% and 83.3% for antiparallel and parallel strands, respectively, our method adds considerably to existing information on current 3-class secondary structure predictions. BETTY has been implemented as an online service which academic researchers can access from our website http://www.fz-juelich.de/nic/cbb/service/service.php. PMID:18391242

  3. Predicting terrorist actions using sequence learning and past events

    NASA Astrophysics Data System (ADS)

    Ruda, Harald; Das, Subrata K.; Zacharias, Greg L.

    2003-09-01

    This paper describes the application of sequence learning to the domain of terrorist group actions. The goal is to make accurate predictions of future events based on learning from past history. The past history of the group is represented as a sequence of events. Well-established sequence learning approaches are used to generate temporal rules from the event sequence. In order to represent all the possible events involving a terrorist group's activities, an event taxonomy has been created that organizes the events into a hierarchical structure. The event taxonomy is applied when events are extracted, and the hierarchical form of the taxonomy is especially useful when only scant information is available about an event. The taxonomy can also be used to generate temporal rules at various levels of abstraction. The generated temporal rules are used to generate predictions that can be compared to actual events for evaluation. The approach was tested on events collected for a four-year period from relevant newspaper articles and other open-source literature. Temporal rules were generated based on the first half of the data, and predictions were generated for the second half of the data. Evaluation yielded a high hit rate and a moderate false-alarm rate.

  4. Amino acid sequence repertoire of the bacterial proteome and the occurrence of untranslatable sequences.

    PubMed

    Navon, Sharon Penias; Kornberg, Guy; Chen, Jin; Schwartzman, Tali; Tsai, Albert; Puglisi, Elisabetta Viani; Puglisi, Joseph D; Adir, Noam

    2016-06-28

    Bioinformatic analysis of Escherichia coli proteomes revealed that all possible amino acid triplet sequences occur at their expected frequencies, with four exceptions. Two of the four underrepresented sequences (URSs) were shown to interfere with translation in vivo and in vitro. Enlarging the URS by a single amino acid resulted in increased translational inhibition. Single-molecule methods revealed stalling of translation at the entrance of the peptide exit tunnel of the ribosome, adjacent to ribosomal nucleotides A2062 and U2585. Interaction with these same ribosomal residues is involved in regulation of translation by longer, naturally occurring protein sequences. The E. coli exit tunnel has evidently evolved to minimize interaction with the exit tunnel and maximize the sequence diversity of the proteome, although allowing some interactions for regulatory purposes. Bioinformatic analysis of the human proteome revealed no underrepresented triplet sequences, possibly reflecting an absence of regulation by interaction with the exit tunnel. PMID:27307442
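
    The comparison of observed and expected triplet counts mentioned in this abstract can be sketched as follows. The abstract does not state how the expected frequencies were obtained, so the null model used here (independent residues, expected count equal to the product of single-residue frequencies times the number of triplets) is an assumption, as are the function names and thresholds.

```python
from collections import Counter
from itertools import product


def triplet_observed_expected(proteins):
    """Map each possible triplet to (observed count, expected count).

    Expected counts assume residues occur independently (assumed null model).
    """
    residue_counts = Counter()
    triplet_counts = Counter()
    for protein in proteins:
        residue_counts.update(protein)
        triplet_counts.update(protein[i:i + 3] for i in range(len(protein) - 2))
    n_residues = sum(residue_counts.values())
    n_triplets = sum(triplet_counts.values())
    freqs = {aa: count / n_residues for aa, count in residue_counts.items()}
    table = {}
    for a, b, c in product(sorted(freqs), repeat=3):
        expected = freqs[a] * freqs[b] * freqs[c] * n_triplets
        table[a + b + c] = (triplet_counts.get(a + b + c, 0), expected)
    return table


def underrepresented(table, max_ratio=0.1, min_expected=5.0):
    """Triplets whose observed/expected ratio is at most max_ratio.

    Triplets with a very small expected count are skipped because their
    ratio is not informative. Both thresholds are illustrative choices.
    """
    return sorted(
        triplet
        for triplet, (observed, expected) in table.items()
        if expected >= min_expected and observed / expected <= max_ratio
    )
```

    On a real proteome the input would be the list of all protein sequences; a proper analysis would also attach a significance test to each ratio, which this sketch omits.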

  5. PredyFlexy: flexibility and local structure prediction from sequence

    PubMed Central

    de Brevern, Alexandre G.; Bornot, Aurélie; Craveur, Pierrick; Etchebest, Catherine; Gelly, Jean-Christophe

    2012-01-01

    Protein structures are necessary for understanding protein function at a molecular level. Dynamics and flexibility of protein structures are also key elements of protein function. So, we have proposed to look at protein flexibility using novel methods: (i) using a structural alphabet and (ii) combining classical X-ray B-factor data and molecular dynamics simulations. First, we established a library composed of structural prototypes (LSPs) to describe protein structure by a limited set of recurring local structures. We developed a prediction method that proposes structural candidates in terms of LSPs and predicts protein flexibility along a given sequence. Second, we examine flexibility according to two different descriptors: X-ray B-factors considered as good indicators of flexibility and the root mean square fluctuations, based on molecular dynamics simulations. We then define three flexibility classes and propose a method based on the LSP prediction method for predicting flexibility along the sequence. This method does not resort to sophisticated learning of flexibility but predicts flexibility from average flexibility of predicted local structures. The method is implemented in PredyFlexy web server. Results are similar to those obtained with the most recent, cutting-edge methods based on direct learning of flexibility data conducted with sophisticated algorithms. PredyFlexy can be accessed at http://www.dsimb.inserm.fr/dsimb_tools/predyflexy/. PMID:22689641

  6. Draft genome sequence of the docosahexaenoic acid producing thraustochytrid Aurantiochytrium sp. T66.

    PubMed

    Liu, Bin; Ertesvåg, Helga; Aasen, Inga Marie; Vadstein, Olav; Brautaset, Trygve; Heggeset, Tonje Marita Bjerkan

    2016-06-01

    Thraustochytrids are unicellular, marine protists, and there is a growing industrial interest in these organisms, particularly because some species, including strains belonging to the genus Aurantiochytrium, accumulate high levels of docosahexaenoic acid (DHA). Here, we report the draft genome sequence of Aurantiochytrium sp. T66 (ATCC PRA-276), with a size of 43 Mbp, and 11,683 predicted protein-coding sequences. The data has been deposited at DDBJ/EMBL/Genbank under the accession LNGJ00000000. The genome sequence will contribute new insight into DHA biosynthesis and regulation, providing a basis for metabolic engineering of thraustochytrids. PMID:27222814

  7. On Quantum Algorithm for Multiple Alignment of Amino Acid Sequences

    NASA Astrophysics Data System (ADS)

    Iriyama, Satoshi; Ohya, Masanori

    2009-02-01

    The alignment of genome sequences or amino acid sequences is one of the fundamental operations for the study of life. The usual computational complexity for the multiple alignment of N sequences with common length L by dynamic programming is O(L^N). This alignment is considered as one of the NP problems, so that it is desirable to find a nice algorithm for the multiple alignment. Thus in this paper we propose a quantum algorithm for the multiple alignment based on the works [12,1,2] in which the NP complete problem was shown to be a P problem by means of a quantum algorithm and chaos information dynamics.

  8. Learning to predict: Exposure to temporal sequences facilitates prediction of future events

    PubMed Central

    Baker, Rosalind; Dexter, Matthew; Hardwicke, Tom E.; Goldstone, Aimee; Kourtzi, Zoe

    2014-01-01

    Previous experience is thought to facilitate our ability to extract spatial and temporal regularities from cluttered scenes. However, little is known about how we may use this knowledge to predict future events. Here we test whether exposure to temporal sequences facilitates the visual recognition of upcoming stimuli. We presented observers with a sequence of leftwards and rightwards oriented gratings that was interrupted by a test stimulus. Observers were asked to indicate whether the orientation of the test stimulus matched their expectation based on the preceding sequence. Our results demonstrate that exposure to temporal sequences without feedback facilitates our ability to predict an upcoming stimulus. In particular, observers’ performance improved following exposure to structured but not random sequences. Improved performance lasted for a prolonged period and generalized to untrained stimulus orientations rather than sequences of different global structure, suggesting that observers acquire knowledge of the sequence structure rather than its items. Further, this learning was compromised when observers performed a dual task resulting in increased attentional load. These findings suggest that exposure to temporal regularities in a scene allows us to accumulate knowledge about its global structure and predict future events. PMID:24231115

  9. Prebiotically plausible mechanisms increase compositional diversity of nucleic acid sequences

    PubMed Central

    Derr, Julien; Manapat, Michael L.; Rajamani, Sudha; Leu, Kevin; Xulvi-Brunet, Ramon; Joseph, Isaac; Nowak, Martin A.; Chen, Irene A.

    2012-01-01

    During the origin of life, the biological information of nucleic acid polymers must have increased to encode functional molecules (the RNA world). Ribozymes tend to be compositionally unbiased, as is the vast majority of possible sequence space. However, ribonucleotides vary greatly in synthetic yield, reactivity and degradation rate, and their non-enzymatic polymerization results in compositionally biased sequences. While natural selection could lead to complex sequences, molecules with some activity are required to begin this process. Was the emergence of compositionally diverse sequences a matter of chance, or could prebiotically plausible reactions counter chemical biases to increase the probability of finding a ribozyme? Our in silico simulations using a two-letter alphabet show that template-directed ligation and high concatenation rates counter compositional bias and shift the pool toward longer sequences, permitting greater exploration of sequence space and stable folding. We verified experimentally that unbiased DNA sequences are more efficient templates for ligation, thus increasing the compositional diversity of the pool. Our work suggests that prebiotically plausible chemical mechanisms of nucleic acid polymerization and ligation could predispose toward a diverse pool of longer, potentially structured molecules. Such mechanisms could have set the stage for the appearance of functional activity very early in the emergence of life. PMID:22319215

  10. The amino-acid sequence of kangaroo pancreatic ribonuclease.

    PubMed

    Gaastra, W; Welling, G W; Beintema, J J

    1978-05-01

    Red kangaroo (Macropus rufus) ribonuclease was isolated from pancreatic tissue by affinity chromatography. The amino acid sequence was determined by automatic sequencing of overlapping large fragments and by analysis of shorter peptides obtained by digestion with a number of proteolytic enzymes. The polypeptide chain consists of 122 amino acid residues. Compared to other ribonucleases, the N-terminal residue and residue 114 are deleted. In other pancreatic ribonucleases position 114 is occupied by a cis proline residue in an external loop at the surface of the molecule. Other remarkable substitutions are the presence of a tyrosine residue at position 123 instead of a serine which forms a hydrogen bond with the pyrimidine ring of a nucleotide substrate, and a number of hydrophobic-hydrophilic interchanges in the sequence 51-55, which forms part of an alpha-helix in bovine ribonuclease and exhibits few substitutions in the placental mammals. Kangaroo ribonuclease contains no carbohydrate, although the enzyme possesses a recognition site for carbohydrate attachment in the sequence Asn-Val-Thr (62-64). The enzyme differs at about 35-40% of the positions from all other mammalian pancreatic ribonucleases sequenced to date, which is in agreement with the early divergence between the marsupials and the placental mammals. From fragmentary data a tentative sequence of red-necked wallaby (Macropus rufogriseus) pancreatic ribonuclease has been derived. Eight differences with the kangaroo sequence were found. PMID:658039

  11. Prediction of ribosome footprint profile shapes from transcript sequences

    PubMed Central

    Liu, Tzu-Yu; Song, Yun S.

    2016-01-01

    Motivation: Ribosome profiling is a useful technique for studying translational dynamics and quantifying protein synthesis. Applications of this technique have shown that ribosomes are not uniformly distributed along mRNA transcripts. Understanding how each transcript-specific distribution arises is important for unraveling the translation mechanism. Results: Here, we apply kernel smoothing to construct predictive features and build a sparse model to predict the shape of ribosome footprint profiles from transcript sequences alone. Our results on Saccharomyces cerevisiae data show that the marginal ribosome densities can be predicted with high accuracy. The proposed novel method has a wide range of applications, including inferring isoform-specific ribosome footprints, designing transcripts with fast translation speeds and discovering unknown modulation during translation. Availability and implementation: A software package called riboShape is freely available at https://sourceforge.net/projects/riboshape Contact: yss@berkeley.edu PMID:27307616

  12. Exploiting single-molecule transcript sequencing for eukaryotic gene prediction.

    PubMed

    Minoche, André E; Dohm, Juliane C; Schneider, Jessica; Holtgräwe, Daniela; Viehöver, Prisca; Montfort, Magda; Sörensen, Thomas Rosleff; Weisshaar, Bernd; Himmelbauer, Heinz

    2015-01-01

    We develop a method to predict and validate gene models using PacBio single-molecule, real-time (SMRT) cDNA reads. Ninety-eight percent of full-insert SMRT reads span complete open reading frames. Gene model validation using SMRT reads is developed as an automated process. Optimized training and prediction settings and mRNA-seq noise reduction of assisting Illumina reads result in increased gene prediction sensitivity and precision. Additionally, we present an improved gene set for sugar beet (Beta vulgaris) and the first genome-wide gene set for spinach (Spinacia oleracea). The workflow and guidelines are a valuable resource to obtain comprehensive gene sets for newly sequenced genomes of non-model eukaryotes. PMID:26328666

  13. Amino acid sequence of Salmonella typhimurium branched-chain amino acid aminotransferase.

    PubMed

    Feild, M J; Nguyen, D C; Armstrong, F B

    1989-06-13

    The complete amino acid sequence of the subunit of branched-chain amino acid aminotransferase (transaminase B, EC 2.6.1.42) of Salmonella typhimurium was determined. An Escherichia coli recombinant containing the ilvGEDAY gene cluster of Salmonella was used as the source of the hexameric enzyme. The peptide fragments used for sequencing were generated by treatment with trypsin, Staphylococcus aureus V8 protease, endoproteinase Lys-C, and cyanogen bromide. The enzyme subunit contains 308 residues and has a molecular weight of 33,920. To determine the coenzyme-binding site, the pyridoxal 5-phosphate containing enzyme was treated with tritiated sodium borohydride prior to trypsin digestion. Peptide map comparisons with an apoenzyme tryptic digest and monitoring radioactivity incorporation allowed identification of the pyridoxylated peptide, which was then isolated and sequenced. The coenzyme-binding site is the lysyl residue at position 159. The amino acid sequence of Salmonella transaminase B is 97.4% identical with that of Escherichia coli, differing in only eight amino acid positions. Sequence comparisons of transaminase B to other known aminotransferase sequences revealed limited sequence similarity (24-33%) when conserved amino acid substitutions are allowed and alignments were forced to occur on the coenzyme-binding site. PMID:2669973

  14. Improved therapy-success prediction with GSS estimated from clinical HIV-1 sequences

    PubMed Central

    Pironti, Alejandro; Pfeifer, Nico; Kaiser, Rolf; Walter, Hauke; Lengauer, Thomas

    2014-01-01

    Introduction Rules-based HIV-1 drug-resistance interpretation (DRI) systems disregard many amino-acid positions of the drug's target protein. The aims of this study are (1) the development of a drug-resistance interpretation system that is based on HIV-1 sequences from clinical practice rather than hard-to-get phenotypes, and (2) the assessment of the benefit of taking all available amino-acid positions into account for DRI. Materials and Methods A dataset containing 34,934 therapy-naïve and 30,520 drug-exposed HIV-1 pol sequences with treatment history was extracted from the EuResist database and the Los Alamos National Laboratory database. 2,550 therapy-change-episode baseline sequences (TCEB) were assigned to test set A. Test set B contains 1,084 TCEB from the HIVdb TCE repository. Sequences from patients absent in the test sets were used to train three linear support vector machines to produce scores that predict drug exposure pertaining to each of 20 antiretrovirals: the first one uses the full amino-acid sequences (DEfull), the second one only considers IAS drug-resistance positions (DEonlyIAS), and the third one disregards IAS drug-resistance positions (DEnoIAS). For performance comparison, test sets A and B were evaluated with DEfull, DEnoIAS, DEonlyIAS, geno2pheno[resistance], HIVdb, ANRS, HIV-GRADE, and REGA. Clinically-validated cut-offs were used to convert the continuous output of the first four methods into susceptible-intermediate-resistant (SIR) predictions. With each method, a genetic susceptibility score (GSS) was calculated for each therapy episode in each test set by converting the SIR prediction for its compounds to integer: S=2, I=1, and R=0. The GSS were used to predict therapy success as defined by the EuResist standard datum definition. Statistical significance was assessed using a Wilcoxon signed-rank test. Results A comparison of the therapy-success prediction performances among the different interpretation systems for test set A can be
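
    The GSS arithmetic described in the Materials and Methods above can be written out as a small worked example. The integer mapping (S=2, I=1, R=0) is taken from the abstract as printed; summing the per-compound values over the regimen is the usual reading of a GSS but is an assumption here, and the drug labels and function name are hypothetical placeholders.

```python
# Integer coding of SIR predictions, as stated in the abstract.
SIR_SCORE = {"S": 2, "I": 1, "R": 0}


def genetic_susceptibility_score(sir_by_drug, regimen):
    """Sum the integer-coded SIR predictions over the compounds of a regimen.

    sir_by_drug: mapping from drug label to "S", "I" or "R"
    regimen: iterable of drug labels used in the therapy episode
    """
    return sum(SIR_SCORE[sir_by_drug[drug]] for drug in regimen)


# Hypothetical therapy episode with three compounds.
predictions = {"drug_a": "S", "drug_b": "I", "drug_c": "R", "drug_d": "S"}
episode_gss = genetic_susceptibility_score(
    predictions, ["drug_a", "drug_b", "drug_c"]
)
```

    In this example the episode scores 2 + 1 + 0 = 3 out of a possible 6; a higher score means the interpretation system considers more of the regimen to be active.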

  15. QGRS Mapper: a web-based server for predicting G-quadruplexes in nucleotide sequences

    PubMed Central

    Kikin, Oleg; D'Antonio, Lawrence; Bagga, Paramjeet S

    2006-01-01

    The quadruplex structures formed by guanine-rich nucleic acid sequences have received significant attention recently because of growing evidence for their role in important biological processes and as therapeutic targets. G-quadruplex DNA has been suggested to regulate DNA replication and may control cellular proliferation. Sequences capable of forming G-quadruplexes in the RNA have been shown to play significant roles in regulation of polyadenylation and splicing events in mammalian transcripts. Whether quadruplex structure directly plays a role in regulating RNA processing requires investigation. Computational approaches to study G-quadruplexes allow detailed analysis of mammalian genomes. There are no known easily accessible user-friendly tools that can compute G-quadruplexes in the nucleotide sequences. We have developed a web-based server, QGRS Mapper, that predicts quadruplex forming G-rich sequences (QGRS) in nucleotide sequences. It is a user-friendly application that provides many options for defining and studying G-quadruplexes. It performs analysis of the user provided genomic sequences, e.g. promoter and telomeric regions, as well as RNA sequences. It is also useful for predicting G-quadruplex structures in oligonucleotides. The program provides options to search and retrieve desired gene/nucleotide sequence entries from NCBI databases for mapping G-quadruplexes in the context of RNA processing sites. This feature is very useful for investigating the functional relevance of G-quadruplex structure, in particular its role in regulating the gene expression by alternative processing. In addition to providing data on composition and locations of QGRS relative to the processing sites in the pre-mRNA sequence, QGRS Mapper features interactive graphic representation of the data. The user can also use the graphics module to visualize QGRS distribution patterns among all the alternative RNA products of a gene simultaneously on a single screen. QGRS Mapper can be

  16. Amino acid sequence of bovine heart coupling factor 6.

    PubMed Central

    Fang, J K; Jacobs, J W; Kanner, B I; Racker, E; Bradshaw, R A

    1984-01-01

    The amino acid sequence of bovine heart mitochondrial coupling factor 6 (F6) has been determined by automated Edman degradation of the whole protein and derived peptides. Preparations based on heat precipitation and ethanol extraction showed allotypic variation at three positions, while material further purified by HPLC yielded only one sequence that also differed by a Phe-Thr replacement at residue 62. The mature protein contains 76 amino acids with a calculated molecular weight of 9006 and a pI of approximately 5, in good agreement with experimentally measured values. The charged amino acids are mainly clustered at the termini and in one section in the middle; these three polar segments are separated by two segments relatively rich in nonpolar residues. Chou-Fasman analysis suggests three stretches of alpha-helix coinciding with (or lying within) the high-charge-density sequences, with a single beta-turn at the first polar-nonpolar junction. Comparison of the F6 sequence with those of other proteins did not reveal any homologous structures. PMID:6149548

  17. Sequences Of Amino Acids For Human Serum Albumin

    NASA Technical Reports Server (NTRS)

    Carter, Daniel C.

    1992-01-01

    Sequences of amino acids defined for use in making polypeptides one-third to one-sixth as large as parent human serum albumin molecule. Smaller, chemically stable peptides have diverse applications including service as artificial human serum and as active components of biosensors and chromatographic matrices. In applications involving production of artificial sera from new sequences, little or no concern about viral contaminants. Smaller genetically engineered polypeptides more easily expressed and produced in large quantities, making commercial isolation and production more feasible and profitable.

  18. Prediction and Validation of Native and Engineered Cas9 Guide Sequences.

    PubMed

    Briner, Alexandra E; Henriksen, Emily D; Barrangou, Rodolphe

    2016-01-01

    Cas9-based technologies rely on native elements of Type II CRISPR-Cas bacterial immune systems, including the trans-activating CRISPR RNA (tracrRNA), CRISPR RNA (crRNA), Cas9 protein, and protospacer-adjacent motif (PAM). The tracrRNA and crRNA form an RNA duplex that guides the Cas9 endonuclease to complementary nucleic acid sequences. Mechanistically, Cas9 initiates interactions by binding to the target PAM sequence and interrogating the target DNA in a 3'-to-5' manner. Complementarity between the guide RNA and the target DNA is key. In natural systems, precise cleavage occurs when the target DNA sequence contains a PAM flanking a sequence homologous to the crRNA spacer sequence. Currently, the majority of commercial Cas9-based genome-editing tools are derived from the Type II CRISPR-Cas system of Streptococcus pyogenes. However, a diverse set of Type II CRISPR-Cas systems exist in nature that are potentially valuable for genome engineering applications. Exploitation of these systems requires prediction and validation of both native and engineered dual and single guide RNAs to drive Cas9 functionality. Here, we discuss how to identify the elements of these immune systems to develop next-generation Cas9-based genome-editing tools. We first discuss how to predict tracrRNA sequences and suggest a method for designing single guide RNAs containing only critical structural modules. We then outline how to predict the PAM sequence, which is crucial for determining potential targets for Cas9. Finally, validation of the system elements through transcriptome analysis and interference assays is essential for developing next-generation Cas9-based genome-editing tools. PMID:27371591
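
    The idea of a PAM flanking a spacer-matching sequence can be illustrated with a small search routine. The sketch below is not the authors' prediction method: it assumes the commonly cited S. pyogenes PAM (5'-NGG-3', immediately 3' of the protospacer) and a 20-nt protospacer, neither of which is stated in the abstract, and it scans only the strand it is given. The function names and the toy sequence are hypothetical.

```python
import re


def pam_to_regex(pam):
    """Translate a PAM written with the wildcard N into a regex fragment."""
    return "".join("[ACGT]" if base == "N" else base for base in pam.upper())


def find_protospacers(dna, spacer_len=20, pam="NGG"):
    """Return (start, protospacer, pam) for every site on the given strand
    where a spacer_len-nt stretch is immediately followed by the PAM.

    Assumption: PAM sits directly 3' of the protospacer (SpCas9-style).
    """
    dna = dna.upper()
    # Zero-width lookahead so that overlapping candidate sites are all reported.
    pattern = re.compile(rf"(?=([ACGT]{{{spacer_len}}})({pam_to_regex(pam)}))")
    return [(m.start(), m.group(1), m.group(2)) for m in pattern.finditer(dna)]


# Toy sequence, made up for illustration only.
toy = "CCC" + "ACGT" * 5 + "AGG" + "TT"
for start, protospacer, pam_seq in find_protospacers(toy):
    print(start, protospacer, pam_seq)
```

    For a different Type II system, the predicted PAM would be passed in through the pam argument; scanning the reverse complement and scoring mismatches against a given spacer are left out to keep the sketch short.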

  19. IsoFinder: computational prediction of isochores in genome sequences.

    PubMed

    Oliver, José L; Carpena, Pedro; Hackenberg, Michael; Bernaola-Galván, Pedro

    2004-07-01

    Isochores are long genome segments homogeneous in G+C. Here, we describe an algorithm (IsoFinder) running on the web (http://bioinfo2.ugr.es/IsoF/isofinder.html) able to predict isochores at the sequence level. We move a sliding pointer from left to right along the DNA sequence. At each position of the pointer, we compute the mean G+C values to the left and to the right of the pointer. We then determine the position of the pointer for which the difference between left and right mean values (as measured by the t-statistic) reaches its maximum. Next, we determine the statistical significance of this potential cutting point, after filtering out short-scale heterogeneities below 3 kb by applying a coarse-graining technique. Finally, the program checks whether this significance exceeds a probability threshold. If so, the sequence is cut at this point into two subsequences; otherwise, the sequence remains undivided. The procedure continues recursively for each of the two resulting subsequences created by each cut. This leads to the decomposition of a chromosome sequence into long homogeneous genome regions (LHGRs) with well-defined mean G+C contents, each significantly different from the G+C contents of the adjacent LHGRs. Most LHGRs can be identified with Bernardi's isochores, given their correlation with biological features such as gene density, SINE and LINE (short, long interspersed repetitive elements) densities, recombination rate or single nucleotide polymorphism variability. The resulting isochore maps are available at our web site (http://bioinfo2.ugr.es/isochores/), and also at the UCSC Genome Browser (http://genome.cse.ucsc.edu/). PMID:15215396
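
    The recursive cutting procedure described above can be sketched roughly as follows. This is a simplified illustration, not the IsoFinder implementation: the significance test and the 3 kb coarse-graining step are replaced by a plain cutoff on the t-statistic (t_min, a hypothetical parameter), and each base is scored as 1 for G or C and 0 otherwise.

```python
import math


def best_split(seq, min_side=2):
    """Return (position, t) for the cut that maximizes the two-sample
    t-statistic between left and right mean G+C, or (None, 0.0)."""
    gc = [1 if base in "GC" else 0 for base in seq.upper()]
    n = len(gc)
    if n < 3:
        return None, 0.0
    prefix = [0]
    for v in gc:
        prefix.append(prefix[-1] + v)  # prefix[i] = number of G/C in seq[:i]
    best_pos, best_t = None, 0.0
    for i in range(min_side, n - min_side + 1):
        n1, n2 = i, n - i
        m1 = prefix[i] / n1
        m2 = (prefix[n] - prefix[i]) / n2
        # For 0/1 data the sum of squared deviations is count * mean * (1 - mean).
        ss = n1 * m1 * (1 - m1) + n2 * m2 * (1 - m2)
        se = math.sqrt(ss / (n - 2) * (1 / n1 + 1 / n2))
        if se == 0:
            t = math.inf if m1 != m2 else 0.0
        else:
            t = abs(m1 - m2) / se
        if t > best_t:
            best_pos, best_t = i, t
    return best_pos, best_t


def segment(seq, t_min=5.0, min_side=2, offset=0):
    """Recursively cut seq wherever the best split has t >= t_min.

    Assumption: a fixed t cutoff stands in for the significance test
    and coarse-graining used by the published algorithm.
    """
    pos, t = best_split(seq, min_side)
    if pos is None or t < t_min:
        return [(offset, offset + len(seq))]
    left = segment(seq[:pos], t_min, min_side, offset)
    right = segment(seq[pos:], t_min, min_side, offset + pos)
    return left + right


# Toy sequence: an AT-only half followed by a GC-only half.
print(segment("AT" * 20 + "GC" * 20))
```

    The returned half-open (start, end) intervals play the role of the homogeneous regions; on real chromosomes the cutoff would have to come from a proper significance calculation.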

  20. Prediction of Spontaneous Protein Deamidation from Sequence-Derived Secondary Structure and Intrinsic Disorder

    PubMed Central

    Lorenzo, J. Ramiro; Alonso, Leonardo G.; Sánchez, Ignacio E.

    2015-01-01

    Asparagine residues in proteins undergo spontaneous deamidation, a post-translational modification that may act as a molecular clock for the regulation of protein function and turnover. Asparagine deamidation is modulated by protein local sequence, secondary structure and hydrogen bonding. We present NGOME, an algorithm able to predict non-enzymatic deamidation of internal asparagine residues in proteins in the absence of structural data, using sequence-based predictions of secondary structure and intrinsic disorder. Compared to previous algorithms, NGOME does not require three-dimensional structures yet yields better predictions than available sequence-only methods. Four case studies of specific proteins show how NGOME may help the user identify deamidation-prone asparagine residues, often related to protein gain of function, protein degradation or protein misfolding in pathological processes. A fifth case study applies NGOME at a proteomic scale and unveils a correlation between asparagine deamidation and protein degradation in yeast. NGOME is freely available as a webserver at the National EMBnet node Argentina, URL: http://www.embnet.qb.fcen.uba.ar/ in the subpage “Protein and nucleic acid structure and sequence analysis”. PMID:26674530

  1. Nanopores and nucleic acids: prospects for ultrarapid sequencing

    NASA Technical Reports Server (NTRS)

    Deamer, D. W.; Akeson, M.

    2000-01-01

    DNA and RNA molecules can be detected as they are driven through a nanopore by an applied electric field at rates ranging from several hundred microseconds to a few milliseconds per molecule. The nanopore can rapidly discriminate between pyrimidine and purine segments along a single-stranded nucleic acid molecule. Nanopore detection and characterization of single molecules represents a new method for directly reading information encoded in linear polymers. If single-nucleotide resolution can be achieved, it is possible that nucleic acid sequences can be determined at rates exceeding a thousand bases per second.

  2. Characterizing informative sequence descriptors and predicting binding affinities of heterodimeric protein complexes

    PubMed Central

    2015-01-01

    Background Protein-protein interactions (PPIs) are involved in various biological processes, and the underlying mechanism of the interactions plays a crucial role in therapeutics and protein engineering. Most machine learning approaches have been developed for predicting the binding affinity of protein-protein complexes based on structure and functional information. This work aims to predict the binding affinity of heterodimeric protein complexes from sequences only. Results This work proposes a support vector machine (SVM) based binding affinity classifier, called SVM-BAC, to classify heterodimeric protein complexes based on the prediction of their binding affinity. SVM-BAC identified 14 of 580 sequence descriptors (physicochemical, energetic and conformational properties of the 20 amino acids) to classify 216 heterodimeric protein complexes into low and high binding affinity. SVM-BAC yielded a training accuracy, sensitivity, specificity, AUC and test accuracy of 85.80%, 0.89, 0.83, 0.86 and 83.33%, respectively, better than existing machine learning algorithms. The 14 features and support vector regression were further used to estimate the binding affinities (Pkd) of 200 heterodimeric protein complexes. In a jackknife test, prediction performance was a correlation coefficient of 0.34 and a mean absolute error of 1.4. We further analyze three informative physicochemical properties according to their contribution to prediction performance. Results reveal that the following properties are effective in predicting the binding affinity of heterodimeric protein complexes: apparent partition energy based on buried molar fractions, relations between chemical structure and biological activity in principal component analysis IV, and normalized frequency of beta turn. Conclusions The proposed sequence-based prediction method SVM-BAC uses an optimal feature selection method to identify 14 informative features to classify and predict binding affinity of heterodimeric protein

  3. BOOGIE: Predicting Blood Groups from High Throughput Sequencing Data

    PubMed Central

    Giollo, Manuel; Minervini, Giovanni; Scalzotto, Marta; Leonardi, Emanuela; Ferrari, Carlo; Tosatto, Silvio C. E.

    2015-01-01

    Over the last decade, we have witnessed an incredible growth in the amount of available genotype data due to high throughput sequencing (HTS) techniques. This information may be used to predict phenotypes of medical relevance, and pave the way towards personalized medicine. Blood phenotypes (e.g. ABO and Rh) are a purely genetic trait that has been extensively studied for decades, with currently over thirty known blood groups. Given the public availability of blood group data, it is of interest to predict these phenotypes from HTS data which may translate into more accurate blood typing in clinical practice. Here we propose BOOGIE, a fast predictor for the inference of blood groups from single nucleotide variant (SNV) databases. We focus on the prediction of thirty blood groups ranging from the well known ABO and Rh, to the less studied Junior or Diego. BOOGIE correctly predicted the blood group with 94% accuracy for the Personal Genome Project whole genome profiles where good quality SNV annotation was available. Additionally, our tool produces a high quality haplotype phase, which is of interest in the context of ethnicity-specific polymorphisms or traits. The versatility and simplicity of the analysis make it easily interpretable and allow easy extension of the protocol towards other phenotypes. BOOGIE can be downloaded from URL http://protein.bio.unipd.it/download/. PMID:25893845

  4. Amino acid sequence of the Amur tiger prion protein.

    PubMed

    Wu, Changde; Pang, Wanyong; Zhao, Deming

    2006-10-01

    Prion diseases are fatal neurodegenerative disorders in humans and animals associated with conformational conversion of a cellular prion protein (PrP(C)) into the pathologic isoform (PrP(Sc)). Various data indicate that polymorphisms within the open reading frame (ORF) of PrP are associated with susceptibility and control the species barrier in prion diseases. In the present study, partial Prnp from 25 Amur tigers (tPrnp) were cloned and screened for polymorphisms. Four single nucleotide polymorphisms (T423C, A501G, C511A, A610G) were found; the C511A and A610G nucleotide substitutions resulted in the amino acid changes Lysine171Glutamine and Alanine204Threonine, respectively. The tPrnp amino acid sequence is similar to those of house cat (Felis catus) and sheep, but differs significantly from two other cat Prnp sequences that were previously deposited in GenBank. PMID:16780982

  5. Enhancement of accuracy and efficiency for RNA secondary structure prediction by sequence segmentation and MapReduce

    PubMed Central

    2013-01-01

    Background Ribonucleic acid (RNA) molecules play important roles in many biological processes including gene expression and regulation. Their secondary structures are crucial for the RNA functionality, and the prediction of the secondary structures is widely studied. Our previous research shows that cutting long sequences into shorter chunks, predicting secondary structures of the chunks independently using thermodynamic methods, and reconstructing the entire secondary structure from the predicted chunk structures can yield better accuracy than predicting the secondary structure using the RNA sequence as a whole. The chunking, prediction, and reconstruction processes can use different methods and parameters, some of which produce more accurate predictions than others. In this paper, we study the prediction accuracy and efficiency of three different chunking methods using seven popular secondary structure prediction programs that apply to two datasets of RNA with known secondary structures, which include both pseudoknotted and non-pseudoknotted sequences, as well as a family of viral genome RNAs whose structures have not been predicted before. Our modularized MapReduce framework based on Hadoop allows us to study the problem in a parallel and robust environment. Results On average, the maximum accuracy retention values are larger than one for our chunking methods and the seven prediction programs over 50 non-pseudoknotted sequences, meaning that the secondary structure predicted using chunking is more similar to the real structure than the secondary structure predicted by using the whole sequence. We observe similar results for the 23 pseudoknotted sequences, except for the NUPACK program using the centered chunking method. The performance analysis for 14 long RNA sequences from the Nodaviridae virus family outlines how the coarse-grained mapping of chunking and predictions in the MapReduce framework exhibits shorter turnaround times for short RNA sequences. 
However

  6. Structure prediction and analysis of neuraminidase sequence variants.

    PubMed

    Thayer, Kelly M

    2016-07-01

    Analyzing protein structure has become an integral aspect of understanding systems of biochemical import. The laboratory experiment endeavors to introduce protein folding to ascertain structures of proteins for which the structure is unavailable, as well as to critically evaluate the quality of the prediction obtained. The model system used is the highly mutable influenza virus protein neuraminidase, which is the key target in the development of therapeutics. In light of recent pandemics, understanding how mutations confer drug resistance, which translates at the molecular level to understanding how different sequence variants differ, constitutes an area of great interest because of the ramifications in public health. This lab targets upper level undergraduate biochemistry students, and aims to introduce tools to be used to explore protein folding and protein visualization in the context of the neuraminidase case study. Students proceed to critically evaluate the folded models by comparison with crystallographic structures. When validity is established, they fold a neuraminidase sequence for which a structure is not available. Through structural alignment and visual inspection of the 150 loop, students gain molecular insight into two possible conformations of the protein, which are actively being studied. Folding the third chosen sequence mimics a true research environment in allowing students to generate a structure from a sequence for which a structure was not previously available, and to assess whether their particular variant has an open or closed loop. From this vantage, they are then challenged to speculate about the connection between loop conformation and drug susceptibility. © 2016 by The International Union of Biochemistry and Molecular Biology, 44(4):361-376, 2016. PMID:26900942

  7. Using next generation transcriptome sequencing to predict an ectomycorrhizal metabolome

    PubMed Central

    2011-01-01

    Background Mycorrhizae, symbiotic interactions between soil fungi and tree roots, are ubiquitous in terrestrial ecosystems. The fungi contribute phosphorus, nitrogen and mobilized nutrients from organic matter in the soil and in return the fungus receives photosynthetically-derived carbohydrates. This union of plant and fungal metabolisms is the mycorrhizal metabolome. Understanding this symbiotic relationship at a molecular level provides important contributions to the understanding of forest ecosystems and global carbon cycling. Results We generated next generation short-read transcriptomic sequencing data from fully-formed ectomycorrhizae between Laccaria bicolor and aspen (Populus tremuloides) roots. The transcriptomic data was used to identify statistically significantly expressed gene models using a bootstrap-style approach, and these expressed genes were mapped to specific metabolic pathways. Integration of expressed genes that code for metabolic enzymes and the set of expressed membrane transporters generates a predictive model of the ectomycorrhizal metabolome. The generated model of mycorrhizal metabolome predicts that the specific compounds glycine, glutamate, and allantoin are synthesized by L. bicolor and that these compounds or their metabolites may be used for the benefit of aspen in exchange for the photosynthetically-derived sugars fructose and glucose. Conclusions The analysis illustrates an approach to generate testable biological hypotheses to investigate the complex molecular interactions that drive ectomycorrhizal symbiosis. These models are consistent with experimental environmental data and provide insight into the molecular exchange processes for organisms in this complex ecosystem. The method used here for predicting metabolomic models of mycorrhizal systems from deep RNA sequencing data can be generalized and is broadly applicable to transcriptomic data derived from complex systems. PMID:21569493

  8. Using next generation transcriptome sequencing to predict an ectomycorrhizal metabolome.

    SciTech Connect

    Larsen, P. E.; Sreedasyam, A.; Trivedi, G; Podila, G. K.; Cseke, L. J.; Collart, F. R.

    2011-05-13

    Mycorrhizae, symbiotic interactions between soil fungi and tree roots, are ubiquitous in terrestrial ecosystems. The fungi contribute phosphorus, nitrogen and mobilized nutrients from organic matter in the soil and in return the fungus receives photosynthetically-derived carbohydrates. This union of plant and fungal metabolisms is the mycorrhizal metabolome. Understanding this symbiotic relationship at a molecular level provides important contributions to the understanding of forest ecosystems and global carbon cycling. We generated next generation short-read transcriptomic sequencing data from fully-formed ectomycorrhizae between Laccaria bicolor and aspen (Populus tremuloides) roots. The transcriptomic data was used to identify statistically significantly expressed gene models using a bootstrap-style approach, and these expressed genes were mapped to specific metabolic pathways. Integration of expressed genes that code for metabolic enzymes and the set of expressed membrane transporters generates a predictive model of the ectomycorrhizal metabolome. The generated model of mycorrhizal metabolome predicts that the specific compounds glycine, glutamate, and allantoin are synthesized by L. bicolor and that these compounds or their metabolites may be used for the benefit of aspen in exchange for the photosynthetically-derived sugars fructose and glucose. The analysis illustrates an approach to generate testable biological hypotheses to investigate the complex molecular interactions that drive ectomycorrhizal symbiosis. These models are consistent with experimental environmental data and provide insight into the molecular exchange processes for organisms in this complex ecosystem. The method used here for predicting metabolomic models of mycorrhizal systems from deep RNA sequencing data can be generalized and is broadly applicable to transcriptomic data derived from complex systems.

  9. Predictions of diagenetic reactions in the presence of organic acids

    NASA Astrophysics Data System (ADS)

    Harrison, Wendy J.; Thyne, Geoffrey D.

    1992-02-01

    Stability constants have been estimated for cation complexes with anions of monofunctional and difunctional acids (combinations of Ca, Mg, Fe, Al, Sr, Mn, U, Th, Pb, Cu, Zn with formate, acetate, propionate, oxalate, malonate, succinate, and salicylate) between 0 and 200°C. Difunctional acid anions form much more stable complexes than monofunctional acid anions with aluminum; the importance of the aluminum-acetate complex is relatively minor in comparison to aluminum oxalate and malonate complexes. Divalent metal cations such as Mg, Ca, and Fe form more stable complexes with acetate than with difunctional acid anions. Aluminum-oxalate can dominate the species distribution of aluminum under acidic pH conditions, whereas the divalent cation-acetate and oxalate complexes rarely account for more than 60% of the total dissolved cation, and then only in more alkaline waters. Mineral thermodynamic affinities were calculated using the reaction path model EQ3/6 for waters having variable organic acid anion (OAA) contents under conditions representative of those found during normal burial diagenesis. The following scenarios are possible: (1) K-feldspar and albite are stable and anorthite dissolves; (2) all feldspars are stable; (3) carbonates can be very unstable to slightly unstable, but never increase in stability. Organic acid anions are ineffective at neutral to alkaline pH in modifying stabilities of aluminosilicate minerals, whereas the anions are variably effective under a wide range of pH in modifying carbonate mineral stabilities. Reaction path calculations demonstrate that the sequence of mineral reactions occurring in an arkosic sandstone-fluid system is only slightly modified by the presence of OAA.
A spectrum of possible sandstone alteration mineralogies can be obtained depending on the selected boundary conditions: EQ3/6 predictions include quartz overgrowth, calcite replacement of plagioclase, albitization of plagioclase, and the formation of porosity-occluding calcite

  10. Mfold web server for nucleic acid folding and hybridization prediction

    PubMed Central

    Zuker, Michael

    2003-01-01

    The abbreviated name, ‘mfold web server’, describes a number of closely related software applications available on the World Wide Web (WWW) for the prediction of the secondary structure of single stranded nucleic acids. The objective of this web server is to provide easy access to RNA and DNA folding and hybridization software to the scientific community at large. By making use of universally available web GUIs (Graphical User Interfaces), the server circumvents the problem of portability of this software. Detailed output, in the form of structure plots with or without reliability information, single strand frequency plots and ‘energy dot plots’, are available for the folding of single sequences. A variety of ‘bulk’ servers give less information, but in a shorter time and for up to hundreds of sequences at once. The portal for the mfold web server is http://www.bioinfo.rpi.edu/applications/mfold. This URL will be referred to as ‘MFOLDROOT’. PMID:12824337

  11. Improving HIV coreceptor usage prediction in the clinic using hints from next-generation sequencing data

    PubMed Central

    Pfeifer, Nico; Lengauer, Thomas

    2012-01-01

    Motivation: Due to the high mutation rate of human immunodeficiency virus (HIV), drug-resistant variants emerge frequently. Therefore, researchers are constantly searching for new ways to attack the virus. One new class of anti-HIV drugs is the class of coreceptor antagonists that block cell entry by occupying a coreceptor on CD4 cells. This type of drug affects only the subset of HIV variants that use the inhibited coreceptor. A good prediction of whether the viral population inside a patient is susceptible to the treatment is hence very important for therapy decisions and a prerequisite to administering the respective drug. The first prediction models were based on data from Sanger sequencing of the V3 loop of HIV. Recently, a method based on next-generation sequencing (NGS) data was introduced that predicts labels for each read separately and decides on the patient label through a percentage threshold for the resistant viral minority. Results: We model the prediction problem on the patient level, taking the information of all reads from NGS data jointly into account. This enables us to improve prediction performance for NGS data, but we can also use the trained model to improve predictions based on Sanger sequencing data. Laboratories without NGS capabilities can therefore also benefit from the improvements. Furthermore, we show which amino acids at which position are important for prediction success, giving clues on how the interaction mechanism between the V3 loop and the particular coreceptors might be influenced. Availability: A webserver is available at http://coreceptor.bioinf.mpi-inf.mpg.de. Contact: nico.pfeifer@mpi-inf.mpg.de PMID:22962486
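
    The earlier per-read approach mentioned in the abstract (label each read, then apply a percentage threshold for the resistant minority) can be sketched in a few lines. This illustrates only that baseline, not the authors' patient-level model; the function name and the 2% default cutoff are hypothetical, and the per-read labels are assumed to come from some separate read-level predictor.

```python
def patient_label(read_is_resistant, min_fraction=0.02):
    """Call a patient sample resistant when the fraction of reads labeled
    resistant reaches min_fraction (a hypothetical cutoff).

    read_is_resistant: iterable of booleans, one per sequencing read.
    Returns (is_resistant, fraction_of_resistant_reads).
    """
    reads = list(read_is_resistant)
    if not reads:
        raise ValueError("at least one read is required")
    fraction = sum(reads) / len(reads)
    return fraction >= min_fraction, fraction


# Toy input: 3 of 100 reads were labeled resistant by a per-read predictor.
print(patient_label([True] * 3 + [False] * 97))
```

    Because each read is judged in isolation, a rule of this kind cannot pool weak evidence across reads, which is the limitation the patient-level model is meant to address.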

  12. Quantum-Sequencing: Biophysics of quantum tunneling through nucleic acids

    NASA Astrophysics Data System (ADS)

    Casamada Ribot, Josep; Chatterjee, Anushree; Nagpal, Prashant

    2014-03-01

    Tunneling microscopy and spectroscopy have been used extensively in the physical surface sciences to study quantum tunneling, to measure the electronic local density of states of nanomaterials, and to characterize adsorbed species. Quantum-Sequencing (Q-Seq) is a new method based on tunneling microscopy for electronic sequencing of single molecules of nucleic acids. A major goal of third-generation sequencing technologies is to develop a fast, reliable, enzyme-free single-molecule sequencing method. Here, we present the unique "electronic fingerprints" for all nucleotides on DNA and RNA using Q-Seq, along with their intrinsic biophysical parameters. We have analyzed tunneling spectra for the nucleotides at different pH conditions and analyzed the HOMO, LUMO and energy gap for all of them. In addition, we show a number of biophysical parameters to further characterize all nucleobases (electron and hole transition voltage and energy barriers). These results highlight the robustness of Q-Seq as a technique for next-generation sequencing.

  13. Correlation between fibroin amino acid sequence and physical silk properties.

    PubMed

    Fedic, Robert; Zurovec, Michal; Sehnal, Frantisek

    2003-09-12

    The fiber properties of lepidopteran silk depend on the amino acid repeats that interact during H-fibroin polymerization. The aim of our research was to relate repeat composition to insect biology and fiber strength. Representative regions of the H-fibroin genes were sequenced and analyzed in three pyralid species: wax moth (Galleria mellonella), European flour moth (Ephestia kuehniella), and Indian meal moth (Plodia interpunctella). The amino acid repeats are species-specific, evidently a diversification of an ancestral region of 43 residues, and include three types of regularly dispersed motifs: modifications of GSSAASAA sequence, stretches of tripeptides GXZ where X and Z represent bulky residues, and sequences similar to PVIVIEE. No concatenations of GX dipeptide or alanine, which are typical for Bombyx silkworms and Antheraea silk moths, respectively, were found. Despite different repeat structure, the silks of G. mellonella and E. kuehniella exhibit similar tensile strength as the Bombyx and Antheraea silks. We suggest that in these latter two species, variations in the repeat length obstruct repeat alignment, but sufficiently long stretches of iterated residues get superposed to interact. In the pyralid H-fibroins, interactions of the widely separated and diverse motifs depend on the precision of repeat matching; silk is strong in G. mellonella and E. kuehniella, with 2-3 types of long homogeneous repeats, and nearly 10 times weaker in P. interpunctella, with seven types of shorter erratic repeats. The high proportion of large amino acids in the H-fibroin of pyralids has probably evolved in connection with the spinning habit of caterpillars that live in protective silk tubes and spin continuously, enlarging the tubes on one end and partly devouring the other one. The silk serves as a depot of energetically rich and essential amino acids that may be scarce in the diet. PMID:12816957

  14. On the use of sequence homologies to predict protein structure: identical pentapeptides can have completely different conformations.

    PubMed Central

    Kabsch, W; Sander, C

    1984-01-01

    The search for amino acid sequence homologies can be a powerful tool for predicting protein structure. Discovered sequence homologies are currently used in predicting the function of oncogene proteins. To sharpen this tool, we investigated the structural significance of short sequence homologies by searching proteins of known three-dimensional structure for subsequence identities. In 62 proteins with 10,000 residues, we found that the longest isolated homologies between unrelated proteins are five residues long. In 6 (out of 25) cases we saw surprising structural adaptability: the same five residues are part of an alpha-helix in one protein and part of a beta-strand in another protein. These examples show quantitatively that pentapeptide structure within a protein is strongly dependent on sequence context, a fact essentially ignored in most protein structure prediction methods: just considering the local sequence of five residues is not sufficient to predict correctly the local conformation (secondary structure). Cooperativity of length six or longer must be taken into account. Also, we are warned that in the growing practice of comparing a new protein sequence with a data base of known sequences, finding an identical pentapeptide sequence between two proteins is not a significant indication of structural similarity or of evolutionary kinship. PMID:6422466
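
    The kind of search described above, looking for identical pentapeptides in different proteins and comparing their local conformation, can be sketched as follows. This is an illustrative reconstruction under assumptions, not the authors' procedure: it does not check that the proteins are unrelated or that the match is isolated, and the two toy entries (names, sequences, and H/E/C secondary-structure strings) are invented.

```python
from collections import defaultdict


def shared_kmers(proteins, k=5):
    """Find identical k-residue segments occurring in different proteins.

    proteins: dict mapping name -> (sequence, secondary_structure), where
    both strings have the same length (e.g. H = helix, E = strand, C = coil).
    Returns a list of (kmer, (name_a, pos_a, ss_a), (name_b, pos_b, ss_b)).
    """
    index = defaultdict(list)
    for name, (seq, ss) in proteins.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i, ss[i:i + k]))
    hits = []
    for kmer, occurrences in index.items():
        for a in range(len(occurrences)):
            for b in range(a + 1, len(occurrences)):
                if occurrences[a][0] != occurrences[b][0]:
                    hits.append((kmer, occurrences[a], occurrences[b]))
    return hits


# Invented toy data: the same pentapeptide is helical in P1 and a strand in P2.
toy = {
    "P1": ("MKVLAAGEW", "CHHHHHHHC"),
    "P2": ("TTVLAAGSS", "CEEEEEEEC"),
}
for kmer, first, second in shared_kmers(toy):
    print(kmer, first, second, "same conformation:", first[2] == second[2])
```

    Counting how often the two secondary-structure strings disagree across all hits gives the kind of tally the abstract reports (6 of 25 cases with helix in one protein and strand in the other).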

  15. Amino acid sequence of the nonsecretory ribonuclease of human urine.

    PubMed

    Beintema, J J; Hofsteenge, J; Iwama, M; Morita, T; Ohgi, K; Irie, M; Sugiyama, R H; Schieven, G L; Dekker, C A; Glitz, D G

    1988-06-14

    The amino acid sequence of a nonsecretory ribonuclease isolated from human urine was determined except for the identity of the residue at position 7. Sequence information indicates that the ribonucleases of human liver and spleen and an eosinophil-derived neurotoxin are identical or very closely related gene products. The sequence is identical at about 30% of the amino acid positions with those of all of the secreted mammalian ribonucleases for which information is available. Identical residues include active-site residues histidine-12, histidine-119, and lysine-41, other residues known to be important for substrate binding and catalytic activity, and all eight half-cystine residues common to these enzymes. Major differences include a deletion of six residues in the (so-called) S-peptide loop, insertions of two, and nine residues, respectively, in three other external loops of the molecule, and an addition of three residues at the amino terminus. The sequence shows the human nonsecretory ribonuclease to belong to the same ribonuclease superfamily as the mammalian secretory ribonucleases, turtle pancreatic ribonuclease, and human angiogenin. Sequence data suggest that a gene duplication occurred in an ancient vertebrate ancestor; one branch led to the nonsecretory ribonuclease, while the other branch led to a second duplication, with one line leading to the secretory ribonucleases (in mammals) and the second line leading to pancreatic ribonuclease in turtle and an angiogenic factor in mammals (human angiogenin). The nonsecretory ribonuclease has five short carbohydrate chains attached via asparagine residues at the surface of the molecule; these chains may have been shortened by exoglycosidase action.(ABSTRACT TRUNCATED AT 250 WORDS) PMID:3166997

  16. Innovations in host and microbial sialic acid biosynthesis revealed by phylogenomic prediction of nonulosonic acid structure

    PubMed Central

    Lewis, Amanda L.; Desa, Nolan; Hansen, Elizabeth E.; Knirel, Yuriy A.; Gordon, Jeffrey I.; Gagneux, Pascal; Nizet, Victor; Varki, Ajit

    2009-01-01

    Sialic acids (Sias) are nonulosonic acid (NulO) sugars prominently displayed on vertebrate cells and occasionally mimicked by bacterial pathogens using homologous biosynthetic pathways. It has been suggested that Sias were an animal innovation and later emerged in pathogens by convergent evolution or horizontal gene transfer. To better illuminate the evolutionary processes underlying the phenomenon of Sia molecular mimicry, we performed phylogenomic analyses of biosynthetic pathways for Sias and related higher sugars derived from 5,7-diamino-3,5,7,9-tetradeoxynon-2-ulosonic acids. Examination of ≈1,000 sequenced microbial genomes indicated that such biosynthetic pathways are far more widely distributed than previously realized. Phylogenetic analysis, validated by targeted biochemistry, was used to predict NulO types (i.e., neuraminic, legionaminic, or pseudaminic acids) expressed by various organisms. This approach uncovered previously unreported occurrences of Sia pathways in pathogenic and symbiotic bacteria and identified at least one instance in which a human archaeal symbiont tentatively reported to express Sias in fact expressed the related pseudaminic acid structure. Evaluation of targeted phylogenies and protein domain organization revealed that the “unique” Sia biosynthetic pathway of animals was instead a much more ancient innovation. Pathway phylogenies suggest that bacterial pathogens may have acquired Sia expression via adaptation of pathways for legionaminic acid biosynthesis, one of at least 3 evolutionary paths for de novo Sia synthesis. Together, these data indicate that some of the long-standing paradigms in Sia biology should be reconsidered in a wider evolutionary context of the extended family of NulO sugars. PMID:19666579

  17. Molecular cloning and amino acid sequence of human 5-lipoxygenase

    SciTech Connect

    Matsumoto, T.; Funk, C.D.; Radmark, O.; Hoeoeg, J.O.; Joernvall, H.; Samuelsson, B.

    1988-01-01

    5-Lipoxygenase (EC 1.13.11.34), a Ca²⁺- and ATP-requiring enzyme, catalyzes the first two steps in the biosynthesis of the peptidoleukotrienes and the chemotactic factor leukotriene B₄. A cDNA clone corresponding to 5-lipoxygenase was isolated from a human lung lambda gt11 expression library by immunoscreening with a polyclonal antibody. Additional clones from a human placenta lambda gt11 cDNA library were obtained by plaque hybridization with the ³²P-labeled lung cDNA clone. Sequence data obtained from several overlapping clones indicate that the composite DNAs contain the complete coding region for the enzyme. From the deduced primary structure, 5-lipoxygenase encodes a 673 amino acid protein with a calculated molecular weight of 77,839. Direct analysis of the native protein and its proteolytic fragments confirmed the deduced composition, the amino-terminal amino acid sequence, and the structure of many internal segments. 5-Lipoxygenase has no apparent sequence homology with leukotriene A₄ hydrolase or Ca²⁺-binding proteins. RNA blot analysis indicated substantial amounts of an mRNA species of approximately 2700 nucleotides in leukocytes, lung, and placenta.

  18. Nucleic acid sequence detection using multiplexed oligonucleotide PCR

    DOEpatents

    Nolan, John P.; White, P. Scott

    2006-12-26

    Methods for rapidly detecting single or multiple sequence alleles in a sample nucleic acid are described. Provided are all of the oligonucleotide pairs capable of annealing specifically to a target allele and discriminating among possible sequences thereof, and ligating to each other to form an oligonucleotide complex when a particular sequence feature is present (or, alternatively, absent) in the sample nucleic acid. The design of each oligonucleotide pair permits the subsequent high-level PCR amplification of a specific amplicon when the oligonucleotide complex is formed, but not when the oligonucleotide complex is not formed. The presence or absence of the specific amplicon is used to detect the allele. Detection of the specific amplicon may be achieved using a variety of methods well known in the art, including without limitation, oligonucleotide capture onto DNA chips or microarrays, oligonucleotide capture onto beads or microspheres, electrophoresis, and mass spectrometry. Various labels and address-capture tags may be employed in the amplicon detection step of multiplexed assays, as further described herein.

  19. The amino acid sequence of rabbit muscle triose phosphate isomerase.

    PubMed Central

    Corran, P H; Waley, S G

    1975-01-01

    The amino acid sequence of rabbit muscle triose phosphate isomerase was deduced by characterizing peptides that overlap the tryptic peptides. Thiol groups were modified by oxidation, carboxymethylation or aminoethylation. About 50 peptides that provided information about overlaps were isolated; the peptides were mostly characterized by their compositions and N-terminal residues. The peptide chains contain 248 amino acid residues, and no evidence for dissimilarity of the two subunits that comprise the native enzyme was found. The sequence of the rabbit muscle enzyme may be compared with that of the coelacanth enzyme (Kolb et al., 1974): 84% of the residues are in identical positions. Similarly, comparison of the sequence with that inferred for the chicken enzyme (Furth et al., 1974) shows that 87% of the residues are in identical positions. Limited though these comparisons are, they suggest that triose phosphate isomerase has one of the lowest rates of evolutionary change. An extended version of the present paper has been deposited as Supplementary Publication SUP 50040 (42 pages) at the British Library (Lending Division) (formerly the National Lending Library for Science and Technology), Boston Spa, Yorks. LS23 7BQ, U.K., from whom copies can be obtained on the terms given in Biochem. J. (1975) 145, 5. PMID:1171682
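
    The "residues in identical positions" figures quoted above are percent-identity values. As a minimal sketch of how such a figure can be computed from two already-aligned sequences: the function name, the gap convention, and the toy fragments below are illustrative assumptions, not the authors' procedure or real enzyme sequences.

```python
def percent_identity(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Percentage of aligned, ungapped positions holding the same residue."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Ignore alignment columns in which either sequence has a gap
    # (an assumed convention; other denominators are also in use).
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


# Hypothetical aligned fragments, used only to exercise the function.
fragment_1 = "APSRKFFVGGNWKMNGRK"
fragment_2 = "APSRKFFVGGNWKMNGKR"
print(round(percent_identity(fragment_1, fragment_2), 1))
```

    The choice of denominator (ungapped columns here) changes the result, so it should be stated whenever identities from different studies are compared.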

  20. The amino acid sequence of chymopapain from Carica papaya.

    PubMed Central

    Watson, D C; Yaguchi, M; Lynn, K R

    1990-01-01

    Chymopapain is a polypeptide of 218 amino acid residues. It has considerable structural similarity with papain and papaya proteinase omega, including conservation of the catalytic site and of the disulphide bonding. Chymopapain is like papaya proteinase omega in carrying four extra residues between papain positions 168 and 169, but differs from both papaya proteinases in the composition of its S2 subsite, as well as in having a second thiol group, Cys-117. Some evidence for the amino acid sequence of chymopapain has been deposited as Supplementary Publication SUP 50153 (12 pages) at the British Library Document Supply Centre, Boston Spa, Wetherby, West Yorkshire LS23 7BQ, U.K., from whom copies may be obtained on the terms indicated in Biochem. J. (1990) 265, 5. The information comprises Supplement Tables 1-4, which contain, in order, amino acid compositions of peptides from tryptic, peptic, CNBr and mild acid cleavages; Supplement Fig. 1, showing re-fractionation of selected peaks from Fig. 2 of the main paper; Supplement Fig. 2, showing cation-exchange chromatography of the earliest-eluted peak of Fig. 3 of the main paper; Supplement Fig. 3, showing reverse-phase h.p.l.c. of the later-eluted peak from Fig. 3 of the main paper; and Supplement Fig. 4, showing the separation of peptides after mild acid hydrolysis of CNBr-cleavage fragment CB3. PMID:2106878

  1. The amino acid sequence of chymopapain from Carica papaya.

    PubMed

    Watson, D C; Yaguchi, M; Lynn, K R

    1990-02-15

    Chymopapain is a polypeptide of 218 amino acid residues. It has considerable structural similarity with papain and papaya proteinase omega, including conservation of the catalytic site and of the disulphide bonding. Chymopapain is like papaya proteinase omega in carrying four extra residues between papain positions 168 and 169, but differs from both papaya proteinases in the composition of its S2 subsite, as well as in having a second thiol group, Cys-117. Some evidence for the amino acid sequence of chymopapain has been deposited as Supplementary Publication SUP 50153 (12 pages) at the British Library Document Supply Centre, Boston Spa, Wetherby, West Yorkshire LS23 7BQ, U.K., from whom copies may be obtained on the terms indicated in Biochem. J. (1990) 265, 5. The information comprises Supplement Tables 1-4, which contain, in order, amino acid compositions of peptides from tryptic, peptic, CNBr and mild acid cleavages; Supplement Fig. 1, showing re-fractionation of selected peaks from Fig. 2 of the main paper; Supplement Fig. 2, showing cation-exchange chromatography of the earliest-eluted peak of Fig. 3 of the main paper; Supplement Fig. 3, showing reverse-phase h.p.l.c. of the later-eluted peak from Fig. 3 of the main paper; and Supplement Fig. 4, showing the separation of peptides after mild acid hydrolysis of CNBr-cleavage fragment CB3. PMID:2106878

  2. Accurate prediction for atomic-level protein design and its application in diversifying the near-optimal sequence space.

    PubMed

    Fromer, Menachem; Yanover, Chen

    2009-05-15

    The task of engineering a protein to assume a target three-dimensional structure is known as protein design. Computational search algorithms are devised to predict a minimal energy amino acid sequence for a particular structure. In practice, however, an ensemble of low-energy sequences is often sought. Primarily, this is performed because an individual predicted low-energy sequence may not necessarily fold to the target structure because of both inaccuracies in modeling protein energetics and the nonoptimal nature of search algorithms employed. Additionally, some low-energy sequences may be overly stable and thus lack the dynamic flexibility required for biological functionality. Furthermore, the investigation of low-energy sequence ensembles will provide crucial insights into the pseudo-physical energy force fields that have been derived to describe structural energetics for protein design. Significantly, numerous studies have predicted low-energy sequences, which were subsequently synthesized and demonstrated to fold to desired structures. However, the characterization of the sequence space defined by such energy functions as compatible with a target structure has not been performed in full detail. This issue is critical for protein design scientists to successfully continue using these force fields at an ever-increasing pace and scale. In this paper, we present a conceptually novel algorithm that rapidly predicts the set of lowest energy sequences for a given structure. Based on the theory of probabilistic graphical models, it performs efficient inspection and partitioning of the near-optimal sequence space, without making any assumptions of positional independence. We benchmark its performance on a diverse set of relevant protein design examples and show that it consistently yields sequences of lower energy than those derived from state-of-the-art techniques. Thus, we find that previously presented search techniques do not fully depict the low-energy space as

  3. Use of a structural alphabet to find compatible folds for amino acid sequences

    PubMed Central

    Mahajan, Swapnil; de Brevern, Alexandre G; Sanejouand, Yves-Henri; Srinivasan, Narayanaswamy; Offmann, Bernard

    2015-01-01

    The structural annotation of proteins with no detectable homologs of known 3D structure identified using sequence-search methods is a major challenge today. We propose an original method that computes the conditional probabilities for the amino-acid sequence of a protein to fit to known protein 3D structures using a structural alphabet, known as “Protein Blocks” (PBs). PBs constitute a library of 16 local structural prototypes that approximate every part of protein backbone structures. It is used to encode 3D protein structures into 1D PB sequences and to capture sequence to structure relationships. Our method relies on amino acid occurrence matrices, one for each PB, to score global and local threading of query amino acid sequences to protein folds encoded into PB sequences. It does not use any information from residue contacts or sequence-search methods or explicit incorporation of hydrophobic effect. The performance of the method was assessed with independent test datasets derived from SCOP 1.75A. With a Z-score cutoff that achieved 95% specificity (i.e., less than 5% false positives), global and local threading showed sensitivity of 64.1% and 34.2%, respectively. We further tested its performance on 57 difficult CASP10 targets that had no known homologs in PDB: 38 compatible templates were identified by our approach and 66% of these hits yielded correctly predicted structures. This method scales-up well and offers promising perspectives for structural annotations at genomic level. It has been implemented in the form of a web-server that is freely available at http://www.bo-protscience.fr/forsa. PMID:25297700

  4. Use of a structural alphabet to find compatible folds for amino acid sequences.

    PubMed

    Mahajan, Swapnil; de Brevern, Alexandre G; Sanejouand, Yves-Henri; Srinivasan, Narayanaswamy; Offmann, Bernard

    2015-01-01

    The structural annotation of proteins with no detectable homologs of known 3D structure identified using sequence-search methods is a major challenge today. We propose an original method that computes the conditional probabilities for the amino-acid sequence of a protein to fit to known protein 3D structures using a structural alphabet, known as "Protein Blocks" (PBs). PBs constitute a library of 16 local structural prototypes that approximate every part of protein backbone structures. It is used to encode 3D protein structures into 1D PB sequences and to capture sequence to structure relationships. Our method relies on amino acid occurrence matrices, one for each PB, to score global and local threading of query amino acid sequences to protein folds encoded into PB sequences. It does not use any information from residue contacts or sequence-search methods or explicit incorporation of hydrophobic effect. The performance of the method was assessed with independent test datasets derived from SCOP 1.75A. With a Z-score cutoff that achieved 95% specificity (i.e., less than 5% false positives), global and local threading showed sensitivity of 64.1% and 34.2%, respectively. We further tested its performance on 57 difficult CASP10 targets that had no known homologs in PDB: 38 compatible templates were identified by our approach and 66% of these hits yielded correctly predicted structures. This method scales-up well and offers promising perspectives for structural annotations at genomic level. It has been implemented in the form of a web-server that is freely available at http://www.bo-protscience.fr/forsa. PMID:25297700

  5. Predicting the virulence of MRSA from its genome sequence

    PubMed Central

    Laabei, Maisem; Recker, Mario; Rudkin, Justine K.; Aldeljawi, Mona; Gulay, Zeynep; Sloan, Tim J.; Williams, Paul; Endres, Jennifer L.; Bayles, Kenneth W.; Fey, Paul D.; Yajjala, Vijaya Kumar; Widhelm, Todd; Hawkins, Erica; Lewis, Katie; Parfett, Sara; Scowen, Lucy; Peacock, Sharon J.; Holden, Matthew; Wilson, Daniel; Read, Timothy D.; van den Elsen, Jean; Priest, Nicholas K.; Feil, Edward J.; Hurst, Laurence D.; Josefsson, Elisabet; Massey, Ruth C.

    2014-01-01

    Microbial virulence is a complex and often multifactorial phenotype, intricately linked to a pathogen’s evolutionary trajectory. Toxicity, the ability to destroy host cell membranes, and adhesion, the ability to adhere to human tissues, are the major virulence factors of many bacterial pathogens, including Staphylococcus aureus. Here, we assayed the toxicity and adhesiveness of 90 MRSA (methicillin-resistant S. aureus) isolates and found that while there was remarkably little variation in adhesion, toxicity varied by over an order of magnitude between isolates, suggesting different evolutionary selection pressures acting on these two traits. We performed a genome-wide association study (GWAS) and identified a large number of loci, as well as a putative network of epistatically interacting loci, that significantly associated with toxicity. Despite this apparent complexity in toxicity regulation, a predictive model based on a set of significant single nucleotide polymorphisms (SNPs) and insertion and deletion events (indels) showed a high degree of accuracy in predicting an isolate’s toxicity solely from the genetic signature at these sites. Our results thus highlight the potential of using sequence data to determine clinically relevant parameters and have further implications for understanding the microbial virulence of this opportunistic pathogen. PMID:24717264

  6. Predicting the virulence of MRSA from its genome sequence.

    PubMed

    Laabei, Maisem; Recker, Mario; Rudkin, Justine K; Aldeljawi, Mona; Gulay, Zeynep; Sloan, Tim J; Williams, Paul; Endres, Jennifer L; Bayles, Kenneth W; Fey, Paul D; Yajjala, Vijaya Kumar; Widhelm, Todd; Hawkins, Erica; Lewis, Katie; Parfett, Sara; Scowen, Lucy; Peacock, Sharon J; Holden, Matthew; Wilson, Daniel; Read, Timothy D; van den Elsen, Jean; Priest, Nicholas K; Feil, Edward J; Hurst, Laurence D; Josefsson, Elisabet; Massey, Ruth C

    2014-05-01

    Microbial virulence is a complex and often multifactorial phenotype, intricately linked to a pathogen's evolutionary trajectory. Toxicity, the ability to destroy host cell membranes, and adhesion, the ability to adhere to human tissues, are the major virulence factors of many bacterial pathogens, including Staphylococcus aureus. Here, we assayed the toxicity and adhesiveness of 90 MRSA (methicillin-resistant S. aureus) isolates and found that while there was remarkably little variation in adhesion, toxicity varied by over an order of magnitude between isolates, suggesting different evolutionary selection pressures acting on these two traits. We performed a genome-wide association study (GWAS) and identified a large number of loci, as well as a putative network of epistatically interacting loci, that significantly associated with toxicity. Despite this apparent complexity in toxicity regulation, a predictive model based on a set of significant single nucleotide polymorphisms (SNPs) and insertion and deletion events (indels) showed a high degree of accuracy in predicting an isolate's toxicity solely from the genetic signature at these sites. Our results thus highlight the potential of using sequence data to determine clinically relevant parameters and have further implications for understanding the microbial virulence of this opportunistic pathogen. PMID:24717264

  7. Deduced amino acid sequence of human pulmonary surfactant proteolipid: SPL(pVal)

    SciTech Connect

    Whitsett, J.A.; Glasser, S.W.; Korfhagen, T.R.; Weaver, T.E.; Clark, J.; Pilot-Matias, T.; Meuth, J.; Fox, J.L.

    1987-05-01

    A hydrophobic, proteolipid-like protein of Mr 6500 was isolated from ether/ethanol extracts of human, canine and bovine pulmonary surfactant. Amino acid composition of the protein demonstrated a remarkable abundance of hydrophobic residues, particularly valine and leucine. The N-terminal amino acid sequence of the human protein was determined: N-Leu-Ile-Pro-Cys-Cys-Pro-Val-Asn-Leu-Lys-Arg-Leu-Leu-Ile-Val4... An oligonucleotide probe was used to screen an adult human lung cDNA library and resulted in detection of cDNA clones with predicted amino acid sequence with close identity to the N-terminal amino acid sequence of the human peptide. SPL(pVal) was found within the reading frame of a larger peptide. SPL(pVal) results from proteolytic processing of a larger preprotein. Northern blot analysis detected a single 1.0 kilobase SPL(pVal) RNA which was less abundant in fetal than in adult lung. Mixtures of purified canine and bovine SPL(pVal) and synthetic phospholipids display properties of rapid adsorption and surface tension lowering activity characteristic of surfactant. Human SPL(pVal) is a pulmonary surfactant proteolipid which may therefore be useful in combination with phospholipids and/or other surfactant proteins for the treatment of surfactant deficiency such as hyaline membrane disease in newborn infants.

  8. Amino acid sequence prerequisites for the formation of cn ions.

    PubMed

    Downard, K M; Biemann, K

    1993-11-01

    Amino acid sequence prerequisites are described for the formation of cn ions observed in high-energy collision-induced decomposition spectra of peptides. It is shown that the formation of cn ions is promoted by the nature of the amino acid C-terminal to the cleavage site. A propensity for cn cleavage preceding threonine, and to a lesser extent tryptophan, lysine, and serine, is demonstrated where fragmentation is directed N-terminally at these residues. In addition, the nature of the residue N-terminal to the cleavage site is shown to have little effect on cn ion formation. A mechanism for cn ion formation is proposed and its applicability to the results observed is discussed. PMID:24227531

  9. Ultrasensitive nucleic acid sequence detection by single-molecule electrophoresis

    SciTech Connect

    Castro, A; Shera, E.B.

    1996-09-01

    This is the final report of a one-year laboratory-directed research and development project at Los Alamos National Laboratory. There has been considerable interest in the development of very sensitive clinical diagnostic techniques over the last few years. Many pathogenic agents are often present in extremely small concentrations in clinical samples, especially at the initial stages of infection, making their detection very difficult. This project sought to develop a new technique for the detection and accurate quantification of specific bacterial and viral nucleic acid sequences in clinical samples. The scheme involved the use of novel hybridization probes for the detection of nucleic acids combined with our recently developed technique of single-molecule electrophoresis. This project is directly relevant to the DOE's Defense Programs strategic directions in the area of biological warfare counter-proliferation.

  10. MitoFates: Improved Prediction of Mitochondrial Targeting Sequences and Their Cleavage Sites

    PubMed Central

    Fukasawa, Yoshinori; Tsuji, Junko; Fu, Szu-Chin; Tomii, Kentaro; Horton, Paul; Imai, Kenichiro

    2015-01-01

    Mitochondria provide numerous essential functions for cells and their dysfunction leads to a variety of diseases. Thus, obtaining a complete mitochondrial proteome should be a crucial step toward understanding the roles of mitochondria. Many mitochondrial proteins have been identified experimentally but a complete list is not yet available. To fill this gap, methods to computationally predict mitochondrial proteins from amino acid sequence have been developed and are widely used, but unfortunately, their accuracy is far from perfect. Here we describe MitoFates, an improved prediction method for cleavable N-terminal mitochondrial targeting signals (presequences) and their cleavage sites. MitoFates introduces novel sequence features including positively charged amphiphilicity, presequence motifs, and position weight matrices modeling the presequence cleavage sites. These features are combined with classical ones such as amino acid composition and physico-chemical properties as input to a standard support vector machine classifier. On independent test data, MitoFates attains better performance than existing predictors in both detection of presequences and in predicting their cleavage sites. We used MitoFates to look for undiscovered mitochondrial proteins from 42,217 human proteins (including isoforms such as alternative splicing or translation initiation variants). MitoFates predicts 1167 genes to have at least one isoform with a presequence. Five-hundred and eighty of these genes were not annotated as mitochondrial in either UniProt or Gene Ontology. Interestingly, these include candidate regulators of parkin translocation to damaged mitochondria, and also many genes with known disease mutations, suggesting that careful investigation of MitoFates predictions may be helpful in elucidating the role of mitochondria in health and disease. MitoFates is open source with a convenient web server publicly available. PMID:25670805

  11. Coevolutionary modeling of protein sequences: Predicting structure, function, and mutational landscapes

    NASA Astrophysics Data System (ADS)

    Weigt, Martin

    Over the last years, biological research has been revolutionized by experimental high-throughput techniques, in particular by next-generation sequencing technology. Unprecedented amounts of data are accumulating, and there is a growing demand for computational methods that unveil the information hidden in raw data, thereby increasing our understanding of complex biological systems. Statistical-physics models based on the maximum-entropy principle have, in the last few years, played an important role in this context. To give a specific example, proteins and many non-coding RNA show a remarkable degree of structural and functional conservation in the course of evolution, despite a large variability in amino acid sequences. We have developed a statistical-mechanics inspired inference approach, called Direct-Coupling Analysis, to link this sequence variability (easy to observe in sequence alignments, which are available in public sequence databases) to bio-molecular structure and function. In my presentation I will show how this methodology can be used (i) to infer contacts between residues and thus to guide tertiary and quaternary protein structure prediction and RNA structure prediction, (ii) to discriminate interacting from non-interacting protein families, and thus to infer conserved protein-protein interaction networks, and (iii) to reconstruct mutational landscapes and thus to predict the phenotypic effect of mutations. References [1] M. Figliuzzi, H. Jacquier, A. Schug, O. Tenaillon and M. Weigt "Coevolutionary landscape inference and the context-dependence of mutations in beta-lactamase TEM-1", Mol. Biol. Evol. (2015), doi: 10.1093/molbev/msv211 [2] E. De Leonardis, B. Lutz, S. Ratz, S. Cocco, R. Monasson, A. Schug, M. Weigt "Direct-Coupling Analysis of nucleotide coevolution facilitates RNA secondary and tertiary structure prediction", Nucleic Acids Research (2015), doi: 10.1093/nar/gkv932 [3] F. Morcos, A. Pagnani, B. Lunt, A. Bertolino, D. Marks, C

  12. Propensity scores for prediction and characterization of bioluminescent proteins from sequences.

    PubMed

    Huang, Hui-Ling

    2014-01-01

    Bioluminescent proteins (BLPs) are a class of proteins with various mechanisms of light emission such as bioluminescence and fluorescence from luminous organisms. While valuable for commercial and medical applications, identification of BLPs, including luciferases and fluorescent proteins (FPs), is rather challenging, owing to their high variety of protein sequences. Moreover, characterization of BLPs facilitates mutagenesis analysis to enhance bioluminescence and fluorescence. Therefore, this study proposes a novel methodological approach to estimating the propensity scores of 400 dipeptides and 20 amino acids in order to design two prediction methods and characterize BLPs based on a scoring card method (SCM). The SCMBLP method for predicting BLPs achieves an accuracy of 90.83% for 10-fold cross-validation higher than existing support vector machine based methods and a test accuracy of 82.85%. A dataset consisting of 269 luciferases and 216 FPs is also established to design the SCMLFP prediction method, which achieves training and test accuracies of 97.10% and 96.28%, respectively. Additionally, four informative physicochemical properties of 20 amino acids are identified using the estimated propensity scores to characterize BLPs as follows: 1) high transfer free energy from inside to the protein surface, 2) high occurrence frequency of residues in the transmembrane regions of the protein, 3) large hydrophobicity scale from the native protein structure, and 4) high correlation coefficient (R = 0.921) between the amino acid compositions of BLPs and integral membrane proteins. Further analyzing BLPs reveals that luciferases have a larger value of R (0.937) than FPs (0.635), suggesting that luciferases tend to locate near the cell membrane location rather than FPs for convenient receipt of extracellular ions. Importantly, the propensity scores of dipeptides and amino acids and the identified properties facilitate efforts to predict, characterize, and apply BLPs
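
    The abstract above scores sequences using propensity values for the 400 dipeptides. A minimal sketch of that general idea follows, assuming the score is a composition-weighted sum of per-dipeptide propensities compared with a threshold. The function names, the threshold rule, and any propensity values passed in are hypothetical; how the study estimates its scores is not stated in the abstract and is not reproduced here.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered residue pairs.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence: str) -> dict:
    """Fraction of each dipeptide among the overlapping pairs in the sequence."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    if total == 0:
        return {d: 0.0 for d in DIPEPTIDES}
    return {d: c / total for d, c in counts.items()}


def card_score(sequence: str, propensity: dict) -> float:
    """Composition-weighted sum of dipeptide propensities (assumed scoring rule)."""
    comp = dipeptide_composition(sequence)
    return sum(comp[d] * propensity.get(d, 0.0) for d in DIPEPTIDES)


def classify(sequence: str, propensity: dict, threshold: float) -> bool:
    """True when the score exceeds a user-chosen threshold (hypothetical rule)."""
    return card_score(sequence, propensity) > threshold
```

    A call such as classify(seq, propensity, threshold) needs a propensity table and threshold fitted on labelled data, which this sketch deliberately leaves to the caller.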

  13. Improved peptide elution time prediction for reversed-phase liquid chromatography-MS by incorporating peptide sequence information

    SciTech Connect

    Petritis, Konstantinos; Kangas, Lars J.; Yan, Bo; Monroe, Matthew E.; Strittmatter, Eric F.; Qian, Weijun; Adkins, Joshua N.; Moore, Ronald J.; Xu, Ying; Lipton, Mary S.; Camp, David G.; Smith, Richard D.

    2006-07-15

    We describe an improved artificial neural network (ANN)-based method for predicting peptide retention times in reversed phase liquid chromatography. In addition to the peptide amino acid composition, this study investigated several other peptide descriptors to improve the predictive capability, such as peptide length, sequence, hydrophobicity and hydrophobic moment, and nearest neighbor amino acid, as well as peptide predicted structural configurations (i.e., helix, sheet, coil). An ANN architecture that consisted of 1052 input nodes, 24 hidden nodes, and 1 output node was used to fully consider the amino acid residue sequence in each peptide. The network was trained using ~345,000 non-redundant peptides identified from a total of 12,059 LC-MS/MS analyses of more than 20 different organisms, and the predictive capability of the model was tested using 1303 confidently identified peptides that were not included in the training set. The model demonstrated an average elution time precision of ~1.5% and was able to distinguish among isomeric peptides based upon the inclusion of peptide sequence information. The prediction power represents a significant improvement over our earlier report (Petritis et al., Anal. Chem. 2003, 75, 1039-1048) and other previously reported models.

  14. Predictive sequence analysis of the Candidatus Liberibacter asiaticus proteome.

    PubMed

    Cong, Qian; Kinch, Lisa N; Kim, Bong-Hyun; Grishin, Nick V

    2012-01-01

    Candidatus Liberibacter asiaticus (Ca. L. asiaticus) is a parasitic gram-negative bacterium that is closely associated with Huanglongbing (HLB), a worldwide citrus disease. Given the difficulty in culturing the bacterium and thus in its experimental characterization, computational analyses of the whole Ca. L. asiaticus proteome can provide much needed insights into the mechanisms of the disease and guide the development of treatment strategies. In this study, we applied state-of-the-art sequence analysis tools to every Ca. L. asiaticus protein. Our results are available as a public website at http://prodata.swmed.edu/liberibacter_asiaticus/. In particular, we manually curated the results to predict the subcellular localization, spatial structure and function of all Ca. L. asiaticus proteins (http://prodata.swmed.edu/liberibacter_asiaticus/curated/). This extensive information should facilitate the study of Ca. L. asiaticus proteome function and its relationship to disease. Pilot studies based on the information from our website have revealed several potential virulence factors, discussed herein. PMID:22815919

  15. Predictive Sequence Analysis of the Candidatus Liberibacter asiaticus Proteome

    PubMed Central

    Cong, Qian; Kinch, Lisa N.; Kim, Bong-Hyun; Grishin, Nick V.

    2012-01-01

    Candidatus Liberibacter asiaticus (Ca. L. asiaticus) is a parasitic Gram-negative bacterium that is closely associated with Huanglongbing (HLB), a worldwide citrus disease. Given the difficulty in culturing the bacterium and thus in its experimental characterization, computational analyses of the whole Ca. L. asiaticus proteome can provide much needed insights into the mechanisms of the disease and guide the development of treatment strategies. In this study, we applied state-of-the-art sequence analysis tools to every Ca. L. asiaticus protein. Our results are available as a public website at http://prodata.swmed.edu/liberibacter_asiaticus/. In particular, we manually curated the results to predict the subcellular localization, spatial structure and function of all Ca. L. asiaticus proteins (http://prodata.swmed.edu/liberibacter_asiaticus/curated/). This extensive information should facilitate the study of Ca. L. asiaticus proteome function and its relationship to disease. Pilot studies based on the information from our website have revealed several potential virulence factors, discussed herein. PMID:22815919

  16. [Prediction of lipases types by different scale pseudo-amino acid composition].

    PubMed

    Zhang, Guangya; Li, Hongchun; Gao, Jiaqiang; Fang, Baishan

    2008-11-01

    Lipases are widely used enzymes in biotechnology. Although they catalyze the same reaction, their sequences vary. Therefore, a fast and reliable method is highly desirable for identifying the types of lipases from their sequences, or even just for confirming whether they are lipases at all. By proposing two scale-based pseudo amino acid composition approaches to extract sequence features, a predictor based on k-nearest neighbor was introduced to address these problems. The overall success rates obtained by the 10-fold cross-validation test were as follows: for predicting lipases and nonlipases, the success rates were 92.8%, 91.4% and 91.3%, respectively. For lipase types, the success rates were 92.3%, 90.3% and 89.7%, respectively. Among them, the Z scales based pseudo amino acid composition was the best, and the T scales based one was second. They significantly outperformed 6 other frequently used sequence feature extraction methods. The high success rates obtained on such a stringent dataset indicate that predicting the types of lipases is feasible and that the different-scale pseudo amino acid composition might be a useful tool for extracting the features of protein sequences, or at least can play a complementary role to many of the other existing approaches. PMID:19256347
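
As a rough illustration of the classification step in this abstract, the sketch below runs a k-nearest-neighbor vote on plain 20-dimensional amino acid composition vectors. It is a simplification, not the authors' method: the Z-scale and T-scale pseudo amino acid composition features are not reproduced, and the sequences, labels, and function names are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Fraction of each of the 20 standard amino acids in a non-empty sequence."""
    counts = Counter(seq)
    return [counts.get(a, 0) / len(seq) for a in AMINO_ACIDS]


def knn_predict(query, training, k=3):
    """Majority label among the k training sequences closest in composition."""
    q = composition(query)
    distances = sorted(
        (sum((a - b) ** 2 for a, b in zip(q, composition(seq))), label)
        for seq, label in training
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data; real inputs would be full-length protein sequences.
training = [
    ("AAAAAG", "type_x"), ("AAAGAA", "type_x"), ("AAGAAA", "type_x"),
    ("WWWWWY", "type_y"), ("WWYWWW", "type_y"), ("YWWWWW", "type_y"),
]
print(knn_predict("AAAAGA", training, k=3))
```

Swapping `composition` for a richer feature extractor would leave the voting logic unchanged, which is why the choice of sequence features can be compared independently of the classifier.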

  17. Improving protein structure prediction using multiple sequence-based contact predictions

    PubMed Central

    Wu, Sitao; Szilagyi, Andras; Zhang, Yang

    2011-01-01

    Summary Although residue-residue contact maps dictate the topology of proteins, sequence-based ab initio contact predictions have found little use in actual structure prediction due to their low accuracy. We developed a composite set of nine SVM-based contact predictors which are used in I-TASSER simulation in combination with sparse template contact restraints. When testing the strategy on 273 non-homologous targets, remarkable improvements of I-TASSER models were observed for both easy and hard targets, with P-values by Student's t-test below 0.00001 and 0.001, respectively. In several cases, TM-score increases by >30%, which essentially converts “non-foldable” targets into “foldable” ones. In CASP9, I-TASSER employed ab initio contact predictions, and generated models for 26 FM targets with a GDT-score 16% and 44% higher than the second and third best servers from other groups, respectively. These findings demonstrate a new avenue to improve the accuracy of protein structure prediction especially for free-modeling targets. PMID:21827953

  18. Genome-Wide Prediction and Analysis of 3D-Domain Swapped Proteins in the Human Genome from Sequence Information

    PubMed Central

    Upadhyay, Atul Kumar; Sowdhamini, Ramanathan

    2016-01-01

    3D-domain swapping is one of the mechanisms of protein oligomerization and the proteins exhibiting this phenomenon have many biological functions. These proteins, which undergo domain swapping, have acquired much attention owing to their involvement in human diseases, such as conformational diseases, amyloidosis, serpinopathies, proteinopathies, etc. Early identification of proteins in the human genome that tend to domain swap will support many aspects of disease control and management. Predictive models were developed by using machine learning approaches with an average accuracy of 78% (85.6% sensitivity, 87.5% specificity and an MCC value of 0.72) to predict putative domain swapping in protein sequences. These models were applied to many complete genomes with special emphasis on the human genome. Nearly 44% of the protein sequences in the human genome were predicted positive for domain swapping. Enrichment analysis was performed on the positively predicted sequences from human genome for their domain distribution, disease association and functional importance based on Gene Ontology (GO). Enrichment analysis was also performed to infer a better understanding of the functional importance of these sequences. Finally, we developed hinge region prediction, in the given putative domain swapped sequence, by using important physicochemical properties of amino acids. PMID:27467780
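
The performance measures named in this abstract (accuracy, sensitivity, specificity and the Matthews correlation coefficient, MCC) all derive from a binary confusion matrix. A minimal sketch using their standard definitions follows; the counts are hypothetical and are not taken from the study.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Standard confusion-matrix metrics for a two-class predictor."""
    sensitivity = tp / (tp + fn)                 # true positive rate
    specificity = tn / (tn + fp)                 # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts: 100 positives and 100 negatives.
metrics = binary_metrics(tp=90, fn=10, tn=80, fp=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Unlike accuracy, MCC stays informative when the two classes are of very different sizes, which is one reason it is often reported alongside sensitivity and specificity.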

  19. 37 CFR 1.822 - Symbols and format to be used for nucleotide and/or amino acid sequence data.

    Code of Federal Regulations, 2013 CFR

    2013-07-01

    ... in the sequence. (4) The enumeration of amino acids may start at the first amino acid of the first..., counting backwards starting with the amino acid next to number 1. Otherwise, the enumeration of amino acids... sequence every 5 amino acids. The enumeration method for amino acid sequences that is set forth......

  20. 37 CFR 1.822 - Symbols and format to be used for nucleotide and/or amino acid sequence data.

    Code of Federal Regulations, 2014 CFR

    2014-07-01

    ... in the sequence. (4) The enumeration of amino acids may start at the first amino acid of the first..., counting backwards starting with the amino acid next to number 1. Otherwise, the enumeration of amino acids... sequence every 5 amino acids. The enumeration method for amino acid sequences that is set forth......

  1. 37 CFR 1.822 - Symbols and format to be used for nucleotide and/or amino acid sequence data.

    Code of Federal Regulations, 2012 CFR

    2012-07-01

    ... in the sequence. (4) The enumeration of amino acids may start at the first amino acid of the first..., counting backwards starting with the amino acid next to number 1. Otherwise, the enumeration of amino acids... sequence every 5 amino acids. The enumeration method for amino acid sequences that is set forth......

  2. Complete amino acid sequence of a histidine-rich proteolytic fragment of human ceruloplasmin.

    PubMed

    Kingston, I B; Kingston, B L; Putnam, F W

    1979-04-01

    The complete amino acid sequence has been determined for a fragment of human ceruloplasmin [ferroxidase; iron(II):oxygen oxidoreductase, EC 1.16.3.1]. The fragment (designated Cp F5) contains 159 amino acid residues and has a molecular weight of 18,650; it lacks carbohydrate, is rich in histidine, and contains one free cysteine that may be part of a copper-binding site. This fragment is present in most commercial preparations of ceruloplasmin, probably owing to proteolytic degradation, but can also be obtained by limited cleavage of single-chain ceruloplasmin with plasmin. Cp F5 probably is an intact domain attached to the COOH-terminal end of single-chain ceruloplasmin via a labile interdomain peptide bond. A model of the secondary structure predicted by empirical methods suggests that almost one-third of the amino acid residues are distributed in alpha helices, about a third in beta-sheet structure, and the remainder in beta turns and unidentified structures. Computer analysis of the amino acid sequence has not demonstrated a statistically significant relationship between this ceruloplasmin fragment and any other protein, but there is some evidence for an internal duplication. PMID:287005

  3. Uric acid excretion predicts increased aggression in urban adolescents.

    PubMed

    Mrug, Sylvie; Mrug, Michal

    2016-09-01

    Elevated levels of uric acid have been linked with impulsive and disinhibited behavior in clinical and community populations of adults, but no studies have examined uric acid in relation to adolescent aggression. This study examined the prospective role of uric acid in aggressive behavior among urban, low income adolescents, and whether this relationship varies by gender. A total of 84 adolescents (M age 13.36 years; 50% male; 95% African American) self-reported on their physical aggression at baseline and 1.5 years later. At baseline, the youth also completed a 12-h (overnight) urine collection at home which was used to measure uric acid excretion. After adjusting for baseline aggression and age, greater uric acid excretion predicted more frequent aggressive behavior at follow up, with no significant gender differences. The results suggest that lowering uric acid levels may help reduce youth aggression. PMID:27180134

  4. Prediction of Protein Structural Classes for Low-Similarity Sequences Based on Consensus Sequence and Segmented PSSM

    PubMed Central

    Liang, Yunyun; Liu, Sanyang; Zhang, Shengli

    2015-01-01

    Prediction of protein structural classes for low-similarity sequences is useful for understanding fold patterns, regulation, functions, and interactions of proteins. It is well known that feature extraction is significant to prediction of protein structural class and it mainly uses protein primary sequence, predicted secondary structure sequence, and position-specific scoring matrix (PSSM). Currently, prediction solely based on the PSSM has played a key role in improving the prediction accuracy. In this paper, we propose a novel method called CSP-SegPseP-SegACP by fusing consensus sequence (CS), segmented PsePSSM, and segmented autocovariance transformation (ACT) based on PSSM. Three widely used low-similarity datasets (1189, 25PDB, and 640) are adopted in this paper. Then a 700-dimensional (700D) feature vector is constructed and the dimension is decreased to 224D by using principal component analysis (PCA). To verify the performance of our method, rigorous jackknife cross-validation tests are performed on 1189, 25PDB, and 640 datasets. Comparison of our results with the existing PSSM-based methods demonstrates that our method achieves favorable and competitive performance. This will offer an important complement to other PSSM-based methods for prediction of protein structural classes for low-similarity sequences. PMID:26788119
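
To make the autocovariance transformation mentioned in this abstract more concrete, the sketch below computes a plain (unsegmented) autocovariance per PSSM column. It is an assumption-laden illustration: the segmentation scheme, the lag range, and the PSSM normalization used by the authors are not given in the abstract, and the toy matrix has only two columns instead of the usual 20.

```python
def autocovariance(column, lag):
    """Average product of mean-centered scores that are `lag` positions apart."""
    n = len(column)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(column) - 1")
    mean = sum(column) / n
    total = sum((column[i] - mean) * (column[i + lag] - mean) for i in range(n - lag))
    return total / (n - lag)


def pssm_ac_features(pssm, max_lag):
    """Concatenate autocovariances for every column and every lag 1..max_lag.

    `pssm` is a list of rows, one per residue; each row holds one score per
    amino acid type.
    """
    features = []
    for j in range(len(pssm[0])):
        column = [row[j] for row in pssm]
        features.extend(autocovariance(column, lag) for lag in range(1, max_lag + 1))
    return features


# Hypothetical 4-residue PSSM with 2 columns, for brevity.
toy_pssm = [[1, 0], [2, 1], [3, 0], [4, 1]]
print(pssm_ac_features(toy_pssm, max_lag=2))
```

With 20 columns and `max_lag` lags this yields 20 × `max_lag` features per sequence regardless of sequence length, which is what makes such descriptors usable as fixed-size classifier inputs.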

  5. Prediction of Protein Structural Classes for Low-Similarity Sequences Based on Consensus Sequence and Segmented PSSM.

    PubMed

    Liang, Yunyun; Liu, Sanyang; Zhang, Shengli

    2015-01-01

    Prediction of protein structural classes for low-similarity sequences is useful for understanding fold patterns, regulation, functions, and interactions of proteins. It is well known that feature extraction is significant to prediction of protein structural class and it mainly uses protein primary sequence, predicted secondary structure sequence, and position-specific scoring matrix (PSSM). Currently, prediction solely based on the PSSM has played a key role in improving the prediction accuracy. In this paper, we propose a novel method called CSP-SegPseP-SegACP by fusing consensus sequence (CS), segmented PsePSSM, and segmented autocovariance transformation (ACT) based on PSSM. Three widely used low-similarity datasets (1189, 25PDB, and 640) are adopted in this paper. Then a 700-dimensional (700D) feature vector is constructed and the dimension is decreased to 224D by using principal component analysis (PCA). To verify the performance of our method, rigorous jackknife cross-validation tests are performed on 1189, 25PDB, and 640 datasets. Comparison of our results with the existing PSSM-based methods demonstrates that our method achieves favorable and competitive performance. This will offer an important complement to other PSSM-based methods for prediction of protein structural classes for low-similarity sequences. PMID:26788119

  6. Predicted Molecular Effects of Sequence Variants Link to System Level of Disease.

    PubMed

    Reeb, Jonas; Hecht, Maximilian; Mahlich, Yannick; Bromberg, Yana; Rost, Burkhard

    2016-08-01

    Developments in experimental and computational biology are advancing our understanding of how protein sequence variation impacts molecular protein function. However, the leap from the micro level of molecular function to the macro level of the whole organism, e.g. disease, remains barred. Here, we present new results emphasizing earlier work that suggested some links from molecular function to disease. We focused on non-synonymous single nucleotide variants, also referred to as single amino acid variants (SAVs). Building upon OMIA (Online Mendelian Inheritance in Animals), we introduced a curated set of 117 disease-causing SAVs in animals. Methods optimized to capture effects upon molecular function often correctly predict human (OMIM) and animal (OMIA) Mendelian disease-causing variants. We also predicted effects of human disease-causing variants in the mouse model, i.e. we put OMIM SAVs into mouse orthologs. Overall, fewer variants were predicted with effect in the model organism than in the original organism. Our results, along with other recent studies, demonstrate that predictions of molecular effects capture some important aspects of disease. Thus, in silico methods focusing on the micro level of molecular function can help to understand the macro system level of disease. PMID:27536940

  7. Thermodynamic prediction of hydrogen production from mixed-acid fermentations.

    PubMed

    Forrest, Andrea K; Wales, Melinda E; Holtzapple, Mark T

    2011-10-01

    The MixAlco™ process biologically converts biomass to carboxylate salts that may be chemically converted to a wide variety of chemicals and fuels. The process utilizes lignocellulosic biomass as feedstock (e.g., municipal solid waste, sewage sludge, and agricultural residues), creating an economic basis for sustainable biofuels. This study provides a thermodynamic analysis of hydrogen yield from mixed-acid fermentations from two feedstocks: paper and bagasse. During batch fermentations, hydrogen production, acid production, and sugar digestion were analyzed to determine the energy selectivity of each system. To predict hydrogen production during continuous operation, this energy selectivity was then applied to countercurrent fermentations of the same systems. The analysis successfully predicted hydrogen production from the paper fermentation to within 11% and the bagasse fermentation to within 21% of the actual production. The analysis was able to faithfully represent hydrogen production and represents a step forward in understanding and predicting hydrogen production from mixed-acid fermentations. PMID:21875794

  8. SuSPect: enhanced prediction of single amino acid variant (SAV) phenotype using network features.

    PubMed

    Yates, Christopher M; Filippis, Ioannis; Kelley, Lawrence A; Sternberg, Michael J E

    2014-07-15

    Whole-genome and exome sequencing studies reveal many genetic variants between individuals, some of which are linked to disease. Many of these variants lead to single amino acid variants (SAVs), and accurate prediction of their phenotypic impact is important. Incorporating sequence conservation and network-level features, we have developed a method, SuSPect (Disease-Susceptibility-based SAV Phenotype Prediction), for predicting how likely SAVs are to be associated with disease. SuSPect performs significantly better than other available batch methods on the VariBench benchmarking dataset, with a balanced accuracy of 82%. SuSPect is available at www.sbg.bio.ic.ac.uk/suspect. The Web site has been implemented in Perl and SQLite and is compatible with modern browsers. An SQLite database of possible missense variants in the human proteome is available to download at www.sbg.bio.ic.ac.uk/suspect/download.html. PMID:24810707

  9. Prophage Finder: a prophage loci prediction tool for prokaryotic genome sequences.

    PubMed

    Bose, M; Barber, Robert D

    2006-01-01

    Prophage loci often remain under-annotated or even unrecognized in prokaryotic genome sequencing projects. A PHP application, Prophage Finder, has been developed and implemented to predict prophage loci, based upon clusters of phage-related gene products encoded within DNA sequences. This application provides results detailing several facets of these clusters to facilitate rapid prediction and analysis of prophage sequences. Prophage Finder was tested using previously annotated prokaryotic genomic sequences with manually curated prophage loci as benchmarks. Additional analyses from Prophage Finder searches of several draft prokaryotic genome sequences are available through the Web site (http://bioinformatics.uwp.edu/~phage/DOEResults.php) to illustrate the potential of this application. PMID:16922685

  10. Critical assessment of sequence-based protein-protein interaction prediction methods that do not require homologous protein sequences

    PubMed Central

    2009-01-01

    Background Protein-protein interactions underlie many important biological processes. Computational prediction methods can nicely complement experimental approaches for identifying protein-protein interactions. Recently, a unique category of sequence-based prediction methods has been put forward - unique in the sense that it does not require homologous protein sequences. This enables it to be universally applicable to all protein sequences, unlike many previous sequence-based prediction methods. If effective as claimed, these new sequence-based, universally applicable prediction methods would have far-reaching utility in many areas of biology research. Results Upon close survey, I realized that many of these new methods were ill-tested. In addition, newer methods were often published without performance comparison with previous ones. Thus, it is not clear how good they are and whether there are significant performance differences among them. In this study, I have implemented and thoroughly tested 4 different methods on large-scale, non-redundant data sets. It reveals several important points. First, significant performance differences are noted among different methods. Second, data sets typically used for training prediction methods appear significantly biased, limiting the general applicability of prediction methods trained with them. Third, there is still ample room for further developments. In addition, my analysis illustrates the importance of complementary performance measures coupled with right-sized data sets for meaningful benchmark tests. Conclusions The current study reveals the potentials and limits of the new category of sequence-based protein-protein interaction prediction methods, which in turn provides a firm ground for future endeavours in this important area of contemporary bioinformatics. PMID:20003442

  11. Wiggle—Predicting Functionally Flexible Regions from Primary Sequence

    PubMed Central

    Gu, Jenny; Gribskov, Michael; Bourne, Philip E

    2006-01-01

    The Wiggle series are support vector machine–based predictors that identify regions of functional flexibility using only protein sequence information. Functionally flexible regions are defined as regions that can adopt different conformational states and are assumed to be necessary for bioactivity. Many advances have been made in understanding the relationship between protein sequence and structure. This work contributes to those efforts by making strides to understand the relationship between protein sequence and flexibility. A coarse-grained protein dynamic modeling approach was used to generate the dataset required for support vector machine training. We define our regions of interest based on the participation of residues in correlated large-scale fluctuations. Even with this structure-based approach to computationally define regions of functional flexibility, predictors successfully extract sequence-flexibility relationships that have been experimentally confirmed to be functionally important. Thus, a sequence-based tool to identify flexible regions important for protein function has been created. The ability to identify functional flexibility using a sequence based approach complements structure-based definitions and will be especially useful for the large majority of proteins with unknown structures. The methodology offers promise to identify structural genomics targets amenable to crystallization and the possibility to engineer more flexible or rigid regions within proteins to modify their bioactivity. PMID:16839194

  12. Human retroviruses and AIDS 1996. A compilation and analysis of nucleic acid and amino acid sequences

    SciTech Connect

    Myers, G.; Foley, B.; Korber, B.; Mellors, J.W.; Jeang, K.T.; Wain-Hobson, S.

    1997-04-01

    This compendium and the accompanying floppy diskettes are the result of an effort to compile and rapidly publish all relevant molecular data concerning the human immunodeficiency viruses (HIV) and related retroviruses. The scope of the compendium and database is best summarized by the five parts that it comprises: (1) Nucleic Acid Alignments and Sequences; (2) Amino Acid Alignments; (3) Analysis; (4) Related Sequences; and (5) Database Communications. Information within all the parts is updated throughout the year on the Web site, http://hiv-web.lanl.gov. While this publication could take the form of a review or sequence monograph, it is not so conceived. Instead, the literature from which the database is derived has simply been summarized and some elementary computational analyses have been performed upon the data. Interpretation and commentary have been avoided insofar as possible so that the reader can form his or her own judgments concerning the complex information. In addition to the general descriptions of the parts of the compendium, the user should read the individual introductions for each part.

  13. EASE-MM: Sequence-Based Prediction of Mutation-Induced Stability Changes with Feature-Based Multiple Models.

    PubMed

    Folkman, Lukas; Stantic, Bela; Sattar, Abdul; Zhou, Yaoqi

    2016-03-27

    Protein engineering and characterisation of non-synonymous single nucleotide variants (SNVs) require accurate prediction of protein stability changes (ΔΔGu) induced by single amino acid substitutions. Here, we have developed a new prediction method called Evolutionary, Amino acid, and Structural Encodings with Multiple Models (EASE-MM), which comprises five specialised support vector machine (SVM) models and makes the final prediction from a consensus of two models selected based on the predicted secondary structure and accessible surface area of the mutated residue. The new method is applicable to single-domain monomeric proteins and can predict ΔΔGu with a protein sequence and mutation as the only inputs. EASE-MM yielded a Pearson correlation coefficient of 0.53-0.59 in 10-fold cross-validation and independent testing and was able to outperform other sequence-based methods. When compared to structure-based energy functions, EASE-MM achieved a comparable or better performance. The application to a large dataset of human germline non-synonymous SNVs showed that the disease-causing variants tend to be associated with larger magnitudes of ΔΔGu predicted with EASE-MM. The EASE-MM web-server is available at http://sparks-lab.org/server/ease. PMID:26804571
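
The Pearson correlation coefficient reported in this abstract measures how linearly predicted values track measured ones. A minimal sketch of the standard formula follows; the numbers are hypothetical and are not EASE-MM outputs.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical measured vs. predicted stability changes (arbitrary units).
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 3.0, 2.0, 4.0]
print(pearson_r(measured, predicted))
```

Because the coefficient is insensitive to scale and offset, a predictor can correlate well with experiment and still be systematically biased, so it is usually read together with an error measure.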

  14. Classification of Arabidopsis thaliana gene sequences: clustering of coding sequences into two groups according to codon usage improves gene prediction.

    PubMed

    Mathé, C; Peresetsky, A; Déhais, P; Van Montagu, M; Rouzé, P

    1999-02-01

    While genomic sequences are accumulating, finding the location of the genes remains a major issue that can be solved only for about half of them by homology searches. Prediction methods are thus required, but unfortunately are not fully satisfying. Most prediction methods implicitly assume a unique model for genes. This is an oversimplification, as demonstrated by the possibility of grouping coding sequences into several classes in Escherichia coli and other genomes. As no classification existed for Arabidopsis thaliana, we classified genes according to the statistical features of their coding sequences. A clustering algorithm using a codon usage model was developed and applied to coding sequences from A. thaliana, E. coli, and a mixture of both. By using it, Arabidopsis sequences were clustered into two classes. The CU1 and CU2 classes differed essentially by the choice of pyrimidine bases at the codon silent sites: CU2 genes often use C whereas CU1 genes prefer T. This classification discriminated the Arabidopsis genes according to their expressiveness, highly expressed genes being clustered in CU2 and genes expected to have a lower expression, such as the regulatory genes, in CU1. The algorithm separated the sequences of the Escherichia-Arabidopsis mixed data set into five classes according to the species, except for one class. This mixed class contained 89% Arabidopsis genes from CU1 and 11% E. coli genes, mostly horizontally transferred. Interestingly, most genes encoding organelle-targeted proteins, except the photosynthetic and photoassimilatory ones, were clustered in CU1. By tailoring the GeneMark CDS prediction algorithm to the observed coding sequence classes, its quality of prediction was greatly improved. Similar improvement can be expected with other prediction systems. PMID:9925779
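
The C-versus-T preference described in this abstract can be approximated with a very small statistic. In the standard genetic code, two codons that differ only by C versus T in the third position encode the same amino acid, so the share of C among pyrimidine-ending codons reflects a silent-site choice. The sketch below computes that share; it is not the authors' clustering algorithm, and the function name and example sequence are hypothetical.

```python
def third_position_c_fraction(cds):
    """Fraction of C among codons whose third base is a pyrimidine (C or T).

    Returns None when no codon ends in C or T. Trailing bases that do not
    form a complete codon are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    thirds = [cds[i + 2] for i in range(0, usable, 3)]
    c_count = thirds.count("C")
    t_count = thirds.count("T")
    if c_count + t_count == 0:
        return None
    return c_count / (c_count + t_count)


# Hypothetical coding sequence: ATG GCC TTT GAC TAA
print(third_position_c_fraction("ATGGCCTTTGACTAA"))
```

By the abstract's description, a gene with a high value would lean toward the CU2 pattern and one with a low value toward CU1, although any real assignment would rest on the full codon usage model rather than this single number.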

  15. Natural vs. random protein sequences: Discovering combinatorics properties on amino acid words.

    PubMed

    Santoni, Daniele; Felici, Giovanni; Vergni, Davide

    2016-02-21

    Random mutations and natural selection have driven the evolution of the protein amino acid sequences that we observe in nature today. The question of which is the dominant force in protein evolution still lacks an unambiguous answer. Random mutations tend to randomize protein sequences while, in order to have the correct functionality, one expects that selection mechanisms impose rigid constraints on amino acid sequences. Moreover, one also has to consider that the space of all possible amino acid sequences is so astonishingly large that a well-tuned amino acid sequence could reasonably be indistinguishable from a random one. To study whether random and natural amino acid sequences can be discriminated, we introduce different measures of association between pairs of amino acids in a sequence, and apply them to a dataset of 1047 natural protein sequences and 10,470 random sequences, carefully generated in order to preserve the relative length and amino acid distribution of the natural proteins. We analyze the multidimensional measures with machine learning techniques and show that, to a reasonable extent, natural protein sequences can be differentiated from random ones. PMID:26656109
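
One simple way to build random sequences that keep the length and amino acid distribution of a natural protein is to shuffle its residues. The sketch below does this; it is an assumed procedure, since the abstract does not say how the random set was generated, and the example sequence is hypothetical. Ten controls per natural sequence would match the 10,470 to 1047 ratio quoted above, though the abstract does not state how the two sets were paired.

```python
import random
from collections import Counter


def shuffled_control(seq, rng):
    """Return a random permutation of `seq` (same length, same composition)."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def make_controls(seq, n, seed=0):
    """Generate `n` shuffled controls for one sequence, reproducibly."""
    rng = random.Random(seed)
    return [shuffled_control(seq, rng) for _ in range(n)]


natural = "MKTAYIAKQRQISFVKSHFSRQ"  # hypothetical example sequence
controls = make_controls(natural, n=10, seed=1)
print(controls[0])
print(all(Counter(c) == Counter(natural) for c in controls))
```

Shuffling preserves composition exactly but destroys residue order, so any measure that still separates natural from shuffled sequences must depend on how residues are arranged rather than on how many of each there are.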

  16. Transcriptome Sequencing in Response to Salicylic Acid in Salvia miltiorrhiza

    PubMed Central

    Zhang, Xiaoru; Dong, Juane; Liu, Hailong; Wang, Jiao; Qi, Yuexin; Liang, Zongsuo

    2016-01-01

    Salvia miltiorrhiza is a traditional Chinese herbal medicine, whose quality and yield are often affected by diseases and environmental stresses during its growing season. Salicylic acid (SA) plays a significant role in plants responding to biotic and abiotic stresses, but the involved regulatory factors and their signaling mechanisms are largely unknown. In order to identify the genes involved in SA signaling, the RNA sequencing (RNA-seq) strategy was employed to evaluate the transcriptional profiles in S. miltiorrhiza cell cultures. A total of 50,778 unigenes were assembled, in which 5,316 unigenes were differentially expressed among 0-, 2-, and 8-h SA induction. The up-regulated genes were mainly involved in stimulus response and multi-organism process. A core set of candidate novel genes coding SA signaling component proteins was identified. Many transcription factors (e.g., WRKY, bHLH and GRAS) and genes involved in hormone signal transduction were differentially expressed in response to SA induction. Detailed analysis revealed that genes associated with defense signaling, such as antioxidant system genes, cytochrome P450s and ATP-binding cassette transporters, were significantly overexpressed, which can be used as genetic tools to investigate disease resistance. Our transcriptome analysis will help understand SA signaling and its mechanism of defense systems in S. miltiorrhiza. PMID:26808150

  17. Transcriptome Sequencing in Response to Salicylic Acid in Salvia miltiorrhiza.

    PubMed

    Zhang, Xiaoru; Dong, Juane; Liu, Hailong; Wang, Jiao; Qi, Yuexin; Liang, Zongsuo

    2016-01-01

    Salvia miltiorrhiza is a traditional Chinese herbal medicine, whose quality and yield are often affected by diseases and environmental stresses during its growing season. Salicylic acid (SA) plays a significant role in plants responding to biotic and abiotic stresses, but the involved regulatory factors and their signaling mechanisms are largely unknown. In order to identify the genes involved in SA signaling, the RNA sequencing (RNA-seq) strategy was employed to evaluate the transcriptional profiles in S. miltiorrhiza cell cultures. A total of 50,778 unigenes were assembled, in which 5,316 unigenes were differentially expressed among 0-, 2-, and 8-h SA induction. The up-regulated genes were mainly involved in stimulus response and multi-organism process. A core set of candidate novel genes coding SA signaling component proteins was identified. Many transcription factors (e.g., WRKY, bHLH and GRAS) and genes involved in hormone signal transduction were differentially expressed in response to SA induction. Detailed analysis revealed that genes associated with defense signaling, such as antioxidant system genes, cytochrome P450s and ATP-binding cassette transporters, were significantly overexpressed, which can be used as genetic tools to investigate disease resistance. Our transcriptome analysis will help understand SA signaling and its mechanism of defense systems in S. miltiorrhiza. PMID:26808150

  18. Accuracy of genomic prediction using imputed whole-genome sequence data in white layers.

    PubMed

    Heidaritabar, M; Calus, M P L; Megens, H-J; Vereijken, A; Groenen, M A M; Bastiaansen, J W M

    2016-06-01

    There is an increasing interest in using whole-genome sequence data in genomic selection breeding programmes. Prediction of breeding values is expected to be more accurate when whole-genome sequence is used, because the causal mutations are assumed to be in the data. We performed genomic prediction for the number of eggs in white layers using imputed whole-genome resequence data including ~4.6 million SNPs. The prediction accuracies based on sequence data were compared with the accuracies from the 60 K SNP panel. Predictions were based on genomic best linear unbiased prediction (GBLUP) as well as a Bayesian variable selection model (BayesC). Moreover, the prediction accuracy from using different types of variants (synonymous, non-synonymous and non-coding SNPs) was evaluated. Genomic prediction using the 60 K SNP panel resulted in a prediction accuracy of 0.74 when GBLUP was applied. With sequence data, there was a small increase (~1%) in prediction accuracy over the 60 K genotypes. With both 60 K SNP panel and sequence data, GBLUP slightly outperformed BayesC in predicting the breeding values. Selection of SNPs more likely to affect the phenotype (i.e. non-synonymous SNPs) did not improve the accuracy of genomic prediction. The fact that sequence data were based on imputation from a small number of sequenced animals may have limited the potential to improve the prediction accuracy. A small reference population (n = 1004) and possible exclusion of many causal SNPs during quality control can be other possible reasons for limited benefit of sequence data. We expect, however, that the limited improvement is because the 60 K SNP panel was already sufficiently dense to accurately determine the relationships between animals in our data. PMID:26776363

  19. Genome Sequence Analysis of the Naphthenic Acid Degrading and Metal Resistant Bacterium Cupriavidus gilardii CR3.

    PubMed

    Wang, Xiaoyu; Chen, Meili; Xiao, Jingfa; Hao, Lirui; Crowley, David E; Zhang, Zhewen; Yu, Jun; Huang, Ning; Huo, Mingxin; Wu, Jiayan

    2015-01-01

    Cupriavidus sp. are generally heavy metal tolerant bacteria with the ability to degrade a variety of aromatic hydrocarbon compounds, although the degradation pathways and substrate versatilities remain largely unknown. Here we studied the bacterium Cupriavidus gilardii strain CR3, which was isolated from a natural asphalt deposit, and which was shown to utilize naphthenic acids as a sole carbon source. Genome sequencing of C. gilardii CR3 was carried out to elucidate possible mechanisms for the naphthenic acid biodegradation. The genome of C. gilardii CR3 was composed of two circular chromosomes chr1 and chr2 of respectively 3,539,530 bp and 2,039,213 bp in size. The genome for strain CR3 encoded 4,502 putative protein-coding genes, 59 tRNA genes, and many other non-coding genes. Many genes were associated with xenobiotic biodegradation and metal resistance functions. Pathway prediction for degradation of cyclohexanecarboxylic acid, a representative naphthenic acid, suggested that naphthenic acid undergoes initial ring-cleavage, after which the ring fission products can be degraded via several plausible degradation pathways including a mechanism similar to that used for fatty acid oxidation. The final metabolic products of these pathways are unstable or volatile compounds that were not toxic to CR3. Strain CR3 was also shown to have tolerance to at least 10 heavy metals, which was mainly achieved by self-detoxification through ion efflux, metal-complexation and metal-reduction, and a powerful DNA self-repair mechanism. Our genomic analysis suggests that CR3 is well adapted to survive the harsh environment in natural asphalts containing naphthenic acids and high concentrations of heavy metals. PMID:26301592

  20. Genome Sequence Analysis of the Naphthenic Acid Degrading and Metal Resistant Bacterium Cupriavidus gilardii CR3

    PubMed Central

    Xiao, Jingfa; Hao, Lirui; Crowley, David E.; Zhang, Zhewen; Yu, Jun; Huang, Ning; Huo, Mingxin; Wu, Jiayan

    2015-01-01

    Cupriavidus sp. are generally heavy metal tolerant bacteria with the ability to degrade a variety of aromatic hydrocarbon compounds, although the degradation pathways and substrate versatilities remain largely unknown. Here we studied the bacterium Cupriavidus gilardii strain CR3, which was isolated from a natural asphalt deposit, and which was shown to utilize naphthenic acids as a sole carbon source. Genome sequencing of C. gilardii CR3 was carried out to elucidate possible mechanisms for the naphthenic acid biodegradation. The genome of C. gilardii CR3 was composed of two circular chromosomes chr1 and chr2 of respectively 3,539,530 bp and 2,039,213 bp in size. The genome for strain CR3 encoded 4,502 putative protein-coding genes, 59 tRNA genes, and many other non-coding genes. Many genes were associated with xenobiotic biodegradation and metal resistance functions. Pathway prediction for degradation of cyclohexanecarboxylic acid, a representative naphthenic acid, suggested that naphthenic acid undergoes initial ring-cleavage, after which the ring fission products can be degraded via several plausible degradation pathways including a mechanism similar to that used for fatty acid oxidation. The final metabolic products of these pathways are unstable or volatile compounds that were not toxic to CR3. Strain CR3 was also shown to have tolerance to at least 10 heavy metals, which was mainly achieved by self-detoxification through ion efflux, metal-complexation and metal-reduction, and a powerful DNA self-repair mechanism. Our genomic analysis suggests that CR3 is well adapted to survive the harsh environment in natural asphalts containing naphthenic acids and high concentrations of heavy metals. PMID:26301592

  1. Predicting Salmonella enterica serotypes by repetitive sequence-based PCR

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Repetitive extragenic palindromic sequence-based PCR (rep-PCR) utilizing a semi-automated system, was evaluated as a method to determine Salmonella serotypes. A group of 216 Salmonella isolates belonging to 13 frequently isolated serotypes and one rarer serotype from poultry were used to create a D...

  2. OrfPredictor: predicting protein-coding regions in EST-derived sequences.

    PubMed

    Min, Xiang Jia; Butler, Gregory; Storms, Reginald; Tsang, Adrian

    2005-07-01

    OrfPredictor is a web server designed for identifying protein-coding regions in expressed sequence tag (EST)-derived sequences. For query sequences with a hit in BLASTX, the program predicts the coding regions based on the translation reading frames identified in BLASTX alignments; otherwise, it predicts the most probable coding region based on the intrinsic signals of the query sequences. The output is the predicted peptide sequences in FASTA format, with a definition line that includes the query ID, the translation reading frame and the nucleotide positions where the coding region begins and ends. OrfPredictor facilitates the annotation of EST-derived sequences, particularly for large-scale EST projects. OrfPredictor is available at https://fungalgenome.concordia.ca/tools/OrfPredictor.html. PMID:15980561
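
The abstract does not specify which intrinsic signals OrfPredictor uses when no BLASTX hit is available. The sketch below is therefore not OrfPredictor's algorithm. It shows one common baseline, assumed here for illustration: translate all six reading frames with the standard genetic code and keep the longest stop-free stretch that begins with methionine. The names `longest_orf`, `translate` and `reverse_complement` are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq, frame):
    """Translate seq starting at offset frame (0, 1 or 2); unknown codons become 'X'."""
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def longest_orf(seq):
    """Return (frame_label, peptide) for the longest M-initiated, stop-free stretch.

    Frame labels are '+1'..'+3' for the given strand and '-1'..'-3' for the
    reverse complement. A stretch that runs off the end of the sequence is
    accepted, since ESTs are often partial.
    """
    seq = seq.upper()
    best = ("", "")
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for segment in translate(strand_seq, frame).split("*"):
                start = segment.find("M")
                if start != -1 and len(segment) - start > len(best[1]):
                    best = (f"{strand}{frame + 1}", segment[start:])
    return best


if __name__ == "__main__":
    print(longest_orf("CCATGGCTTGGTAAAC"))
```

A real predictor would also need to handle ESTs that lack a start codon and would report nucleotide coordinates, as the definition line described above does.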

  3. New layers in understanding and predicting α-linolenic acid content in plants using amino acid characteristics of omega-3 fatty acid desaturase.

    PubMed

    Zinati, Zahra; Zamansani, Fatemeh; Hossein KayvanJoo, Amir; Ebrahimi, Mahdi; Ebrahimi, Mansour; Ebrahimie, Esmaeil; Mohammadi Dehcheshmeh, Manijeh

    2014-11-01

    α-linolenic acid (ALA) is the most frequent omega-3 in plants. The content of ALA is highly variable, ranging from 0 to 1% in rice and corn to >50% in perilla and flax. ALA production is strongly correlated with the enzymatic activity of omega-3 fatty acid desaturase. To unravel the underlying mechanisms of omega-3 diversity, 895 protein features of omega-3 fatty acid desaturase were compared between plants with high and low omega-3. Attribute weighting showed that this enzyme in plants with high omega-3 content has higher amounts of Lys, Lys-Phe, and Pro-Asn but lower Aliphatic index, Gly-His, and Pro-Leu. The Random Forest model with Accuracy criterion when run on the dataset pre-filtered with Info Gain algorithm was the best model in distinguishing high omega-3 content based on the frequency of Lys-Lys in the structure of fatty acid desaturase. Interestingly, the discriminant function algorithm could predict the level of omega-3 only based on the six important selected attributes (out of 895 protein attributes) of fatty acid desaturase with 75% accuracy. We developed "Plant omega3 predictor" to predict the content of α-linolenic acid based on structural features of omega-3 fatty acid desaturase. The software calculates the 6 key structural protein features from imported Fasta sequence of omega-3 fatty acid desaturase or utilizes the imported features and predicts the ALA content using discriminant function formula. This work unravels an underpinning mechanism of omega-3 diversity via discovery of the key protein attributes in the structure of omega-3 desaturase offering a new approach to obtain higher omega-3 content. PMID:25199845

  4. SeqAPASS (Sequence Alignment to Predict Across Species Susceptibility) software and documentation

    EPA Science Inventory

    SeqAPASS is a software application that facilitates rapid and streamlined, yet transparent, comparisons of the similarity of toxicologically-significant molecular targets across species. The present application facilitates analysis of primary amino acid sequence similarity (including ...

  5. Sequence and structure-based prediction of fructosyltransferase activity for functional subclassification of fungal GH32 enzymes.

    PubMed

    Trollope, Kim M; van Wyk, Niël; Kotjomela, Momo A; Volschenk, Heinrich

    2015-12-01

    Sucrolytic enzymes catalyse sucrose hydrolysis or the synthesis of fructooligosaccharides (FOSs), a prebiotic in human and animal nutrition. FOS synthesis capacity differs between sucrolytic enzymes. Amino-acid-sequence-based classification of FOS synthesizing enzymes would greatly facilitate the in silico identification of novel catalysts, as large amounts of sequence data lie untapped. The development of a bioinformatics tool to rapidly distinguish between high-level FOS-synthesizing and predominantly sucrose-hydrolysing enzymes from fungal genomic data is presented. Sequence comparison of functionally characterized enzymes displaying low- and high-level FOS synthesis revealed conserved motifs unique to each group. New light is shed on the sequence context of active site residues in three previously identified conserved motifs. We characterized two enzymes predicted to possess low- and high-level FOS synthesis activities based on their conserved motif sequences. FOS data for the enzymes confirmed our successful prediction of their FOS synthesis capacity. Structural comparison of enzymes displaying low- and high-level FOS synthesis identified steric hindrance between nystose and a long loop region present only in low-level FOS synthesizers. This loop is proposed to limit the synthesis of FOS species with higher degrees of polymerization, a phenomenon observed among enzymes displaying low-level FOS synthesis. Conserved sequence motifs surrounding catalytic residues and a distant structural determinant were identifiers of FOS synthesis capacity and allow for functional annotation of sucrolytic enzymes directly from amino acid sequence. The tool presented may also be useful to study the structure-function relationships of β-fructofuranosidases by identifying mutations present in a group of closely related enzymes displaying similar function. PMID:26426731

  6. Characterization of Newcastle disease virus isolates by reverse transcription PCR coupled to direct nucleotide sequencing and development of sequence database for pathotype prediction and molecular epidemiological analysis.

    PubMed Central

    Seal, B S; King, D J; Bennett, J D

    1995-01-01

    Degenerate oligonucleotide primers were synthesized to amplify nucleotide sequences from portions of the fusion protein and matrix protein genes of Newcastle disease virus (NDV) genomic RNA that could be used diagnostically. These primers were used in a single-tube reverse transcription PCR of NDV genomic RNA coupled to direct nucleotide sequencing of the amplified product to characterize more than 30 NDV isolates. In agreement with previous reports, differences in the fusion protein cleavage sequence that correlated genotypically with virulence among various NDV pathotypes were detected. By using sequences generated from the matrix protein gene coding for the nuclear localization signal, lentogenic viruses were again grouped phylogenetically separate from other pathotypes. These techniques were applied to compare neurotropic velogenic viruses isolated from an outbreak of Newcastle disease in cormorants and turkeys. Cormorant NDV isolates and an NDV isolate from an infected turkey flock in North Dakota had the fusion protein cleavage sequence 109SRGRRQKRFVG119. The R-for-G substitution at position 110 may be unique for the cormorant-type isolates. Although the amino acid sequences from the fusion protein cleavage site were identical, nucleotide sequence data correlate the outbreak in turkeys to a cormorant virus isolate from Minnesota and not to a cormorant virus isolate from Michigan. On the basis of sequence information, the cormorant isolates are virulent viruses related to isolates of psittacine origin, possibly genotypically distinct from other velogenic NDV isolates. These techniques can be used reliably for Newcastle disease epidemiology and for prediction of pathotypes of NDV isolates without traditional live-bird inoculations. PMID:8567895

  7. SIFT Indel: Predictions for the Functional Effects of Amino Acid Insertions/Deletions in Proteins

    PubMed Central

    Hu, Jing; Ng, Pauline C.

    2013-01-01

    Indels in the coding regions of a gene can either cause frameshifts or amino acid insertions/deletions. Frameshifting indels are indels that have a length that is not divisible by 3 and subsequently cause frameshifts. Indels that have a length divisible by 3 cause amino acid insertions/deletions or block substitutions; we call these 3n indels. The new amino acid changes resulting from 3n indels could potentially affect protein function. Therefore, we construct a SIFT Indel prediction algorithm for 3n indels which achieves 82% accuracy, 81% sensitivity, 82% specificity, 82% precision, 0.63 MCC, and 0.87 AUC by 10-fold cross-validation. We have previously published a prediction algorithm for frameshifting indels. The rules for the prediction of 3n indels are different from the rules for the prediction of frameshifting indels and reflect the biological differences of these two different types of variations. SIFT Indel was applied to human 3n indels from the 1000 Genomes Project and the Exome Sequencing Project. We found that common variants are less likely to be deleterious than rare variants. The SIFT indel prediction algorithm for 3n indels is available at http://sift-dna.org/ PMID:24194902
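
The length rule stated in the abstract (length divisible by 3 or not) and the reported evaluation measures can be sketched as follows. This is a minimal illustration, not the SIFT Indel implementation; the function names and the reference/alternate allele representation are assumptions.

```python
import math


def classify_indel(ref, alt):
    """Label a coding indel '3n' if its length is divisible by 3, else 'frameshift'."""
    length = abs(len(alt) - len(ref))
    if length == 0:
        raise ValueError("ref and alt have the same length; not an indel")
    return "3n" if length % 3 == 0 else "frameshift"


def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification measures from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        # Matthews correlation coefficient; defined as 0 when a margin is empty.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


if __name__ == "__main__":
    print(classify_indel("A", "ATTT"))   # 3-base insertion
    print(classify_indel("ATG", "A"))    # 2-base deletion
    print(binary_metrics(tp=81, fp=18, tn=82, fn=19))  # toy counts, not study data
```

The counts in the usage lines are invented to exercise the functions and do not reproduce the study's confusion matrix.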

  8. Next Generation Sequencing in Predicting Gene Function in Podophyllotoxin Biosynthesis

    PubMed Central

    Marques, Joaquim V.; Kim, Kye-Won; Lee, Choonseok; Costa, Michael A.; May, Gregory D.; Crow, John A.; Davin, Laurence B.; Lewis, Norman G.

    2013-01-01

    Podophyllum species are sources of (−)-podophyllotoxin, an aryltetralin lignan used for semi-synthesis of various powerful and extensively employed cancer-treating drugs. Its biosynthetic pathway, however, remains largely unknown, with the last unequivocally demonstrated intermediate being (−)-matairesinol. Herein, massively parallel sequencing of Podophyllum hexandrum and Podophyllum peltatum transcriptomes and subsequent bioinformatics analyses of the corresponding assemblies were carried out. Validation of the assembly process was first achieved through confirmation of assembled sequences with those of various genes previously established as involved in podophyllotoxin biosynthesis as well as other candidate biosynthetic pathway genes. This contribution describes characterization of two of the latter, namely the cytochrome P450s, CYP719A23 from P. hexandrum and CYP719A24 from P. peltatum. Both enzymes were capable of converting (−)-matairesinol into (−)-pluviatolide by catalyzing methylenedioxy bridge formation and did not act on other possible substrates tested. Interestingly, the enzymes described herein were highly similar to methylenedioxy bridge-forming enzymes from alkaloid biosynthesis, whereas candidates more similar to lignan biosynthetic enzymes were catalytically inactive with the substrates employed. This overall strategy has thus enabled facile further identification of enzymes putatively involved in (−)-podophyllotoxin biosynthesis and underscores the deductive power of next generation sequencing and bioinformatics to probe and deduce medicinal plant biosynthetic pathways. PMID:23161544

  9. Prediction of protein subcellular localization by support vector machines using multi-scale energy and pseudo amino acid composition.

    PubMed

    Shi, J-Y; Zhang, S-W; Pan, Q; Cheng, Y-M; Xie, J

    2007-07-01

    As more and more genomes have been discovered in recent years, there is an urgent need to develop a reliable method to predict the subcellular localization for the explosion of newly found proteins. However, many well-known prediction methods based on amino acid composition have problems utilizing the sequence-order information. Here, based on the concept of Chou's pseudo amino acid composition (PseAA), a new feature extraction method, the multi-scale energy (MSE) approach, is introduced to incorporate the sequence-order information. First, a protein sequence was mapped to a digital signal using the amino acid index. Then, by wavelet transform, the mapped signal was broken down into several scales in which the energy factors were calculated and further formed into an MSE feature vector. Following this, combining this MSE feature vector with amino acid composition (AA), we constructed a series of MSEPseAA feature vectors to represent the protein subcellular localization sequences. Finally, according to a new kind of normalization approach, the MSEPseAA feature vectors were normalized to form the improved MSEPseAA vectors, named as IEPseAA. Using the technique of IEPseAA, C-support vector machine (C-SVM) and three multi-class SVMs strategies, quite promising results were obtained, indicating that MSE is quite effective in reflecting the sequence-order effects and might become a useful tool for predicting the other attributes of proteins as well. PMID:17235454
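
The pipeline described above (map residues to numbers, decompose the signal by wavelet transform, compute an energy per scale) can be sketched as follows. The abstract names neither the amino acid index nor the wavelet, so the Kyte-Doolittle hydropathy scale and the Haar wavelet used here are assumptions, as is the function name `multiscale_energy`. The paper's own normalization step is not reproduced.

```python
import math

# Kyte-Doolittle hydropathy values, used here as an assumed amino acid index.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def haar_step(signal):
    """One level of the orthonormal Haar transform: returns (approximation, detail)."""
    if len(signal) % 2:
        signal = signal + [signal[-1]]  # pad odd-length input by repeating the last value
    root2 = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / root2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / root2 for i in range(0, len(signal), 2)]
    return approx, detail


def multiscale_energy(sequence, levels=3):
    """Return relative energies: one per detail scale, then the final approximation."""
    signal = [HYDROPATHY[aa] for aa in sequence.upper()]
    energies = []
    for _ in range(levels):
        signal, detail = haar_step(signal)
        energies.append(sum(d * d for d in detail))
    energies.append(sum(a * a for a in signal))
    total = sum(energies) or 1.0
    return [e / total for e in energies]


if __name__ == "__main__":
    print(multiscale_energy("MKTLLVAGGAVLSLC", levels=3))
```

Because the vector has a fixed length of `levels + 1` regardless of protein length, it can be concatenated with the 20-dimensional amino acid composition before being passed to a classifier, in the spirit of the MSEPseAA features described above.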

  10. Detection and isolation of nucleic acid sequences using a bifunctional hybridization probe

    DOEpatents

    Lucas, Joe N.; Straume, Tore; Bogen, Kenneth T.

    2000-01-01

    A method for detecting and isolating a target sequence in a sample of nucleic acids is provided using a bifunctional hybridization probe capable of hybridizing to the target sequence that includes a detectable marker and a first complexing agent capable of forming a binding pair with a second complexing agent. A kit is also provided for detecting a target sequence in a sample of nucleic acids using a bifunctional hybridization probe according to this method.

  11. Servers for sequence-structure relationship analysis and prediction.

    PubMed

    Dosztányi, Zsuzsanna; Magyar, Csaba; Tusnády, Gábor E; Cserzo, Miklós; Fiser, András; Simon, István

    2003-07-01

    We describe several algorithms and public servers that were developed to analyze and predict various features of protein structures. These servers provide information about the covalent state of cysteine (CYSREDOX), as well as about residues involved in non-covalent cross links that play an important role in the structural stability of proteins (SCIDE and SCPRED). We also discuss methods and servers developed to identify helical transmembrane proteins from large databases and rough genomic data, including two of the most popular transmembrane prediction methods, DAS and HMMTOP. Several biologically interesting applications of these servers are also presented. The servers are available through http://www.enzim.hu/servers.html. PMID:12824327

  12. Predicting RNA secondary structures from sequence and probing data.

    PubMed

    Lorenz, Ronny; Wolfinger, Michael T; Tanzer, Andrea; Hofacker, Ivo L

    2016-07-01

    RNA secondary structures have proven essential for understanding the regulatory functions performed by RNA such as microRNAs, bacterial small RNAs, or riboswitches. This success is in part due to the availability of efficient computational methods for predicting RNA secondary structures. Recent advances focus on dealing with the inherent uncertainty of prediction by considering the ensemble of possible structures rather than the single most stable one. Moreover, the advent of high-throughput structural probing has spurred the development of computational methods that incorporate such experimental data as auxiliary information. PMID:27064083

  13. Affinity regression predicts the recognition code of nucleic acid binding proteins

    PubMed Central

    Pelossof, Raphael; Singh, Irtisha; Yang, Julie L.; Weirauch, Matthew T.; Hughes, Timothy R.; Leslie, Christina S.

    2016-01-01

    Predicting the affinity profiles of nucleic acid-binding proteins directly from the protein sequence is a major unsolved problem. We present a statistical approach for learning the recognition code of a family of transcription factors (TFs) or RNA-binding proteins (RBPs) from high-throughput binding assays. Our method, called affinity regression, trains on protein binding microarray (PBM) or RNA compete experiments to learn an interaction model between proteins and nucleic acids, using only protein domain and probe sequences as inputs. By training on mouse homeodomain PBM profiles, our model correctly identifies residues that confer DNA-binding specificity and accurately predicts binding motifs for an independent set of divergent homeodomains. Similarly, learning from RNA compete profiles for diverse RBPs, our model can predict the binding affinities of held-out proteins and identify key RNA-binding residues. More broadly, we envision applying our method to model and predict biological interactions in any setting where there is a high-throughput ‘affinity’ readout. PMID:26571099

  14. First draft genome sequencing of indole acetic acid producing and plant growth promoting fungus Preussia sp. BSL10.

    PubMed

    Khan, Abdul Latif; Asaf, Sajjad; Khan, Abdur Rahim; Al-Harrasi, Ahmed; Al-Rawahi, Ahmed; Lee, In-Jung

    2016-05-10

    Preussia sp. BSL10, family Sporormiaceae, actively produces the phytohormone indole-3-acetic acid and extracellular enzymes (phosphatases and glucosidases). The fungus also promotes the growth of the arid-land tree Boswellia sacra. Given these prospects, we sequenced its draft genome for the first time. Illumina-based sequence analysis reveals an approximate genome size of 31.4 Mbp for Preussia sp. BSL10. Based on ab initio gene prediction, a total of 32,312 coding sequences were annotated, consisting of 11,967 coding genes, pseudogenes, and 221 tRNA genes. Furthermore, 321 carbohydrate-active enzymes were predicted and classified into many functional families. PMID:26995610

  15. Structure Prediction and Analysis of Neuraminidase Sequence Variants

    ERIC Educational Resources Information Center

    Thayer, Kelly M.

    2016-01-01

    Analyzing protein structure has become an integral aspect of understanding systems of biochemical import. The laboratory experiment endeavors to introduce protein folding to ascertain structures of proteins for which the structure is unavailable, as well as to critically evaluate the quality of the prediction obtained. The model system used is the…

  16. Sequence-based feature prediction and annotation of proteins

    PubMed Central

    Juncker, Agnieszka S; Jensen, Lars J; Pierleoni, Andrea; Bernsel, Andreas; Tress, Michael L; Bork, Peer; von Heijne, Gunnar; Valencia, Alfonso; Ouzounis, Christos A; Casadio, Rita; Brunak, Søren

    2009-01-01

    A recent trend in computational methods for annotation of protein function is that many prediction tools are combined in complex workflows and pipelines to facilitate the analysis of feature combinations, for example, the entire repertoire of kinase-binding motifs in the human proteome. PMID:19226438

  17. Prediction of Out-of-Sequence Development by BSID Scores.

    ERIC Educational Resources Information Center

    Richards, Ruth C.; And Others

    The primary purpose of this study was to examine uneven early development in premature infants. A multiple regression analysis was performed in which birth weight, length of gestation, length of assisted feeding, and length of ventilation were used to predict the discrepancy between a child's Psychomotor and Mental Scale scores on the Bayley…

  18. Representation of DNA sequences in genetic codon context with applications in exon and intron prediction.

    PubMed

    Yin, Changchuan

    2015-04-01

    To apply digital signal processing (DSP) methods to analyze DNA sequences, the sequences first must be specially mapped into numerical sequences. Thus, effective numerical mappings of DNA sequences play key roles in the effectiveness of DSP-based methods such as exon prediction. Despite numerous mappings of symbolic DNA sequences to numerical series, the existing mapping methods do not include the genetic coding features of DNA sequences. We present a novel numerical representation of DNA sequences using genetic codon context (GCC) in which the numerical values are optimized by simulated annealing to maximize the 3-periodicity signal to noise ratio (SNR). The optimized GCC representation is then applied in exon and intron prediction by the Short-Time Fourier Transform (STFT) approach. The results show the GCC method enhances the SNR values of exon sequences and thus increases the accuracy of predicting protein coding regions in genomes compared with the commonly used 4D binary representation. In addition, this study offers a novel way to reveal specific features of DNA sequences by optimizing numerical mappings of symbolic DNA sequences. PMID:25491390
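
For reference, the baseline that the abstract compares against (the 4D binary representation) and the 3-periodicity SNR can be sketched as follows. The SNR definition used here, power at frequency N/3 divided by the mean power over all non-zero frequencies, is a common convention and is assumed rather than taken from the paper. The GCC mapping itself is not reproduced, and the function names are hypothetical.

```python
import cmath


def binary_indicator_power(seq, k):
    """Total DFT power at frequency index k, summed over the four binary indicator sequences."""
    n = len(seq)
    total = 0.0
    for base in "ACGT":
        # DFT coefficient of the 0/1 indicator sequence for this base.
        coeff = sum(
            cmath.exp(-2j * cmath.pi * k * i / n)
            for i, b in enumerate(seq)
            if b == base
        )
        total += abs(coeff) ** 2
    return total


def periodicity_snr(seq):
    """Ratio of the power at N/3 to the mean power over k = 1..N-1 (assumed definition)."""
    seq = seq.upper()
    n = len(seq) - len(seq) % 3  # trim so that N/3 is an integer frequency index
    if n < 3:
        raise ValueError("sequence too short")
    seq = seq[:n]
    spectrum = [binary_indicator_power(seq, k) for k in range(1, n)]
    mean_power = sum(spectrum) / len(spectrum)
    return spectrum[n // 3 - 1] / mean_power


if __name__ == "__main__":
    print(periodicity_snr("ATG" * 20))                   # perfectly 3-periodic toy sequence
    print(periodicity_snr("ATGGCGTTTAAACCCGGGATCGATCG"))  # arbitrary toy sequence
```

This direct evaluation costs O(N²) per sequence, which is acceptable for a short window; a sliding-window analysis such as the STFT mentioned above would normally use an FFT instead.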

  19. Isolation and amino acid sequences of squirrel monkey (Saimiri sciurea) insulin and glucagon.

    PubMed Central

    Yu, J H; Eng, J; Yalow, R S

    1990-01-01

    It was reported two decades ago that insulin was not detectable in the glucose-stimulated state in Saimiri sciurea, the New World squirrel monkey, by a radioimmunoassay system developed with guinea pig anti-pork insulin antibody and labeled pork insulin. With the same system, reasonable levels were observed in rhesus monkeys and chimpanzees. This suggested that New World monkeys, like the New World hystricomorph rodents such as the guinea pig and the coypu, might have insulins whose sequences differ markedly from those of Old World mammals. In this report we describe the purification and amino acid sequences of squirrel monkey insulin and glucagon. We demonstrate that the substitutions at B29, B27, A2, A4, and A17 of squirrel monkey insulin are identical with those previously found in another New World primate, the owl monkey (Aotus trivirgatus). The immunologic cross-reactivity of this insulin in our immunoassay system is only a few percent of that of human insulin. Squirrel monkey glucagon is identical with the usual glucagon found in Old World mammals, which predicts that the glucagons of other New World monkeys would not differ from the usual Old World mammalian glucagon. It appears that the peptides of the New World monkeys have diverged less from those of the Old World mammals than have those of the New World hystricomorph rodents. The striking improvements in peptide purification and sequencing have the potential for adding new information concerning the evolutionary divergence of species. PMID:2263627

  20. Complete amino acid sequence of the medium-chain S-acyl fatty acid synthetase thio ester hydrolase from rat mammary gland

    SciTech Connect

    Randhawa, Z.I.; Smith, S.

    1987-03-10

    The complete amino acid sequence of the medium-chain S-acyl fatty acid synthetase thio ester hydrolase (thioesterase II) from rat mammary gland is presented. Most of the sequence was derived by analysis of (¹⁴C)-labelled peptide fragments produced by cleavage at methionyl, glutamyl, lysyl, arginyl, and tryptophanyl residues. A small section of the sequence was deduced from a previously analyzed cDNA clone. The protein consists of 260 residues and has a blocked amino-terminal methionine and calculated Mᵣ of 29,212. The carboxy-terminal sequence, verified by Edman degradation of the carboxy-terminal cyanogen bromide fragment and carboxypeptidase Y digestion of the intact thioesterase II, terminates with a serine residue and lacks three additional residues predicted by the cDNA sequence. The native enzyme contains three cysteine residues but no disulfide bridges. The active site serine residue is located at position 101. The rat mammary gland thioesterase II exhibits approximately 40% homology with a thioesterase from mallard uropygial gland, the sequence of which was recently determined by cDNA analysis. Thus the two enzymes may share similar structural features and a common evolutionary origin. The location of the active site in these thioesterases differs from that of other serine active site esterases; indeed, the enzymes do not exhibit any significant homology with other serine esterases, suggesting that they may constitute a separate new family of serine active site enzymes.

  1. Prediction of G protein-coupled receptor encoding sequences from the synganglion transcriptome of the cattle tick, Rhipicephalus microplus.

    PubMed

    Guerrero, Felix D; Kellogg, Anastasia; Ogrey, Alexandria N; Heekin, Andrew M; Barrero, Roberto; Bellgard, Matthew I; Dowd, Scot E; Leung, Ming-Ying

    2016-07-01

    The cattle tick, Rhipicephalus (Boophilus) microplus, is a pest which causes multiple health complications in cattle. The G protein-coupled receptor (GPCR) super-family presents a candidate target for developing novel tick control methods. However, GPCRs share limited sequence similarity among orthologous family members, and there is no reference genome available for R. microplus. This limits the effectiveness of alignment-dependent methods such as BLAST and Pfam for identifying GPCRs from R. microplus. However, GPCRs share a common structure consisting of seven transmembrane helices. We present an analysis of the R. microplus synganglion transcriptome using a combination of structurally-based and alignment-free methods which supplement the identification of GPCRs by sequence similarity. TMHMM predicts the number of transmembrane helices in a protein sequence. GPCRpred is a support vector machine-based method developed to predict and classify GPCRs using the dipeptide composition of a query amino acid sequence. These two bioinformatic tools were applied to our transcriptome assembly of the cattle tick synganglion. Together, BLAST and Pfam identified 85 unique contigs as encoding partial or full length candidate cattle tick GPCRs. Collectively, TMHMM and GPCRpred identified 27 additional GPCR candidates that BLAST and Pfam missed. This demonstrates that the addition of structurally-based and alignment-free bioinformatic approaches to transcriptome annotation and analysis produces a greater collection of prospective GPCRs than an analysis based solely upon methodologies dependent upon sequence alignment and similarity. PMID:26922323
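
The dipeptide composition that GPCRpred is described as using can be computed as a 400-dimensional feature vector. The sketch below shows only that feature extraction step; it is not GPCRpred code, the function name is hypothetical, and reporting fractions rather than percentages is an assumption.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 ordered residue pairs, in a fixed order so that vectors are comparable.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return the fraction of each of the 400 dipeptides among overlapping residue pairs.

    Pairs containing a non-standard residue (for example 'X') are skipped.
    """
    sequence = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


if __name__ == "__main__":
    vector = dipeptide_composition("MKTLLVAGGAVLSLC")
    print(len(vector), sum(vector))
```

Because the vector depends only on local residue pairs, it needs no alignment to a reference sequence, which is why such features can complement BLAST and Pfam searches for families with limited sequence similarity.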

  2. DNA sequencing and predictions of the cosmic theory of life

    NASA Astrophysics Data System (ADS)

    Wickramasinghe, N. Chandra

    2013-01-01

    The theory of cometary panspermia, developed by the late Sir Fred Hoyle and the present author argues that life originated cosmically as a unique event in one of a great multitude of comets or planetary bodies in the Universe. Life on Earth did not originate here but was introduced by impacting comets, and its further evolution was driven by the subsequent acquisition of cosmically derived genes. Explicit predictions of this theory published in 1979-1981, stating how the acquisition of new genes drives evolution, are compared with recent developments in relation to horizontal gene transfer, and the role of retroviruses in evolution. Precisely-stated predictions of the theory of cometary panspermia are shown to have been verified.

  3. Paroxysmal atrial fibrillation prediction method with shorter HRV sequences.

    PubMed

    Boon, K H; Khalil-Hani, M; Malarvili, M B; Sia, C W

    2016-10-01

    This paper proposes a method that predicts the onset of paroxysmal atrial fibrillation (PAF), using heart rate variability (HRV) segments that are shorter than those applied in existing methods, while maintaining good prediction accuracy. PAF is a common cardiac arrhythmia that increases the health risk of a patient, and the development of an accurate predictor of the onset of PAF is clinically important because it increases the possibility of electrically stabilizing the heart and preventing the onset of atrial arrhythmias with different pacing techniques. We investigate the effect of HRV features extracted from different lengths of HRV segments prior to PAF onset with the proposed PAF prediction method. The pre-processing stage of the predictor includes QRS detection, HRV quantification and ectopic beat correction. Time-domain, frequency-domain, non-linear and bispectrum features are then extracted from the quantified HRV. In the feature selection, the HRV feature set and classifier parameters are optimized simultaneously using an optimization procedure based on a genetic algorithm (GA). Both the full feature set and the statistically significant feature subset are optimized by the GA. For the statistically significant feature subset, the Mann-Whitney U test is used to filter out features that are not statistically significant at the 20% significance level. The final stage of our predictor is a classifier based on a support vector machine (SVM). A 10-fold cross-validation is applied in performance evaluation, and the proposed method achieves 79.3% prediction accuracy using 15-minute HRV segments. This accuracy is comparable to that achieved by existing methods that use 30-minute HRV segments, most of which achieve accuracy of around 80%. More importantly, our method significantly outperforms those that applied segments shorter than 30 minutes. PMID:27480743
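
The abstract does not list the individual features. As an illustration of the time-domain group only, the sketch below computes three widely used HRV measures (SDNN, RMSSD and pNN50) from a series of RR intervals. Whether the authors used these particular measures is an assumption, and the function name is hypothetical.

```python
import math


def time_domain_hrv(rr_ms):
    """Compute common time-domain HRV measures from RR intervals in milliseconds."""
    n = len(rr_ms)
    if n < 2:
        raise ValueError("need at least two RR intervals")
    mean_rr = sum(rr_ms) / n
    # SDNN: sample standard deviation of the RR intervals.
    sdnn = math.sqrt(sum((x - mean_rr) ** 2 for x in rr_ms) / (n - 1))
    # Successive differences between adjacent RR intervals.
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    # RMSSD: root mean square of the successive differences.
    rmssd = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    # pNN50: percentage of successive differences larger than 50 ms.
    pnn50 = 100.0 * sum(abs(d) > 50 for d in diffs) / len(diffs)
    return {"mean_rr": mean_rr, "sdnn": sdnn, "rmssd": rmssd, "pnn50": pnn50}


if __name__ == "__main__":
    # Toy RR series in milliseconds, not patient data.
    print(time_domain_hrv([812, 790, 845, 801, 779, 860, 795]))
```

In a pipeline like the one described, such values would be computed after ectopic beat correction and then combined with the frequency-domain, non-linear and bispectrum features before feature selection.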

  4. Prediction of multi-drug resistance transporters using a novel sequence analysis method

    PubMed Central

    McDermott, Jason E.; Bruillard, Paul; Overall, Christopher C.; Gosink, Luke; Lindemann, Stephen R.

    2015-01-01

    There are many examples of groups of proteins that have similar function, but the determinants of functional specificity may be hidden by lack of sequence similarity, or by large groups of similar sequences with different functions. Transporters are one such protein group in that the general function, transport, can be easily inferred from the sequence, but the substrate specificity can be impossible to predict from sequence with current methods. In this paper we describe a linguistic-based approach to identify functional patterns from groups of unaligned protein sequences and its application to predict multi-drug resistance transporters (MDRs) from bacteria. We first show that our method can recreate known patterns from PROSITE for several motifs from unaligned sequences. We then show that the method, MDRpred, can predict MDRs with greater accuracy and positive predictive value than a collection of currently available family-based models from the Pfam database. Finally, we apply MDRpred to a large collection of protein sequences from an environmental microbiome study to make novel predictions about drug resistance in a potential environmental reservoir. PMID:26913187

  5. Partial amino acid sequence of human factor D: homology with serine proteases.

    PubMed Central

    Volanakis, J E; Bhown, A; Bennett, J C; Mole, J E

    1980-01-01

    Human factor D purified to homogeneity by a modified procedure was subjected to NH2-terminal amino acid sequence analysis by using a modified automated Beckman sequencer. We identified 48 of the first 57 NH2-terminal amino acids in a single sequencer run, using microgram quantities of factor D. The deduced amino acid sequence represents approximately 25% of the primary structure of factor D. This extended NH2-terminal amino acid sequence of factor D was compared to that of other trypsin-related serine proteases. By visual inspection, strong homologies (33-50% identity) were observed with all the serine proteases included in the comparison. Interestingly, factor D showed a higher degree of homology to serine proteases of pancreatic origin than to those of serum origin. Images PMID:6987665

  6. Amino acid sequence of Japanese quail (Coturnix japonica) and northern bobwhite (Colinus virginianus) myoglobin.

    PubMed

    Goodson, John; Beckstead, Robert B; Payne, Jason; Singh, Rakesh K; Mohan, Anand

    2015-08-15

    Myoglobin has an important physiological role in vertebrates, and as the primary sarcoplasmic pigment in meat, influences quality perception and consumer acceptability. In this study, the amino acid sequences of Japanese quail and northern bobwhite myoglobin were deduced by cDNA cloning of the coding sequence from mRNA. Japanese quail myoglobin was isolated from quail cardiac muscles, purified using ammonium sulphate precipitation and gel-filtration, and subjected to multiple enzymatic digestions. Mass spectrometry corroborated the deduced protein amino acid sequence at the protein level. Sequence analysis revealed both species' myoglobin structures consist of 153 amino acids, differing at only three positions. When compared with chicken myoglobin, Japanese quail showed 98% sequence identity, and northern bobwhite 97% sequence identity. The myoglobin in both quail species contained eight histidine residues instead of the nine present in chicken and turkey. PMID:25794748
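
    The identity figures quoted in this abstract follow from a simple ratio. As a minimal sketch, assuming two already-aligned sequences of equal length with no gaps, percent identity can be computed as below. The function name and the toy sequences are hypothetical and are not the actual myoglobin sequences. Under the same no-gap assumption, two 153-residue sequences that differ at three positions share 150/153, or about 98.0%, identity.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of positions with identical residues in two aligned,
    equal-length sequences (no gap handling in this sketch)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must be non-empty")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# Toy stand-ins (not real myoglobin): 153 residues, 3 mismatches.
seq_a = "A" * 153
seq_b = "G" * 3 + "A" * 150
identity = percent_identity(seq_a, seq_b)
print(f"{identity:.2f}% identity")
```

    Real comparisons would normally start from a pairwise alignment, and conventions differ on whether gap columns count toward the denominator.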

  7. Amino acid sequence analysis and characterization of a ribonuclease from starfish Asterias amurensis.

    PubMed

    Motoyoshi, Naomi; Kobayashi, Hiroko; Itagaki, Tadashi; Inokuchi, Norio

    2016-09-01

    The aim of this study was to phylogenetically characterize the location of the RNase T2 enzyme in the starfish (Asterias amurensis). We isolated an RNase T2 ribonuclease (RNase Aa) from the ovaries of starfish and determined its amino acid sequence by protein chemistry and cloning cDNA encoding RNase Aa. The isolated protein had 231 amino acid residues, a predicted molecular mass of 25,906 Da, and an optimal pH of 5.0. RNase Aa preferentially released guanylic acid from the RNA. The catalytic sites of the RNase T2 family are conserved in RNase Aa; furthermore, the distribution of the cysteine residues in RNase Aa is similar to that in other animal and plant T2 RNases. RNase Aa is cleaved at two points: 21 residues from the N-terminus and 29 residues from the C-terminus; however, both fragments may remain attached to the protein via disulfide bridges, leading to the maintenance of its conformation, as suggested by circular dichroism spectrum analysis. The phylogenetic analysis revealed that starfish RNase Aa is evolutionarily an intermediate between protozoan and oyster RNases. PMID:26920046

  8. Using the concept of Chou's pseudo amino acid composition to predict protein solubility: an approach with entropies in information theory.

    PubMed

    Xiaohui, Niu; Nana, Li; Jingbo, Xia; Dingyan, Chen; Yuehua, Peng; Yang, Xiao; Weiquan, Wei; Dongming, Wang; Zengzhen, Wang

    2013-09-01

    Protein solubility plays a major role in proteomics. Predicting the propensity of a protein to be soluble or to form inclusion bodies is a fundamental problem that has not been fully resolved. To predict protein solubility, almost 10,000 protein sequences were downloaded from NCBI. Highly homologous sequences were then removed with CD-HIT, leaving 5692 sequences. Amino acid and dipeptide compositions are commonly extracted from protein sequences to predict protein solubility. In this study, entropy from information theory was introduced as another predictive factor in the model. Experiments involving nine different feature vector combinations, including the above-mentioned three kinds of factors, were conducted with support vector machines (SVMs) as the prediction engine. Each combination was evaluated by a re-substitution test and a 10-fold cross-validation test. According to the evaluation results, the accuracies and Matthews correlation coefficient (MCC) values were boosted by the introduction of the entropy. The best combination was the one with amino acid compositions, dipeptide compositions and their entropies. Its accuracy reached 90.34% and its MCC value was 0.7494 in the re-substitution test, while the corresponding values were 88.12% and 0.7945 for 10-fold cross-validation. In conclusion, the introduction of the entropy significantly improved the performance of the predictive method. PMID:23524162
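
    To make the three kinds of features named in this abstract concrete, the following is a minimal sketch of amino acid composition, dipeptide composition, and a Shannon entropy computed over each. The function names and the exact entropy definition are assumptions made for illustration; the abstract does not specify how the authors computed their entropy terms.

```python
import math
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seq: str, k: int = 1) -> list[float]:
    """Relative frequency of every possible k-mer over the 20 residues
    (k=1: amino acid composition, k=2: dipeptide composition)."""
    n_windows = max(len(seq) - k + 1, 1)
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts["".join(p)] / n_windows
            for p in product(AMINO_ACIDS, repeat=k)]


def shannon_entropy(freqs: list[float]) -> float:
    """Shannon entropy in bits; zero-frequency terms contribute nothing."""
    return -sum(f * math.log2(f) for f in freqs if f > 0)


def feature_vector(seq: str) -> list[float]:
    """Assumed layout: 20 composition values, 400 dipeptide values,
    then the entropy of each composition (422 features in total)."""
    aa = composition(seq, 1)
    dp = composition(seq, 2)
    return aa + dp + [shannon_entropy(aa), shannon_entropy(dp)]


features = feature_vector("ACAC")
print(len(features), features[420])
```

    A vector like this could be passed to any SVM implementation; the classifier itself is left out here to keep the sketch dependency-free.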

  9. Identification of random nucleic acid sequence aberrations using dual capture probes which hybridize to different chromosome regions

    DOEpatents

    Lucas, Joe N.; Straume, Tore; Bogen, Kenneth T.

    1998-01-01

    A method is provided for detecting nucleic acid sequence aberrations using two immobilization steps. According to the method, a nucleic acid sequence aberration is detected by detecting nucleic acid sequences having both a first nucleic acid sequence type (e.g., from a first chromosome) and a second nucleic acid sequence type (e.g., from a second chromosome), the presence of the first and the second nucleic acid sequence type on the same nucleic acid sequence indicating the presence of a nucleic acid sequence aberration. In the method, immobilization of a first hybridization probe is used to isolate a first set of nucleic acids in the sample which contain the first nucleic acid sequence type. Immobilization of a second hybridization probe is then used to isolate a second set of nucleic acids from within the first set of nucleic acids which contain the second nucleic acid sequence type. The second set of nucleic acids are then detected, their presence indicating the presence of a nucleic acid sequence aberration.

  10. Identification of random nucleic acid sequence aberrations using dual capture probes which hybridize to different chromosome regions

    DOEpatents

    Lucas, J.N.; Straume, T.; Bogen, K.T.

    1998-03-24

    A method is provided for detecting nucleic acid sequence aberrations using two immobilization steps. According to the method, a nucleic acid sequence aberration is detected by detecting nucleic acid sequences having both a first nucleic acid sequence type (e.g., from a first chromosome) and a second nucleic acid sequence type (e.g., from a second chromosome), the presence of the first and the second nucleic acid sequence type on the same nucleic acid sequence indicating the presence of a nucleic acid sequence aberration. In the method, immobilization of a first hybridization probe is used to isolate a first set of nucleic acids in the sample which contain the first nucleic acid sequence type. Immobilization of a second hybridization probe is then used to isolate a second set of nucleic acids from within the first set of nucleic acids which contain the second nucleic acid sequence type. The second set of nucleic acids are then detected, their presence indicating the presence of a nucleic acid sequence aberration. 14 figs.

  11. Prediction of influenza B vaccine effectiveness from sequence data.

    PubMed

    Pan, Yidan; Deem, Michael W

    2016-08-31

    Influenza is a contagious respiratory illness that causes significant human morbidity and mortality, affecting 5-15% of the population in a typical epidemic season. Human influenza epidemics are caused by types A and B, with roughly 25% of human cases due to influenza B. Influenza B is a single-stranded RNA virus with a high mutation rate, and both prior immune history and vaccination put significant pressure on the virus to evolve. Due to the high rate of viral evolution, the influenza B vaccine component of the annual influenza vaccine is updated, roughly every other year in recent years. To predict when an update to the vaccine is needed, an estimate of expected vaccine effectiveness against a range of viral strains is required. We here introduce a method to measure antigenic distance between the influenza B vaccine and circulating viral strains. The measure correlates well with effectiveness of the influenza B component of the annual vaccine in humans between 1979 and 2014. We discuss how this measure of antigenic distance may be used in the context of annual influenza vaccine design and prediction of vaccine effectiveness. PMID:27473305

  12. Predicting effects of noncoding variants with deep learning-based sequence model.

    PubMed

    Zhou, Jian; Troyanskaya, Olga G

    2015-10-01

    Identifying functional effects of noncoding variants is a major challenge in human genetics. To predict the noncoding-variant effects de novo from sequence, we developed a deep learning-based algorithmic framework, DeepSEA (http://deepsea.princeton.edu/), that directly learns a regulatory sequence code from large-scale chromatin-profiling data, enabling prediction of chromatin effects of sequence alterations with single-nucleotide sensitivity. We further used this capability to improve prioritization of functional variants including expression quantitative trait loci (eQTLs) and disease-associated variants. PMID:26301843

  13. Predicting effects of noncoding variants with deep learning–based sequence model

    PubMed Central

    Zhou, Jian; Troyanskaya, Olga G

    2016-01-01

    Identifying functional effects of noncoding variants is a major challenge in human genetics. To predict the noncoding-variant effects de novo from sequence, we developed a deep learning–based algorithmic framework, DeepSEA (http://deepsea.princeton.edu/), that directly learns a regulatory sequence code from large-scale chromatin-profiling data, enabling prediction of chromatin effects of sequence alterations with single-nucleotide sensitivity. We further used this capability to improve prioritization of functional variants including expression quantitative trait loci (eQTLs) and disease-associated variants. PMID:26301843

  14. The amino acid sequence of protein CM-3 from Dendroaspis polylepis polylepis (black mamba) venom.

    PubMed

    Joubert, F J

    1985-01-01

    Protein CM-3 from Dendroaspis polylepis polylepis venom was purified by gel filtration and ion exchange chromatography. It comprises 65 amino acids including eight half-cystines. The complete amino acid sequence of protein CM-3 has been elucidated. The sequence (residues 1-50) resembles that of the N-terminal sequence of the subunits of a synergistic type protein and residues 51-65 that of the C-terminal sequence of an angusticeps type protein. Mixtures of protein CM-3 and angusticeps type proteins showed no apparent synergistic effect, in that their toxicity in combination was no greater than the sum of their individual toxicities. PMID:4029488

  15. A 3D-1D substitution matrix for protein fold recognition that includes predicted secondary structure of the sequence.

    PubMed

    Rice, D W; Eisenberg, D

    1997-04-11

    In protein fold recognition, a probe amino acid sequence is compared to a library of representative folds of known structure to identify a structural homolog. In cases where the probe and its homolog have clear sequence similarity, traditional residue substitution matrices have been used to predict the structural similarity. In cases where the probe is sequentially distant from its homolog, we have developed a (7 x 3 x 2 x 7 x 3) 3D-1D substitution matrix (called H3P2), calculated from a database of 119 structural pairs. Members of each pair share a similar fold, but have sequence identity less than 30%. Each probe sequence position is defined by one of seven residue classes and three secondary structure classes. Each homologous fold position is defined by one of seven residue classes, three secondary structure classes, and two burial classes. Thus the matrix is five-dimensional and contains 7 x 3 x 2 x 7 x 3 = 882 elements or 3D-1D scores. The first step in assigning a probe sequence to its homologous fold is the prediction of the three-state (helix, strand, coil) secondary structure of the probe; here we use the profile based neural network prediction of secondary structure (PHD) program. Then a dynamic programming algorithm uses the H3P2 matrix to align the probe sequence with structures in a representative fold library. To test the effectiveness of the H3P2 matrix a challenging, fold class diverse, and cross-validated benchmark assessment is used to compare the H3P2 matrix to the GONNET, PAM250, BLOSUM62 and a secondary structure only substitution matrix. For distantly related sequences the H3P2 matrix detects more homologous structures at higher reliabilities than do these other substitution matrices, based on sensitivity versus specificity plots (or SENS-SPEC plots). The added efficacy of the H3P2 matrix arises from its information on the statistical preferences for various sequence-structure environment combinations from very distantly related proteins. 

  16. cDNA cloning and structural characterization of a lectin from the mussel Crenomytilus grayanus with a unique amino acid sequence and antibacterial activity.

    PubMed

    Kovalchuk, Svetlana N; Chikalovets, Irina V; Chernikov, Oleg V; Molchanova, Valentina I; Li, Wei; Rasskazov, Valery A; Lukyanov, Pavel A

    2013-10-01

    An amino acid sequence of GalNAc/Gal-specific lectin from the mussel Crenomytilus grayanus (CGL) was determined by cDNA sequencing. CGL consists of 150 amino acid residues, contains three tandem repeats with high sequence similarities to each other (up to 73%) and does not belong to any known lectin family. According to circular dichroism results, CGL is a β/α-protein with a predominance of β-structure. CGL was predicted to adopt a β-trefoil fold. The lectin exhibits antibacterial activity and might be involved in the recognition and clearance of bacterial pathogens in the shellfish. PMID:23886951

  17. Interrogating and predicting tolerated sequence diversity in protein folds: application to E. elaterium trypsin inhibitor-II cystine-knot miniprotein.

    PubMed

    Lahti, Jennifer L; Silverman, Adam P; Cochran, Jennifer R

    2009-09-01

    Cystine-knot miniproteins (knottins) are promising molecular scaffolds for protein engineering applications. Members of the knottin family have multiple loops capable of displaying conformationally constrained polypeptides for molecular recognition. While previous studies have illustrated the potential of engineering knottins with modified loop sequences, a thorough exploration into the tolerated loop lengths and sequence space of a knottin scaffold has not been performed. In this work, we used the Ecballium elaterium trypsin inhibitor II (EETI) as a model member of the knottin family and constructed libraries of EETI loop-substituted variants with diversity in both amino acid sequence and loop length. Using yeast surface display, we isolated properly folded EETI loop-substituted clones and applied sequence analysis tools to assess the tolerated diversity of both amino acid sequence and loop length. In addition, we used covariance analysis to study the relationships between individual positions in the substituted loops, based on the expectation that correlated amino acid substitutions will occur between interacting residue pairs. We then used the results of our sequence and covariance analyses to successfully predict loop sequences that facilitated proper folding of the knottin when substituted into EETI loop 3. The sequence trends we observed in properly folded EETI loop-substituted clones will be useful for guiding future protein engineering efforts with this knottin scaffold. Furthermore, our findings demonstrate that the combination of directed evolution with sequence and covariance analyses can be a powerful tool for rational protein engineering. PMID:19730675

  18. Disjoint combinations profiling (DCP): a new method for the prediction of antibody CDR conformation from sequence

    PubMed Central

    Nikoloudis, Dimitris; Pitts, Jim E.

    2014-01-01

    The accurate prediction of the conformation of Complementarity-Determining Regions (CDRs) is important in modelling antibodies for protein engineering applications. Specifically, the Canonical paradigm has proved successful in predicting the CDR conformation in antibody variable regions. It relies on canonical templates which detail allowed residues at key positions in the variable region framework or in the CDR itself for 5 of the 6 CDRs. While no templates have as yet been defined for the hypervariable CDR-H3, instead, reliable sequence rules have been devised for predicting the base of the CDR-H3 loop. Here a new method termed Disjoint Combinations Profiling (DCP) is presented, which contributes a considerable advance in the prediction of CDR conformations. This novel method is explained and compared with canonical templates and sequence rules in a 3-way blind prediction. DCP achieved 93% accuracy over 951 blind predictions and showed an improvement in cumulative accuracy compared to predictions with canonical templates or sequence rules. In addition to its overall improvement in prediction accuracy, it is suggested that DCP is open to better implementations in the future and that it can improve as more antibody structures are deposited in the databank. In contrast, it is argued that canonical templates and sequence rules may have reached their peak. PMID:25071985

  19. Disjoint combinations profiling (DCP): a new method for the prediction of antibody CDR conformation from sequence.

    PubMed

    Nikoloudis, Dimitris; Pitts, Jim E; Saldanha, José W

    2014-01-01

    The accurate prediction of the conformation of Complementarity-Determining Regions (CDRs) is important in modelling antibodies for protein engineering applications. Specifically, the Canonical paradigm has proved successful in predicting the CDR conformation in antibody variable regions. It relies on canonical templates which detail allowed residues at key positions in the variable region framework or in the CDR itself for 5 of the 6 CDRs. While no templates have as yet been defined for the hypervariable CDR-H3, instead, reliable sequence rules have been devised for predicting the base of the CDR-H3 loop. Here a new method termed Disjoint Combinations Profiling (DCP) is presented, which contributes a considerable advance in the prediction of CDR conformations. This novel method is explained and compared with canonical templates and sequence rules in a 3-way blind prediction. DCP achieved 93% accuracy over 951 blind predictions and showed an improvement in cumulative accuracy compared to predictions with canonical templates or sequence rules. In addition to its overall improvement in prediction accuracy, it is suggested that DCP is open to better implementations in the future and that it can improve as more antibody structures are deposited in the databank. In contrast, it is argued that canonical templates and sequence rules may have reached their peak. PMID:25071985

  20. [Cloning of full-length coding sequence of tree shrew CD4 and prediction of its molecular characteristics].

    PubMed

    Tian, Wei-Wei; Gao, Yue-Dong; Guo, Yan; Huang, Jing-Fei; Xiao, Chang; Li, Zuo-Sheng; Zhang, Hua-Tang

    2012-02-01

    Tree shrews are an ideal animal model that has received extensive attention in human disease research, and work with them requires essential research tools, in particular cellular markers and monoclonal antibodies for immunological studies. In this paper, the full-length CD4 cDNA coding sequence of 1,365 bp was cloned from total RNA in the peripheral blood of tree shrews. The sequence fills two unknown gaps in the predicted tree shrew CD4 cDNA in the GenBank database, and its molecular characteristics were analyzed and compared with those of other mammals using software such as Clustal W2.0. The results showed that the extracellular and intracellular domains of the tree shrew CD4 amino acid sequence are conserved. The tree shrew CD4 amino acid sequence showed a close genetic relationship with those of Homo sapiens and Macaca mulatta. Most regions of the tree shrew CD4 molecular surface showed positive charges, as in humans. However, compared with the human CD4 extracellular domain D1, the CD4 D1 surface of tree shrews showed more negative charges and two more N-glycosylation sites, which may affect antibody binding. This study provides a theoretical basis for the preparation and functional studies of CD4 monoclonal antibodies. PMID:22345010

  1. Gene prediction with Glimmer for metagenomic sequences augmented by classification and clustering

    PubMed Central

    Kelley, David R.; Liu, Bo; Delcher, Arthur L.; Pop, Mihai; Salzberg, Steven L.

    2012-01-01

    Environmental shotgun sequencing (or metagenomics) is widely used to survey the communities of microbial organisms that live in many diverse ecosystems, such as the human body. Finding the protein-coding genes within the sequences is an important step for assessing the functional capacity of a metagenome. In this work, we developed a metagenomics gene prediction system Glimmer-MG that achieves significantly greater accuracy than previous systems via novel approaches to a number of important prediction subtasks. First, we introduce the use of phylogenetic classifications of the sequences to model parameterization. We also cluster the sequences, grouping together those that likely originated from the same organism. Analogous to iterative schemes that are useful for whole genomes, we retrain our models within each cluster on the initial gene predictions before making final predictions. Finally, we model both insertion/deletion and substitution sequencing errors using a different approach than previous software, allowing Glimmer-MG to change coding frame or pass through stop codons by predicting an error. In a comparison among multiple gene finding methods, Glimmer-MG makes the most sensitive and precise predictions on simulated and real metagenomes for all read lengths and error rates tested. PMID:22102569

  2. Sequence Prediction With Sparse Distributed Hyperdimensional Coding Applied to the Analysis of Mobile Phone Use Patterns.

    PubMed

    Rasanen, Okko J; Saarinen, Jukka P

    2016-09-01

    Modeling and prediction of temporal sequences is central to many signal processing and machine learning applications. Prediction based on sequence history is typically performed using parametric models, such as fixed-order Markov chains (n-grams), approximations of high-order Markov processes, such as mixed-order Markov models or mixtures of lagged bigram models, or with other machine learning techniques. This paper presents a method for sequence prediction based on sparse hyperdimensional coding of the sequence structure and describes how higher order temporal structures can be utilized in sparse coding in a balanced manner. The method is purely incremental, allowing real-time online learning and prediction with limited computational resources. Experiments with prediction of mobile phone use patterns, including the prediction of the next launched application, the next GPS location of the user, and the next artist played with the phone media player, reveal that the proposed method is able to capture the relevant variable-order structure from the sequences. In comparison with the n-grams and the mixed-order Markov models, the sparse hyperdimensional predictor clearly outperforms its peers in terms of unweighted average recall and achieves the same level of weighted average recall as the mixed-order Markov chain but without the batch training of the mixed-order model. PMID:26285224
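
    For orientation, the fixed-order Markov chain (n-gram) baseline mentioned in this abstract can be sketched in a few lines. This is not the sparse hyperdimensional method the paper proposes; it only illustrates the kind of count-based predictor it is compared against. The class name, the toy app-launch log, and the tie-breaking behaviour are assumptions for illustration.

```python
from collections import Counter, defaultdict


class NGramPredictor:
    """Fixed-order Markov predictor: counts which symbol follows each
    context made of the last `order` symbols."""

    def __init__(self, order: int = 2):
        if order < 1:
            raise ValueError("order must be at least 1")
        self.order = order
        self.counts = defaultdict(Counter)

    def update(self, history, symbol):
        """Incrementally record that `symbol` followed `history`."""
        context = tuple(history[-self.order:])
        self.counts[context][symbol] += 1

    def predict(self, history):
        """Most frequent successor of the current context, or None if
        the context has never been observed."""
        seen = self.counts.get(tuple(history[-self.order:]))
        if not seen:
            return None
        return seen.most_common(1)[0][0]


def fit(sequence, order: int = 2) -> NGramPredictor:
    model = NGramPredictor(order)
    for i in range(order, len(sequence)):
        model.update(sequence[:i], sequence[i])
    return model


# Hypothetical app-launch log, not data from the paper.
log = ["mail", "maps", "music", "mail", "maps", "music", "mail", "maps"]
model = fit(log, order=2)
print(model.predict(["mail", "maps"]))
```

    Because `update` works one observation at a time, this baseline can also be trained online; what it cannot do is back off to shorter contexts when a context is unseen, which is the gap variable-order models address.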

  3. The Chinese hamster Alu-equivalent sequence: a conserved highly repetitious, interspersed deoxyribonucleic acid sequence in mammals has a structure suggestive of a transposable element.

    PubMed Central

    Haynes, S R; Toomey, T P; Leinwand, L; Jelinek, W R

    1981-01-01

    A consensus sequence has been determined for a major interspersed deoxyribonucleic acid repeat in the genome of Chinese hamster ovary cells (CHO cells). This sequence is extensively homologous to (i) the human Alu sequence (P. L. Deininger et al., J. Mol. Biol., in press), (ii) the mouse B1 interspersed repetitious sequence (Krayev et al., Nucleic Acids Res. 8:1201-1215, 1980), (iii) an interspersed repetitious sequence from African green monkey deoxyribonucleic acid (Dhruva et al., Proc. Natl. Acad. Sci. U.S.A. 77:4514-4518, 1980), and (iv) the CHO and mouse 4.5S ribonucleic acid (this report; F. Harada and N. Kato, Nucleic Acids Res. 8:1273-1285, 1980). Because the CHO consensus sequence shows significant homology to the human Alu sequence, it is termed the CHO Alu-equivalent sequence. A conserved structure surrounding CHO Alu-equivalent family members can be recognized. It is similar to that surrounding the human Alu and the mouse B1 sequences, and is represented as follows: direct repeat-CHO-Alu-A-rich sequence-direct repeat. A composite interspersed repetitious sequence has been identified. Its structure is represented as follows: direct repeat-residue 47 to 107 of CHO-Alu-non-Alu repetitious sequence-A-rich sequence-direct repeat. Because the Alu flanking sequences resemble those that flank known transposable elements, we think it likely that the Alu sequence dispersed throughout the mammalian genome by transposition. Images PMID:9279371

  4. Assessing a novel approach for predicting local 3D protein structures from sequence.

    PubMed

    Benros, Cristina; de Brevern, Alexandre G; Etchebest, Catherine; Hazout, Serge

    2006-03-01

    We developed a novel approach for predicting local protein structure from sequence. It relies on the Hybrid Protein Model (HPM), an unsupervised clustering method we previously developed. This model learns three-dimensional protein fragments encoded into a structural alphabet of 16 protein blocks (PBs). Here, we focused on 11-residue fragments encoded as a series of seven PBs and used HPM to cluster them according to their local similarities. We thus built a library of 120 overlapping prototypes (mean fragments from each cluster), with good three-dimensional local approximation, i.e., a mean accuracy of 1.61 Å Cα root-mean-square distance. Our prediction method is intended to optimize the exploitation of the sequence-structure relations deduced from this library of long protein fragments. This was achieved by setting up a system of 120 experts, each defined by logistic regression to optimize the discrimination from sequence of a given prototype relative to the others. For a target sequence window, the experts computed probabilities of sequence-structure compatibility for the prototypes and ranked them, proposing the top scorers as structural candidates. Predictions were defined as successful when a prototype <2.5 Å from the true local structure was found among those proposed. Our strategy yielded a prediction rate of 51.2% for an average of 4.2 candidates per sequence window. We also proposed a confidence index to estimate prediction quality. Our approach predicts from sequence alone and will thus provide valuable information for proteins without structural homologs. Candidates will also contribute to global structure prediction by fragment assembly. PMID:16385557

  5. Computer Simulation of the Determination of Amino Acid Sequences in Polypeptides

    ERIC Educational Resources Information Center

    Daubert, Stephen D.; Sontum, Stephen F.

    1977-01-01

    Describes a computer program that generates a random string of amino acids and guides the student in determining the correct sequence of a given protein by using experimental analytic data for that protein. (MLH)

  6. Nucleotide sequence of the fadR gene, a multifunctional regulator of fatty acid metabolism in Escherichia coli.

    PubMed Central

    DiRusso, C C

    1988-01-01

    The Escherichia coli fadR gene is a multifunctional regulator of fatty acid and acetate metabolism. In the present work the nucleotide sequence of the 1.3 kb DNA fragment which encodes FadR has been determined. The coding sequence of the fadR gene is 714 nucleotides long and is preceded by a typical E. coli ribosome binding site and is followed by a sequence predicted to be sufficient for factor-independent chain termination. Primer extension experiments demonstrated that the transcription of the fadR gene initiates with an adenine nucleotide 33 nucleotides upstream from the predicted start of translation. The derived fadR peptide has a calculated molecular weight of 26,972. This is in reasonable agreement with the apparent molecular weight of 29,000 previously estimated on the basis of maxi-cell analysis of plasmid encoded proteins. There is a segment of twenty amino acids within the predicted peptide which resembles the DNA recognition and binding site of many transcriptional regulatory proteins. Images PMID:2843809

  7. Characterization of mouse cellular deoxyribonucleic acid homologous to Abelson murine leukemia virus-specific sequences.

    PubMed Central

    Dale, B; Ozanne, B

    1981-01-01

    The genome of Abelson murine leukemia virus (A-MuLV) consists of sequences derived from both BALB/c mouse deoxyribonucleic acid and the genome of Moloney murine leukemia virus. Using deoxyribonucleic acid linear intermediates as a source of retroviral deoxyribonucleic acid, we isolated a recombinant plasmid which contained 1.9 kilobases of the 3.5-kilobase mouse-derived sequences found in A-MuLV (A-MuLV-specific sequences). We used this clone, designated pSA-17, as a probe in restriction enzyme and Southern blot analyses to examine the arrangement of homologous sequences in BALB/c deoxyribonucleic acid (endogenous Abelson sequences). The endogenous Abelson sequences within the mouse genome were interrupted by noncoding regions, suggesting that a rearrangement of the cell sequences was required to produce the sequence found in the virus. Endogenous Abelson sequences were arranged similarly in mice that were susceptible to A-MuLV tumors and in mice that were resistant to A-MuLV tumors. An examination of three BALB/c plasmacytomas and a BALB/c early B-cell tumor likewise revealed no alteration in the arrangement of the endogenous Abelson sequences. Homology to pSA-17 was also observed in deoxyribonucleic acids prepared from rat, hamster, chicken, and human cells. An isolate of A-MuLV which encoded a 160,000-dalton transforming protein (P160) contained 700 more base pairs of mouse sequences than the standard A-MuLV isolate, which encoded a 120,000-dalton transforming protein (P120). Images PMID:9279386

  8. The amino acid sequence of monal pheasant lysozyme and its activity.

    PubMed

    Araki, T; Matsumoto, T; Torikata, T

    1998-10-01

    The amino acid sequence of monal pheasant lysozyme and its activity were analyzed. Carboxymethylated lysozyme was digested with trypsin and the resulting peptides were sequenced. The established amino acid sequence had one amino acid substitution at position 102 (Arg to Gly) compared with Indian peafowl lysozyme and four amino acid substitutions at positions 3 (Phe to Tyr), 15 (His to Leu), 41 (Gln to His), and 121 (Gln to His) compared with chicken lysozyme. Analysis of the time-courses of reaction using N-acetylglucosamine pentamer as a substrate showed a difference of binding free energy change (-0.4 kcal/mol) at subsites A between monal pheasant and Indian peafowl lysozyme. This was assumed to be caused by the amino acid substitution at subsite A with loss of a positive charge at position 102 (Arg102 to Gly). PMID:9836434

  9. kmer-SVM: a web server for identifying predictive regulatory sequence features in genomic data sets

    PubMed Central

    Fletez-Brant, Christopher; Lee, Dongwon; McCallion, Andrew S.; Beer, Michael A.

    2013-01-01

    Massively parallel sequencing technologies have made the generation of genomic data sets a routine component of many biological investigations. For example, chromatin immunoprecipitation followed by sequencing (ChIP-seq) assays detect genomic regions bound (directly or indirectly) by specific factors, and DNase-seq identifies regions of open chromatin. A major bottleneck in the interpretation of these data is the identification of the underlying DNA sequence code that defines, and ultimately facilitates prediction of, these transcription factor (TF) bound or open chromatin regions. We have recently developed a novel computational methodology, which uses a support vector machine (SVM) with kmer sequence features (kmer-SVM) to identify predictive combinations of short transcription factor-binding sites, which determine the tissue specificity of these genomic assays (Lee, Karchin and Beer, Discriminative prediction of mammalian enhancers from DNA sequence. Genome Res. 2011; 21:2167–80). This regulatory information can (i) give confidence in genomic experiments by recovering previously known binding sites, and (ii) reveal novel sequence features for subsequent experimental testing of cooperative mechanisms. Here, we describe the development and implementation of a web server to allow the broader research community to independently apply our kmer-SVM to analyze and interpret their genomic data sets. We analyze five recently published data sets and demonstrate how this tool identifies accessory factors and repressive sequence elements. kmer-SVM is available at http://kmersvm.beerlab.org. PMID:23771147
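
    The "kmer sequence features" mentioned in this record can be illustrated with a minimal sketch. It only shows how a DNA sequence might be turned into a fixed-length vector of k-mer frequencies that an SVM could consume; it is not the kmer-SVM server's code. The function names are hypothetical, and details such as merging reverse-complement k-mers are assumptions left out here.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq, k):
    """Count overlapping k-mers, skipping windows with non-ACGT characters."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def kmer_feature_vector(seq, k):
    """Return k-mer frequencies in lexicographic k-mer order (4**k entries)."""
    counts = kmer_counts(seq, k)
    total = sum(counts.values())
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    # Hypothetical normalisation: divide by the number of valid windows.
    return [counts[m] / total if total else 0.0 for m in vocabulary]
```

    Each sequence (for example a bound region versus a background region) would yield one such vector, and a classifier would then be trained on the labelled vectors.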

  10. Studies on monotreme proteins. VII. Amino acid sequence of myoglobin from the platypus, Ornithorhynchus anatinus.

    PubMed

    Fisher, W K; Thompson, E O

    1976-03-01

    Myoglobin isolated from skeletal muscle of the platypus contains 153 amino acid residues. The complete amino acid sequence has been determined following cleavage with cyanogen bromide and further digestion of the four fragments with trypsin, chymotrypsin, pepsin and thermolysin. Sequences of the purified peptides were determined by the dansyl-Edman procedure. The amino acid sequence showed 25 differences from human myoglobin and 24 from kangaroo myoglobin. Amino acid sequences in myoglobins are more conserved than sequences in the alpha- and beta-globin chains, and platypus myoglobin shows a similar number of variations in sequence to kangaroo myoglobin when compared with myoglobin of other species. The date of divergence of the platypus from other mammals was estimated at 102 +/- 31 million years, based on the number of amino acid differences between species and allowing for mutations during the evolutionary period. This estimate differs widely from the estimate given by similar treatment of the alpha- and beta-chain sequences and a constant rate of mutation of globin chains is not supported. PMID:962722

  11. cDNA-derived amino acid sequences of myoglobins from nine species of whales and dolphins.

    PubMed

    Iwanami, Kentaro; Mita, Hajime; Yamamoto, Yasuhiko; Fujise, Yoshihiro; Yamada, Tadasu; Suzuki, Tomohiko

    2006-10-01

    We determined the myoglobin (Mb) cDNA sequences of nine cetaceans, of which six are the first reports of Mb sequences: sei whale (Balaenoptera borealis), Bryde's whale (Balaenoptera edeni), pygmy sperm whale (Kogia breviceps), Stejneger's beaked whale (Mesoplodon stejnegeri), Longman's beaked whale (Indopacetus pacificus), and melon-headed whale (Peponocephala electra), and three confirm the previously determined chemical amino acid sequences: sperm whale (Physeter macrocephalus), common minke whale (Balaenoptera acutorostrata) and pantropical spotted dolphin (Stenella attenuata). We found two types of Mb in the skeletal muscle of pantropical spotted dolphin: Mb I with the same amino acid sequence as that deposited in the protein database, and Mb II, which differs at two amino acid residues compared with Mb I. Using an alignment of the amino acid or cDNA sequences of cetacean Mb, we constructed a phylogenetic tree by the NJ method. Clustering of cetacean Mb amino acid and cDNA sequences essentially follows the classical taxonomy of cetaceans, suggesting that Mb sequence data is valid for classification of cetaceans at least to the family level. PMID:16962803

  12. Complete Genome Sequence of Enterococcus mundtii QU 25, an Efficient l-(+)-Lactic Acid-Producing Bacterium

    PubMed Central

    Shiwa, Yuh; Yanase, Hiroaki; Hirose, Yuu; Satomi, Shohei; Araya-Kojima, Tomoko; Watanabe, Satoru; Zendo, Takeshi; Chibazakura, Taku; Shimizu-Kadota, Mariko; Yoshikawa, Hirofumi; Sonomoto, Kenji

    2014-01-01

    Enterococcus mundtii QU 25, a non-dairy bacterial strain of ovine faecal origin, can ferment both cellobiose and xylose to produce l-lactic acid. The use of this strain is highly desirable for economical l-lactate production from renewable biomass substrates. Genome sequence determination is necessary for the genetic improvement of this strain. We report the complete genome sequence of strain QU 25, primarily determined using Pacific Biosciences sequencing technology. The E. mundtii QU 25 genome comprises a 3 022 186-bp single circular chromosome (GC content, 38.6%) and five circular plasmids: pQY182, pQY082, pQY039, pQY024, and pQY003. In all, 2900 protein-coding sequences, 63 tRNA genes, and 6 rRNA operons were predicted in the QU 25 chromosome. Plasmid pQY024 harbours genes for mundticin production. We found that strain QU 25 produces a bacteriocin, suggesting that mundticin-encoded genes on plasmid pQY024 were functional. For lactic acid fermentation, two gene clusters were identified—one involved in the initial metabolism of xylose and uptake of pentose and the second containing genes for the pentose phosphate pathway and uptake of related sugars. This is the first complete genome sequence of an E. mundtii strain. The data provide insights into lactate production in this bacterium and its evolution among enterococci. PMID:24568933

  13. Prediction of antibiotic resistance proteins from sequence-derived properties irrespective of sequence similarity.

    PubMed

    Zhang, H L; Lin, H H; Tao, L; Ma, X H; Dai, J L; Jia, J; Cao, Z W

    2008-09-01

    Increasing antibiotic resistance has become a worldwide challenge to the clinical treatment of infectious diseases. The identification of antibiotic resistance proteins (ARPs) would be helpful in the discovery of new therapeutic targets and the design of novel drugs to control the potential spread of antibiotic resistance. In this work, a support vector machine (SVM)-based ARP prediction system was developed using 1308 ARPs and 15587 non-ARPs. Its performance was evaluated using 313 ARPs and 7156 non-ARPs. The computed prediction accuracy was 88.5% for ARPs and 99.2% for non-ARPs. A potential application of this method is the identification of ARPs non-homologous to proteins of known function. Further genome screening found that ca. 3.5% and 3.2% of proteins in Escherichia coli and Staphylococcus aureus, respectively, are potential ARPs. These results suggest the usefulness of SVMs for facilitating the identification of ARPs. The software can be accessed at SARPI (Server for Antibiotic Resistance Protein Identification). PMID:18583101
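
    Because the evaluation set in this record is strongly imbalanced (313 ARPs versus 7156 non-ARPs), the two per-class accuracies can be combined in more than one way. The sketch below is illustrative arithmetic on the figures quoted in the abstract, not a result reported by the authors; the function names are hypothetical.

```python
def overall_accuracy(sensitivity, specificity, n_pos, n_neg):
    """Fraction of all examples classified correctly, weighted by class size."""
    correct = sensitivity * n_pos + specificity * n_neg
    return correct / (n_pos + n_neg)


def balanced_accuracy(sensitivity, specificity):
    """Unweighted mean of the two per-class accuracies."""
    return (sensitivity + specificity) / 2


# Figures taken from the abstract: 88.5% on 313 ARPs, 99.2% on 7156 non-ARPs.
overall = overall_accuracy(0.885, 0.992, 313, 7156)
balanced = balanced_accuracy(0.885, 0.992)
print(f"overall accuracy: {overall:.4f}, balanced accuracy: {balanced:.4f}")
```

    The overall figure is dominated by the much larger non-ARP class, which is why per-class accuracies are the more informative numbers for a rare positive class.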

  14. Severe accident source term characteristics for selected Peach Bottom sequences predicted by the MELCOR Code

    SciTech Connect

    Carbajo, J.J.

    1993-09-01

    The purpose of this report is to compare in-containment source terms developed for NUREG-1159, which used the Source Term Code Package (STCP), with those generated by MELCOR to identify significant differences. For this comparison, two short-term depressurized station blackout sequences (with a dry cavity and with a flooded cavity) and a Loss-of-Coolant Accident (LOCA) concurrent with complete loss of the Emergency Core Cooling System (ECCS) were analyzed for the Peach Bottom Atomic Power Station (a BWR-4 with a Mark I containment). The results indicate that for the sequences analyzed, the two codes predict similar total in-containment release fractions for each of the element groups. However, the MELCOR/CORBH Package predicts significantly longer times for vessel failure and reduced energy of the released material for the station blackout sequences (when compared to the STCP results). MELCOR also calculated smaller releases into the environment than STCP for the station blackout sequences.

  15. Comparison of Predicted Scaffold-Compatible Sequence Variation in the Triple-Hairpin Structure of Human Immunodeficiency Virus Type 1 gp41 with Patient Data

    PubMed Central

    Boutonnet, Nathalie; Janssens, Wouter; Boutton, Carlo; Verschelde, Jean-Luc; Heyndrickx, Leo; Beirnaert, Els; van der Groen, Guido; Lasters, Ignace

    2002-01-01

    It has been proposed that the ectodomain of human immunodeficiency virus type 1 (HIV-1) gp41 (e-gp41), involved in HIV entry into the target cell, exists in at least two conformations, a pre-hairpin intermediate and a fusion-active hairpin structure. To obtain more information on the structure-sequence relationship in e-gp41, we performed in silico a full single-amino-acid substitution analysis, resulting in a Fold Compatible Database (FCD) for each conformation. The FCD contains for each residue position in a given protein a list of values assessing the energetic compatibility (ECO) of each of the 20 natural amino acids at that position. Our results suggest that FCD predictions are in good agreement with the sequence variation observed for well-validated e-gp41 sequences. The data show that at a minECO threshold value of 5 kcal/mol, about 90% of the observed patient sequence variation is encompassed by the FCD predictions. Some inconsistent FCD predictions at N-helix positions packing against residues of the C helix suggest that packing of both peptides may involve some flexibility and may be attributed to an altered orientation of the C-helical domain versus the N-helical region. The permissiveness of sequence variation in the C helices is in agreement with FCD predictions. Comparison of N-core and triple-hairpin FCDs suggests that the N helices may impose more constraints on sequence variation than the C helices. Although the observed sequences of e-gp41 contain many multiple mutations, our method, which is based on single-point mutations, can predict the natural sequence variability of e-gp41 very well. PMID:12097573

  16. Draft Genome Sequences of Two Novel Acidimicrobiaceae Members from an Acid Mine Drainage Biofilm Metagenome.

    PubMed

    Pinto, Ameet J; Sharp, Jonathan O; Yoder, Michael J; Almstrand, Robert

    2016-01-01

    Bacteria belonging to the family Acidimicrobiaceae are frequently encountered in heavy metal-contaminated acidic environments. However, their phylogenetic and metabolic diversity is poorly resolved. We present draft genome sequences of two novel and phylogenetically distinct Acidimicrobiaceae members assembled from an acid mine drainage biofilm metagenome. PMID:26769942

  17. Draft Genome Sequences of Two Novel Acidimicrobiaceae Members from an Acid Mine Drainage Biofilm Metagenome

    PubMed Central

    Pinto, Ameet J.; Sharp, Jonathan O.; Yoder, Michael J.

    2016-01-01

    Bacteria belonging to the family Acidimicrobiaceae are frequently encountered in heavy metal-contaminated acidic environments. However, their phylogenetic and metabolic diversity is poorly resolved. We present draft genome sequences of two novel and phylogenetically distinct Acidimicrobiaceae members assembled from an acid mine drainage biofilm metagenome. PMID:26769942

  18. Amino acid sequence homology between Piv, an essential protein in site-specific DNA inversion in Moraxella lacunata, and transposases of an unusual family of insertion elements.

    PubMed Central

    Lenich, A G; Glasgow, A C

    1994-01-01

    Deletion analysis of the subcloned DNA inversion region of Moraxella lacunata indicates that Piv is the only M. lacunata-encoded factor required for site-specific inversion of the tfpQ/tfpI pilin segment. The predicted amino acid sequence of Piv shows significant homology solely with the transposases/integrases of a family of insertion sequence elements, suggesting that Piv is a novel site-specific recombinase. PMID:8021196

  19. αIIbβ3 variants defined by next-generation sequencing: Predicting variants likely to cause Glanzmann thrombasthenia

    PubMed Central

    Buitrago, Lorena; Rendon, Augusto; Liang, Yupu; Simeoni, Ilenia; Negri, Ana; Filizola, Marta; Ouwehand, Willem H.; Coller, Barry S.; Alessi, Marie-Christine; Ballmaier, Matthias; Bariana, Tadbir; Bellissimo, Daniel; Bertoli, Marta; Bray, Paul; Bury, Loredana; Carrell, Robin; Cattaneo, Marco; Collins, Peter; French, Deborah; Favier, Remi; Freson, Kathleen; Furie, Bruce; Germeshausen, Manuela; Ghevaert, Cedric; Gomez, Keith; Goodeve, Anne; Gresele, Paolo; Guerrero, Jose; Hampshire, Dan J.; Hadinnapola, Charaka; Heemskerk, Johan; Henskens, Yvonne; Hill, Marian; Hogg, Nancy; Johnsen, Jill; Kahr, Walter; Kerr, Ron; Kunishima, Shinji; Laffan, Michael; Natwani, Amit; Neerman-Arbez, Marguerite; Nurden, Paquita; Nurden, Alan; Ormiston, Mark; Othman, Maha; Ouwehand, Willem; Perry, David; Vilk, Shoshana Ravel; Reitsma, Pieter; Rondina, Matthew; Simeoni, Ilenia; Smethurst, Peter; Stephens, Jonathan; Stevenson, William; Szkotak, Artur; Turro, Ernest; Van Geet, Christel; Vries, Minka; Ward, June; Waye, John; Westbury, Sarah; Whiteheart, Sidney; Wilcox, David; Zhang, Bi

    2015-01-01

    Next-generation sequencing is transforming our understanding of human genetic variation but assessing the functional impact of novel variants presents challenges. We analyzed missense variants in the integrin αIIbβ3 receptor subunit genes ITGA2B and ITGB3 identified by whole-exome or -genome sequencing in the ThromboGenomics project, comprising ∼32,000 alleles from 16,108 individuals. We analyzed the results in comparison with 111 missense variants in these genes previously reported as being associated with Glanzmann thrombasthenia (GT), 20 associated with alloimmune thrombocytopenia, and 5 associated with aniso/macrothrombocytopenia. We identified 114 novel missense variants in ITGA2B (affecting ∼11% of the amino acids) and 68 novel missense variants in ITGB3 (affecting ∼9% of the amino acids). Of the variants, 96% had minor allele frequencies (MAF) < 0.1%, indicating their rarity. Based on sequence conservation, MAF, and location on a complete model of αIIbβ3, we selected three novel variants that affect amino acids previously associated with GT for expression in HEK293 cells. αIIb P176H and β3 C547G severely reduced αIIbβ3 expression, whereas αIIb P943A partially reduced αIIbβ3 expression and had no effect on fibrinogen binding. We used receiver operating characteristic curves of combined annotation-dependent depletion, Polyphen 2-HDIV, and sorting intolerant from tolerant to estimate the percentage of novel variants likely to be deleterious. At optimal cut-off values, which had 69–98% sensitivity in detecting GT mutations, between 27% and 71% of the novel αIIb or β3 missense variants were predicted to be deleterious. Our data have implications for understanding the evolutionary pressure on αIIbβ3 and highlight the challenges in predicting the clinical significance of novel missense variants. PMID:25827233

  20. αIIbβ3 variants defined by next-generation sequencing: predicting variants likely to cause Glanzmann thrombasthenia.

    PubMed

    Buitrago, Lorena; Rendon, Augusto; Liang, Yupu; Simeoni, Ilenia; Negri, Ana; Filizola, Marta; Ouwehand, Willem H; Coller, Barry S

    2015-04-14

    Next-generation sequencing is transforming our understanding of human genetic variation but assessing the functional impact of novel variants presents challenges. We analyzed missense variants in the integrin αIIbβ3 receptor subunit genes ITGA2B and ITGB3 identified by whole-exome or -genome sequencing in the ThromboGenomics project, comprising ∼32,000 alleles from 16,108 individuals. We analyzed the results in comparison with 111 missense variants in these genes previously reported as being associated with Glanzmann thrombasthenia (GT), 20 associated with alloimmune thrombocytopenia, and 5 associated with aniso/macrothrombocytopenia. We identified 114 novel missense variants in ITGA2B (affecting ∼11% of the amino acids) and 68 novel missense variants in ITGB3 (affecting ∼9% of the amino acids). Of the variants, 96% had minor allele frequencies (MAF) < 0.1%, indicating their rarity. Based on sequence conservation, MAF, and location on a complete model of αIIbβ3, we selected three novel variants that affect amino acids previously associated with GT for expression in HEK293 cells. αIIb P176H and β3 C547G severely reduced αIIbβ3 expression, whereas αIIb P943A partially reduced αIIbβ3 expression and had no effect on fibrinogen binding. We used receiver operating characteristic curves of combined annotation-dependent depletion, Polyphen 2-HDIV, and sorting intolerant from tolerant to estimate the percentage of novel variants likely to be deleterious. At optimal cut-off values, which had 69-98% sensitivity in detecting GT mutations, between 27% and 71% of the novel αIIb or β3 missense variants were predicted to be deleterious. Our data have implications for understanding the evolutionary pressure on αIIbβ3 and highlight the challenges in predicting the clinical significance of novel missense variants. PMID:25827233
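
    The abstract refers to "optimal cut-off values" on receiver operating characteristic curves without stating how the optimum was chosen. One common choice, assumed here purely for illustration, is the threshold that maximises Youden's J (sensitivity + specificity - 1). The function names and the toy scores are hypothetical.

```python
def roc_points(scores, labels):
    """Return (threshold, sensitivity, specificity) for each distinct score.

    A variant is called deleterious when score >= threshold.
    Assumes labels are 0/1 and that both classes are present.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        points.append((t, tp / n_pos, tn / n_neg))
    return points


def youden_cutoff(scores, labels):
    """Pick the threshold maximising sensitivity + specificity - 1."""
    return max(roc_points(scores, labels), key=lambda p: p[1] + p[2] - 1)


# Toy data: 1 = known disease-associated variant, 0 = presumed benign.
cutoff, sens, spec = youden_cutoff([0.1, 0.2, 0.7, 0.9], [0, 0, 1, 1])
```

    Applying the chosen cut-off to unlabelled novel variants then gives the fraction predicted deleterious, which is the kind of estimate the abstract reports.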

  1. EnsemPro: an ensemble approach to predicting transcription start sites in human genomic DNA sequences.

    PubMed

    Won, Hong-Hee; Kim, Min-Ji; Kim, Seonwoo; Kim, Jong-Won

    2008-03-01

    Although several computational methods have been developed to identify transcription start sites (TSSs)/promoters, the computational prediction still needs improvement. Due to low performance, the promoter prediction programs can provide misleading results in functional genomic studies. To improve the prediction accuracy, we propose the use of an ensemble approach, EnsemPro (Ensemble Promoter), which combines the prediction results of the existing promoter predictors. We schematically compared the prediction performance of the currently available promoter prediction programs in an identical evaluating environment, and the results served as a guide for choosing the combined predictors. We applied three representative ensemble schemes (the majority voting, the weighted voting, and the Bayesian approach) for the TSS prediction of hundreds of human genomic sequences. EnsemPro identified the TSSs more precisely than other combining methods as well as the currently available individual predictor programs. The source code of EnsemPro is available on request from the authors. PMID:18164178
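
    Two of the ensemble schemes named in this record, majority voting and weighted voting, can be sketched generically as follows. This is not EnsemPro's code (which the abstract says is available from the authors); the function names, the 0/1 encoding of each predictor's call, and the weights are all hypothetical.

```python
def majority_vote(predictions):
    """Return 1 if more than half of the predictors call a TSS, else 0."""
    return int(sum(predictions) * 2 > len(predictions))


def weighted_vote(predictions, weights, threshold=0.5):
    """Return 1 if the weight share of predictors calling a TSS exceeds threshold.

    Weights might, for instance, reflect each predictor's accuracy on a
    benchmark; that choice is an assumption, not taken from the abstract.
    """
    total = sum(weights)
    share = sum(w for p, w in zip(predictions, weights) if p) / total
    return int(share > threshold)
```

    With calls [1, 0, 0] from three predictors, majority voting rejects the site, whereas weighted voting accepts it if the single positive predictor carries most of the weight.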

  2. Two distinct ferredoxins from Rhodobacter capsulatus: complete amino acid sequences and molecular evolution.

    PubMed

    Saeki, K; Suetsugu, Y; Yao, Y; Horio, T; Marrs, B L; Matsubara, H

    1990-09-01

    Two distinct ferredoxins were purified from Rhodobacter capsulatus SB1003. Their complete amino acid sequences were determined by a combination of protease digestion, BrCN cleavage and Edman degradation. Ferredoxins I and II were composed of 64 and 111 amino acids, respectively, with molecular weights of 6,728 and 12,549 excluding iron and sulfur atoms. Both contained two Cys clusters in their amino acid sequences. The first cluster of ferredoxin I and the second cluster of ferredoxin II had a sequence, CxxCxxCxxxCP, in common with the ferredoxins found in Clostridia. The second cluster of ferredoxin I had a sequence, CxxCxxxxxxxxCxxxCM, with extra amino acids between the second and third Cys, which has been reported for other photosynthetic bacterial ferredoxins and putative ferredoxins (nif-gene products) from nitrogen-fixing bacteria, and with a unique occurrence of Met. The first cluster of ferredoxin II had a CxxCxxxxCxxxCP sequence, with two additional amino acids between the second and third Cys, a characteristic feature of Azotobacter-[3Fe-4S] [4Fe-4S]-ferredoxin. Ferredoxin II was also similar to Azotobacter-type ferredoxins with an extended carboxyl (C-) terminal sequence compared to the common Clostridium-type. The evolutionary relationship of the two together with a putative one recently found to be encoded in the nifENXQ region in this bacterium [Moreno-Vivian et al. (1989) J. Bacteriol. 171, 2591-2598] is discussed. PMID:2277040
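
    The Cys-cluster motifs quoted in this record (for example CxxCxxCxxxCP, where x stands for any residue) map directly onto regular expressions. The sketch below is a minimal illustration; the function names are hypothetical and the test sequence is an invented toy string, not a real ferredoxin sequence.

```python
import re


def motif_to_regex(motif):
    """Translate a motif such as 'CxxCxxCxxxCP' into a compiled regex.

    Lower-case 'x' matches any single residue; other letters match literally.
    """
    return re.compile("".join("." if ch == "x" else re.escape(ch) for ch in motif))


def find_motif(motif, sequence):
    """Return 0-based start positions of non-overlapping motif matches."""
    return [m.start() for m in motif_to_regex(motif).finditer(sequence)]


# Invented toy sequence containing one Clostridium-type cluster.
toy_sequence = "MACAACAACAAACPGG"
clostridial_hits = find_motif("CxxCxxCxxxCP", toy_sequence)
azotobacter_hits = find_motif("CxxCxxxxCxxxCP", toy_sequence)
```

    Scanning a sequence with all three motifs from the abstract would show which cluster types it carries and where they sit.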

  3. Amino Acid Sequence of Anionic Peroxidase from the Windmill Palm Tree Trachycarpus fortunei

    PubMed Central

    2015-01-01

    Palm peroxidases are extremely stable and have uncommon substrate specificity. This study was designed to fill in the knowledge gap about the structures of a peroxidase from the windmill palm tree Trachycarpus fortunei. The complete amino acid sequence and partial glycosylation were determined by MALDI-top-down sequencing of native windmill palm tree peroxidase (WPTP), MALDI-TOF/TOF MS/MS of WPTP tryptic peptides, and cDNA sequencing. The propeptide of WPTP contained N- and C-terminal signal sequences which contained 21 and 17 amino acid residues, respectively. Mature WPTP was 306 amino acids in length, and its carbohydrate content ranged from 21% to 29%. Comparison to closely related royal palm tree peroxidase revealed structural features that may explain differences in their substrate specificity. The results can be used to guide engineering of WPTP and its novel applications. PMID:25383699

  4. Predicting Mendelian Disease-Causing Non-Synonymous Single Nucleotide Variants in Exome Sequencing Studies

    PubMed Central

    Bao, Su-Ying; Yang, Wanling; Ho, Shu-Leong; Song, Yong-Qiang; Sham, Pak C.

    2013-01-01

    Exome sequencing is becoming a standard tool for mapping Mendelian disease-causing (or pathogenic) non-synonymous single nucleotide variants (nsSNVs). Minor allele frequency (MAF) filtering approach and functional prediction methods are commonly used to identify candidate pathogenic mutations in these studies. Combining multiple functional prediction methods may increase accuracy in prediction. Here, we propose to use a logit model to combine multiple prediction methods and compute an unbiased probability of a rare variant being pathogenic. Also, for the first time we assess the predictive power of seven prediction methods (including SIFT, PolyPhen2, CONDEL, and logit) in predicting pathogenic nsSNVs from other rare variants, which reflects the situation after MAF filtering is done in exome-sequencing studies. We found that a logit model combining all or some original prediction methods outperforms other methods examined, but is unable to discriminate between autosomal dominant and autosomal recessive disease mutations. Finally, based on the predictions of the logit model, we estimate that an individual has around 5% of rare nsSNVs that are pathogenic and carries ∼22 pathogenic derived alleles at least, which if made homozygous by consanguineous marriages may lead to recessive diseases. PMID:23341771
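
    The idea of combining several prediction scores with a logit model can be sketched as follows. The abstract does not give the fitted coefficients, so the weights and intercept below are hypothetical placeholders; in practice they would be estimated from labelled pathogenic and benign variants.

```python
import math


def logit_combine(scores, weights, intercept):
    """Combine per-method scores into one probability with a logistic model.

    p = 1 / (1 + exp(-(intercept + sum_i weights[i] * scores[i])))
    """
    z = intercept + sum(w * s for w, s in zip(weights, scores))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical scores from three tools (higher = more damaging) and
# hypothetical coefficients; none of these numbers come from the paper.
p_pathogenic = logit_combine([0.9, 0.8, 0.7], [2.0, 1.5, 1.0], -2.5)
```

    Because the output is a probability, a single cut-off can be applied to the combined score instead of reconciling conflicting calls from the individual tools.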

  5. Predicting mendelian disease-causing non-synonymous single nucleotide variants in exome sequencing studies.

    PubMed

    Li, Miao-Xin; Kwan, Johnny S H; Bao, Su-Ying; Yang, Wanling; Ho, Shu-Leong; Song, Yong-Qiang; Sham, Pak C

    2013-01-01

    Exome sequencing is becoming a standard tool for mapping Mendelian disease-causing (or pathogenic) non-synonymous single nucleotide variants (nsSNVs). Minor allele frequency (MAF) filtering approach and functional prediction methods are commonly used to identify candidate pathogenic mutations in these studies. Combining multiple functional prediction methods may increase accuracy in prediction. Here, we propose to use a logit model to combine multiple prediction methods and compute an unbiased probability of a rare variant being pathogenic. Also, for the first time we assess the predictive power of seven prediction methods (including SIFT, PolyPhen2, CONDEL, and logit) in predicting pathogenic nsSNVs from other rare variants, which reflects the situation after MAF filtering is done in exome-sequencing studies. We found that a logit model combining all or some original prediction methods outperforms other methods examined, but is unable to discriminate between autosomal dominant and autosomal recessive disease mutations. Finally, based on the predictions of the logit model, we estimate that an individual has around 5% of rare nsSNVs that are pathogenic and carries ~22 pathogenic derived alleles at least, which if made homozygous by consanguineous marriages may lead to recessive diseases. PMID:23341771

  6. Protein chemotaxonomy. XIII. Amino acid sequence of ferredoxin from Panax ginseng.

    PubMed

    Mino, Yoshiki

    2006-08-01

    The complete amino acid sequence of [2Fe-2S] ferredoxin from Panax ginseng (Araliaceae) has been determined by automated Edman degradation of the entire S-carboxymethylcysteinyl protein and of the peptides obtained by enzymatic digestion. This ferredoxin has a unique amino acid sequence, which includes an insertion of Tyr at the 3rd position from the amino-terminus and a deletion of two amino acid residues at the carboxyl terminus. This ferredoxin had 18 differences in its amino acid sequence compared to that of Petroselinum sativum (Umbelliferae). In contrast, 23-33 differences were observed compared to other dicotyledonous plants. This suggests that Panax ginseng is related taxonomically to umbelliferous plants. PMID:16880642
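
    Difference counts such as the "18 differences" quoted in this record are obtained by comparing aligned sequences position by position. A minimal sketch, assuming the two sequences have already been aligned and use "-" for gaps, is given below; the function name and the toy strings are hypothetical, and how the original study scored insertions and deletions is not stated in the abstract.

```python
def count_differences(aligned_a, aligned_b, gap="-"):
    """Count substitutions and indel positions between two aligned sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    substitutions = 0
    indels = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == b:
            continue  # identical residues (or a shared gap) are not differences
        if a == gap or b == gap:
            indels += 1
        else:
            substitutions += 1
    return substitutions, indels


# Invented toy alignment, not real ferredoxin sequences.
subs, gaps = count_differences("ATYKV-", "A-YRVE")
```

    Pairwise counts of this kind, computed for every pair of species, are the raw input for the taxonomic comparisons described in the abstract.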

  7. Complete amino acid sequence and structure characterization of the taste-modifying protein, miraculin.

    PubMed

    Theerasilp, S; Hitotsuya, H; Nakajo, S; Nakaya, K; Nakamura, Y; Kurihara, Y

    1989-04-25

    The taste-modifying protein, miraculin, has the unusual property of modifying sour taste into sweet taste. The complete amino acid sequence of miraculin purified from miracle fruits by a newly developed method (Theerasilp, S., and Kurihara, Y. (1988) J. Biol. Chem. 263, 11536-11539) was determined by an automatic Edman degradation method. Miraculin was a single polypeptide with 191 amino acid residues. The calculated molecular weight based on the amino acid sequence and the carbohydrate content (13.9%) was 24,600. Asn-42 and Asn-186 were linked N-glycosidically to carbohydrate chains. High homology was found between the amino acid sequences of miraculin and soybean trypsin inhibitor. PMID:2708331

  8. Complete cDNA and derived amino acid sequence of human factor V

    SciTech Connect

    Jenny, R.J.; Pittman, D.D.; Toole, J.J.; Kriz, R.W.; Aldape, R.A.; Hewick, R.M.; Kaufman, R.J.; Mann, K.G.

    1987-07-01

    cDNA clones encoding human factor V have been isolated from an oligo(dT)-primed human fetal liver cDNA library prepared with vector Charon 21A. The cDNA sequence of factor V from three overlapping clones includes a 6672-base-pair (bp) coding region, a 90-bp 5' untranslated region, and a 163-bp 3' untranslated region within which is a poly(A) tail. The deduced amino acid sequence consists of 2224 amino acids inclusive of a 28-amino acid leader peptide. Direct comparison with human factor VIII reveals considerable homology between proteins in amino acid sequence and domain structure: a triplicated A domain and duplicated C domain show approx. 40% identity with the corresponding domains in factor VIII. As in factor VIII, the A domains of factor V share approx. 40% amino acid-sequence homology with the three highly conserved domains in ceruloplasmin. The B domain of factor V contains 35 tandem and approx. 9 additional semiconserved repeats of nine amino acids of the form Asp-Leu-Ser-Gln-Thr-Thr/Asn-Leu-Ser-Pro and 2 additional semiconserved repeats of 17 amino acids. Factor V contains 37 potential N-linked glycosylation sites, 25 of which are in the B domain, and a total of 19 cysteine residues.

  9. N-terminal sequence of amino acids and some properties of an acid-stable alpha-amylase from citric acid-koji (Aspergillus usamii var.).

    PubMed

    Suganuma, T; Tahara, N; Kitahara, K; Nagahama, T; Inuzuka, K

    1996-01-01

    An acid-stable alpha-amylase (AA) was purified from an acidic extract of citric acid-koji (A. usamii var.). The N-terminal sequence of the first 20 amino acids of the enzyme was identical with that of AA from A. niger, but the two enzymes differed in molecular weight. HPLC analysis for identifying the anomers of products indicated that the AA hydrolyzed maltopentaose (G5) at the third glycoside bond predominantly, which differed from Taka-amylase A and the neutral alpha-amylase (NA) from the citric acid-koji. PMID:8824843

  10. All-atom 3D structure prediction of transmembrane β-barrel proteins from sequences

    PubMed Central

    Hayat, Sikander; Sander, Chris; Marks, Debora S.

    2015-01-01

    Transmembrane β-barrels (TMBs) carry out major functions in substrate transport and protein biogenesis but experimental determination of their 3D structure is challenging. Encouraged by successful de novo 3D structure prediction of globular and α-helical membrane proteins from sequence alignments alone, we developed an approach to predict the 3D structure of TMBs. The approach combines the maximum-entropy evolutionary coupling method for predicting residue contacts (EVfold) with a machine-learning approach (boctopus2) for predicting β-strands in the barrel. In a blinded test for 19 TMB proteins of known structure that have a sufficient number of diverse homologous sequences available, this combined method (EVfold_bb) predicts hydrogen-bonded residue pairs between adjacent β-strands at an accuracy of ∼70%. This accuracy is sufficient for the generation of all-atom 3D models. In the transmembrane barrel region, the average 3D structure accuracy [template-modeling (TM) score] of top-ranked models is 0.54 (ranging from 0.36 to 0.85), with a higher (44%) number of residue pairs in correct strand–strand registration than in earlier methods (18%). Although the nonbarrel regions are predicted less accurately overall, the evolutionary couplings identify some highly constrained loop residues and, for FecA protein, the barrel including the structure of a plug domain can be accurately modeled (TM score = 0.68). Lower prediction accuracy tends to be associated with insufficient sequence information and we therefore expect increasing numbers of β-barrel families to become accessible to accurate 3D structure prediction as the number of available sequences increases. PMID:25858953

  11. Prediction strength modulates responses in human area CA1 to sequence violations

    PubMed Central

    Cook, Paul A.; Wagner, Anthony D.

    2015-01-01

    Emerging human, animal, and computational evidence suggest that, within the hippocampus, stored memories are compared with current sensory input to compute novelty, i.e., detecting when inputs deviate from expectations. Hippocampal subfield CA1 is thought to detect mismatches between past and present, and detected novelty is thought to modulate encoding processes, providing a mechanism for gating the entry of information into memory. Using high-resolution functional MRI, we examined human hippocampal subfield and medial temporal lobe cortical activation during prediction violations within a sequence of events unfolding over time. Subjects encountered sequences of four visual stimuli that were then reencountered in the same temporal order (Repeat) or a rearranged order (Violation). Prediction strength was manipulated by varying whether the sequence was initially presented once (Weak) or thrice (Strong) prior to the critical Repeat or Violation sequence. Analyses of blood oxygen level-dependent signals revealed that task-responsive voxels in anatomically defined CA1, CA23/dentate gyrus, and perirhinal cortex were more active when expectations were violated than when confirmed. Additionally, stronger prediction violations elicited greater activity than weaker violations in CA1, and CA1 contained the greatest proportion of voxels displaying this prediction violation pattern relative to other medial temporal lobe regions. Finally, a memory test with a separate group of subjects showed that subsequent recognition memory was superior for items that had appeared in prediction violation trials than in prediction confirmation trials. These findings indicate that CA1 responds to temporal order prediction violations, and that this response is modulated by prediction strength. PMID:26063773

  12. Improving protein fold recognition and structural class prediction accuracies using physicochemical properties of amino acids.

    PubMed

    Raicar, Gaurav; Saini, Harsh; Dehzangi, Abdollah; Lal, Sunil; Sharma, Alok

    2016-08-01

    Predicting the three-dimensional (3-D) structure of a protein is an important task in bioinformatics and the biological sciences. However, directly predicting the 3-D structure from the primary structure is hard to achieve. Therefore, predicting the fold or structural class of a protein sequence is generally used as an intermediate step in determining the protein's 3-D structure. For protein fold recognition (PFR) and structural class prediction (SCP), two steps are required: a feature extraction step and a classification step. Feature extraction techniques generally utilize syntactical-based information, evolutionary-based information and physicochemical-based information to extract features. In this study, we explore the importance of utilizing the physicochemical properties of amino acids for improving PFR and SCP accuracies. For this, we propose a Forward Consecutive Search (FCS) scheme which aims to strategically select physicochemical attributes that will supplement the existing feature extraction techniques for PFR and SCP. An exhaustive search is conducted on all the existing 544 physicochemical attributes using the proposed FCS scheme and a subset of physicochemical attributes is identified. Features extracted from these selected attributes are then combined with existing syntactical-based and evolutionary-based features, to show an improvement in the recognition and prediction performance on benchmark datasets. PMID:27164998

  13. cDNA-derived amino-acid sequence of a land turtle (Geochelone carbonaria) beta-chain hemoglobin.

    PubMed

    Bordin, S; Meza, A N; Saad, S T; Ogo, S H; Costa, F F

    1997-06-01

    The cDNA sequence encoding the beta-chain of the turtle Geochelone carbonaria was determined. Hemoglobin mRNA was isolated using PCR with degenerate primers in combination with 5'- and 3'-RACE protocols. The full-length cDNA is 615 bp, with the ATG start codon at position 53 and the TGA stop codon at position 495; the AATAAA polyadenylation signal is found at position 599. The deduced polypeptide contains 146 amino-acid residues. The predicted amino acid sequence shares 83% identity with the beta-globin of a related species, the aquatic turtle C. p. belli. Otherwise, identity is higher when compared with chicken beta-Hb (80%) than with other reptilian orders (Squamata, 69%, and Crocodilia, 61%). Compared with human HbA, there is 67% identity, and at least three amino acid substitutions could be of some functional significance (Glu43 beta-->Ser, His116 beta-->Thr and His143 beta-->Leu). To our knowledge this represents the first cDNA sequence of a reptile globin gene described. PMID:9238523

  14. Plasma long-chain free fatty acids predict mammalian longevity

    PubMed Central

    Jové, Mariona; Naudí, Alba; Aledo, Juan Carlos; Cabré, Rosanna; Ayala, Victoria; Portero-Otin, Manuel; Barja, Gustavo; Pamplona, Reinald

    2013-01-01

    Membrane lipid composition is an important correlate of the rate of aging of animals and, therefore, a determinant of their longevity. In the present work, the use of high-throughput technologies allowed us to determine the plasma lipidomic profile of 11 mammalian species ranging in maximum longevity from 3.5 to 120 years. The non-targeted approach revealed a species-specific lipidomic profile that accurately predicts animal longevity. The regression analysis between lipid species and longevity demonstrated that the longer the longevity of a species, the lower are its plasma long-chain free fatty acid (LC-FFA) concentrations, peroxidizability index, and content of lipid peroxidation-derived products. The inverse association between longevity and LC-FFA persisted after correction for body mass and phylogenetic interdependence. These results indicate that the lipidomic signature is an optimized feature associated with animal longevity, with LC-FFA emerging as a potential biomarker of longevity. PMID:24284984

  15. Linguistic and Spatial Skills Predict Early Arithmetic Development via Counting Sequence Knowledge

    ERIC Educational Resources Information Center

    Zhang, Xiao; Koponen, Tuire; Räsänen, Pekka; Aunola, Kaisa; Lerkkanen, Marja-Kristiina; Nurmi, Jari-Erik

    2014-01-01

    Utilizing a longitudinal sample of Finnish children (ages 6-10), two studies examined how early linguistic (spoken vs. written) and spatial skills predict later development of arithmetic, and whether counting sequence knowledge mediates these associations. In Study 1 (N = 1,880), letter knowledge and spatial visualization, measured in…

  16. Combining Structure and Sequence Information Allows Automated Prediction of Substrate Specificities within Enzyme Families

    PubMed Central

    Röttig, Marc; Rausch, Christian; Kohlbacher, Oliver

    2010-01-01

    An important aspect of the functional annotation of enzymes is not only the type of reaction catalysed by an enzyme, but also the substrate specificity, which can vary widely within the same family. In many cases, prediction of family membership and even substrate specificity is possible from enzyme sequence alone, using a nearest neighbour classification rule. However, the combination of structural information and sequence information can improve the interpretability and accuracy of predictive models. The method presented here, Active Site Classification (ASC), automatically extracts the residues lining the active site from one representative three-dimensional structure and the corresponding residues from sequences of other members of the family. From a set of representatives with known substrate specificity, a Support Vector Machine (SVM) can then learn a model of substrate specificity. Applied to a sequence of unknown specificity, the SVM can then predict the most likely substrate. The models can also be analysed to reveal the underlying structural reasons determining substrate specificities and thus yield valuable insights into mechanisms of enzyme specificity. We illustrate the high prediction accuracy achieved on two benchmark data sets and the structural insights gained from ASC by a detailed analysis of the family of decarboxylating dehydrogenases. The ASC web service is available at http://asc.informatik.uni-tuebingen.de/. PMID:20072606
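
    The nearest neighbour rule mentioned in this abstract can be illustrated with a small sketch. It is not the ASC method (which trains an SVM); it only assumes that each enzyme has already been reduced to a fixed-length string of active-site residues, and the signatures and substrate labels below are invented for illustration.

```python
from typing import Dict


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length residue strings differ."""
    if len(a) != len(b):
        raise ValueError("active-site signatures must have equal length")
    return sum(x != y for x, y in zip(a, b))


def predict_substrate(query: str, references: Dict[str, str]) -> str:
    """Return the substrate label of the reference signature closest to `query`.

    `references` maps an active-site signature to its known substrate.
    Ties are broken by dictionary insertion order.
    """
    best = min(references, key=lambda sig: hamming(query, sig))
    return references[best]


# Hypothetical signatures and labels, not taken from the ASC benchmark sets.
references = {
    "DKYRS": "substrate_X",
    "DKYLV": "substrate_X",
    "ERWLV": "substrate_Y",
}
prediction = predict_substrate("ERWRV", references)
```

    A classifier such as an SVM would replace the `min` lookup with a learned decision function over the same residue positions.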

  17. Predicting Salmonella enterica subsp. enterica Serotypes by Repetitive Extragenic Palindromic Sequence-Based PCR

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The DiversiLab™ System, which employs repetitive extragenic palindromic sequence-based PCR (rep-PCR) to genotype microorganisms, was evaluated as a method to predict the serotype of Salmonella isolates. Two hundred and thirty-three Salmonella isolates belonging to 14 frequently isolated serotypes f...

  18. Applying a Predict-Observe-Explain Sequence in Teaching of Buoyant Force

    ERIC Educational Resources Information Center

    Radovanovic, Jelena; Slisko, Josip

    2013-01-01

    An active learning sequence based on the predict-observe-explain teaching strategy is applied to a lesson on buoyant force. The results obtained clearly justify the use of this teaching method and suggest devising a series of activities to enable more effective removal of students' commonly held alternative conceptions regarding floating and…

  19. Detection and isolation of nucleic acid sequences using competitive hybridization probes

    DOEpatents

    Lucas, Joe N.; Straume, Tore; Bogen, Kenneth T.

    1997-01-01

    A method for detecting a target nucleic acid sequence in a sample is provided using hybridization probes which competitively hybridize to a target nucleic acid. According to the method, a target nucleic acid sequence is hybridized to first and second hybridization probes which are complementary to overlapping portions of the target nucleic acid sequence, the first hybridization probe including a first complexing agent capable of forming a binding pair with a second complexing agent and the second hybridization probe including a detectable marker. The first complexing agent attached to the first hybridization probe is contacted with a second complexing agent, the second complexing agent being attached to a solid support such that when the first and second complexing agents are attached, target nucleic acid sequences hybridized to the first hybridization probe become immobilized on to the solid support. The immobilized target nucleic acids are then separated and detected by detecting the detectable marker attached to the second hybridization probe. A kit for performing the method is also provided.

  20. Detection and isolation of nucleic acid sequences using competitive hybridization probes

    DOEpatents

    Lucas, J.N.; Straume, T.; Bogen, K.T.

    1997-04-01

    A method for detecting a target nucleic acid sequence in a sample is provided using hybridization probes which competitively hybridize to a target nucleic acid. According to the method, a target nucleic acid sequence is hybridized to first and second hybridization probes which are complementary to overlapping portions of the target nucleic acid sequence, the first hybridization probe including a first complexing agent capable of forming a binding pair with a second complexing agent and the second hybridization probe including a detectable marker. The first complexing agent attached to the first hybridization probe is contacted with a second complexing agent, the second complexing agent being attached to a solid support such that when the first and second complexing agents are attached, target nucleic acid sequences hybridized to the first hybridization probe become immobilized on to the solid support. The immobilized target nucleic acids are then separated and detected by detecting the detectable marker attached to the second hybridization probe. A kit for performing the method is also provided. 7 figs.

  1. The outer capsid protein VP4 of equine rotavirus strain H-2 represents a unique VP4 type by amino acid sequence analysis.

    PubMed

    Hardy, M E; Gorziglia, M; Woode, G N

    1993-03-01

    The nucleotide and deduced amino acid sequence of G serotype 3 equine rotavirus strain H-2 was determined. A predicted 776-amino-acid H-2 VP4 shows less than or equal to 85.3% identity to other rotavirus VP4 types sequenced to date and thus represents a new P serotype. A PCR-generated probe derived from a cDNA clone of H-2 gene 4 hybridized to gene 4 of several tissue-culture-adapted equine rotavirus isolates, demonstrating that the gene 4 allele present in the H-2 strain is present in the equine rotavirus population. PMID:8382410

  2. FrameD: a flexible program for quality check and gene prediction in prokaryotic genomes and noisy matured eukaryotic sequences

    PubMed Central

    Schiex, Thomas; Gouzy, Jérôme; Moisan, Annick; de Oliveira, Yannick

    2003-01-01

    We describe FrameD, a program that predicts coding regions in prokaryotic and matured eukaryotic sequences. Initially targeted at gene prediction in GC-rich bacterial genomes, the gene model used in FrameD also allows genes to be predicted in the presence of frameshifts and partially undetermined sequences, which makes it very suitable for gene prediction and frameshift correction in unfinished sequences such as EST and EST cluster sequences. Like recent eukaryotic gene prediction programs, FrameD also includes the ability to take into account protein similarity information both in its prediction and its graphical output. Its performance is evaluated on different bacterial genomes. The web site (http://genopole.toulouse.inra.fr/bioinfo/FrameD/FD) offers direct prediction, sequence correction and translation, as well as the ability to learn new models for new organisms. PMID:12824407

  3. FrameD: A flexible program for quality check and gene prediction in prokaryotic genomes and noisy matured eukaryotic sequences.

    PubMed

    Schiex, Thomas; Gouzy, Jérôme; Moisan, Annick; de Oliveira, Yannick

    2003-07-01

    We describe FrameD, a program that predicts coding regions in prokaryotic and matured eukaryotic sequences. Initially targeted at gene prediction in GC-rich bacterial genomes, the gene model used in FrameD also allows genes to be predicted in the presence of frameshifts and partially undetermined sequences, which makes it very suitable for gene prediction and frameshift correction in unfinished sequences such as EST and EST cluster sequences. Like recent eukaryotic gene prediction programs, FrameD also includes the ability to take into account protein similarity information both in its prediction and its graphical output. Its performance is evaluated on different bacterial genomes. The web site (http://genopole.toulouse.inra.fr/bioinfo/FrameD/FD) offers direct prediction, sequence correction and translation, as well as the ability to learn new models for new organisms. PMID:12824407

  4. Comparative analysis of predicted plastid-targeted proteomes of sequenced higher plant genomes.

    PubMed

    Schaeffer, Scott; Harper, Artemus; Raja, Rajani; Jaiswal, Pankaj; Dhingra, Amit

    2014-01-01

    Plastids are actively involved in numerous plant processes critical to growth, development and adaptation. They play a primary role in photosynthesis, pigment and monoterpene synthesis, gravity sensing, starch and fatty acid synthesis, as well as oil and protein storage. We applied two complementary methods to analyze the recently published apple genome (Malus × domestica) to identify putative plastid-targeted proteins, the first using TargetP and the second using a custom workflow utilizing a set of predictive programs. Apple shares roughly 40% of its 10,492 putative plastid-targeted proteins with the Arabidopsis (Arabidopsis thaliana) plastid-targeted proteome as identified by the Chloroplast 2010 project and ∼57% of its entire proteome with Arabidopsis. This suggests that the plastid-targeted proteomes of apple and Arabidopsis are different, and interestingly alludes to the presence of differential targeting of homologs between the two species. Co-expression analysis of 2,224 genes encoding putative plastid-targeted apple proteins suggests that they play a role in plant developmental and intermediary metabolism. Further, an inter-specific comparison of Arabidopsis, Prunus persica (Peach), Malus × domestica (Apple), Populus trichocarpa (Black cottonwood), Fragaria vesca (Woodland Strawberry), Solanum lycopersicum (Tomato) and Vitis vinifera (Grapevine) also identified a large number of novel species-specific plastid-targeted proteins. This analysis also revealed the presence of alternatively targeted homologs across species. Two separate analyses revealed that a small subset of proteins, one representing 289 protein clusters and the other 737 unique protein sequences, are conserved among seven plastid-targeted angiosperm proteomes. The majority of the novel proteins were annotated to play roles in stress response, transport, catabolic processes, and cellular component organization. Our results suggest that the current state of knowledge regarding

  5. Learning to Predict miRNA-mRNA Interactions from AGO CLIP Sequencing and CLASH Data.

    PubMed

    Lu, Yuheng; Leslie, Christina S

    2016-07-01

    Recent technologies like AGO CLIP sequencing and CLASH enable direct transcriptome-wide identification of AGO binding and miRNA target sites, but the most widely used miRNA target prediction algorithms do not exploit these data. Here we use discriminative learning on AGO CLIP and CLASH interactions to train a novel miRNA target prediction model. Our method combines two SVM classifiers, one to predict miRNA-mRNA duplexes and a second to learn a binding model of AGO's local UTR sequence preferences and positional bias in 3'UTR isoforms. The duplex SVM model enables the prediction of non-canonical target sites and more accurately resolves miRNA interactions from AGO CLIP data than previous methods. The binding model is trained using a multi-task strategy to learn context-specific and common AGO sequence preferences. The duplex and common AGO binding models together outperform existing miRNA target prediction algorithms on held-out binding data. Open source code is available at https://bitbucket.org/leslielab/chimiric. PMID:27438777

  6. Risk Prediction Modeling of Sequencing Data Using a Forward Random Field Method

    PubMed Central

    Wen, Yalu; He, Zihuai; Li, Ming; Lu, Qing

    2016-01-01

    With the advance in high-throughput sequencing technology, it is feasible to investigate the role of common and rare variants in disease risk prediction. While the new technology holds great promise to improve disease prediction, the massive amount of data and low frequency of rare variants pose great analytical challenges on risk prediction modeling. In this paper, we develop a forward random field method (FRF) for risk prediction modeling using sequencing data. In FRF, subjects’ phenotypes are treated as stochastic realizations of a random field on a genetic space formed by subjects’ genotypes, and an individual’s phenotype can be predicted by adjacent subjects with similar genotypes. The FRF method allows for multiple similarity measures and candidate genes in the model, and adaptively chooses the optimal similarity measure and disease-associated genes to reflect the underlying disease model. It also avoids the specification of the threshold of rare variants and allows for different directions and magnitudes of genetic effects. Through simulations, we demonstrate the FRF method attains higher or comparable accuracy over commonly used support vector machine based methods under various disease models. We further illustrate the FRF method with an application to the sequencing data obtained from the Dallas Heart Study. PMID:26892725
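
    The idea that a phenotype can be predicted from subjects with similar genotypes can be sketched as a similarity-weighted average. This is not the FRF method itself, only a minimal illustration; it assumes genotypes coded as minor-allele counts (0, 1, 2) and uses identity-by-state similarity, one of many possible similarity measures. All data below are made up.

```python
from typing import List, Sequence, Tuple


def ibs_similarity(g1: Sequence[int], g2: Sequence[int]) -> float:
    """Identity-by-state similarity in [0, 1] for genotypes coded 0/1/2."""
    if len(g1) != len(g2) or not g1:
        raise ValueError("genotype vectors must be non-empty and equal length")
    shared = sum(2 - abs(a - b) for a, b in zip(g1, g2))
    return shared / (2 * len(g1))


def predict_phenotype(
    target: Sequence[int],
    neighbours: List[Tuple[Sequence[int], float]],
) -> float:
    """Average the neighbours' phenotypes, weighted by genotype similarity."""
    weights = [ibs_similarity(target, genotype) for genotype, _ in neighbours]
    total = sum(weights)
    if total == 0:
        raise ValueError("no neighbour shares any allele with the target")
    return sum(w * y for w, (_, y) in zip(weights, neighbours)) / total


# Toy example: one identical neighbour (phenotype 1.0), one dissimilar (0.0).
neighbours = [([0, 1, 2], 1.0), ([2, 1, 0], 0.0)]
estimate = predict_phenotype([0, 1, 2], neighbours)
```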

  7. NEP: web server for epitope prediction based on antibody neutralization of viral strains with diverse sequences.

    PubMed

    Chuang, Gwo-Yu; Liou, David; Kwong, Peter D; Georgiev, Ivelin S

    2014-07-01

    Delineation of the antigenic site, or epitope, recognized by an antibody can provide clues about functional vulnerabilities and resistance mechanisms, and can therefore guide antibody optimization and epitope-based vaccine design. Previously, we developed an algorithm for antibody-epitope prediction based on antibody neutralization of viral strains with diverse sequences and validated the algorithm on a set of broadly neutralizing HIV-1 antibodies. Here we describe the implementation of this algorithm, NEP (Neutralization-based Epitope Prediction), as a web-based server. The users must supply as input: (i) an alignment of antigen sequences of diverse viral strains; (ii) neutralization data for the antibody of interest against the same set of antigen sequences; and (iii) (optional) a structure of the unbound antigen, for enhanced prediction accuracy. The prediction results can be downloaded or viewed interactively on the antigen structure (if supplied) from the web browser using a JSmol applet. Since neutralization experiments are typically performed as one of the first steps in the characterization of an antibody to determine its breadth and potency, the NEP server can be used to predict antibody-epitope information at no additional experimental costs. NEP can be accessed on the internet at http://exon.niaid.nih.gov/nep. PMID:24782517

  8. Using complete genome comparisons to identify sequences whose presence accurately predicts clinically important phenotypes.

    PubMed

    Hall, Barry G; Cardenas, Heliodoro; Barlow, Miriam

    2013-01-01

    In clinical settings it is often important to know not just the identity of a microorganism, but also the danger posed by that particular strain. For instance, Escherichia coli can range from being a harmless commensal to being a very dangerous enterohemorrhagic (EHEC) strain. Determining pathogenic phenotypes can be both time-consuming and expensive. Here we propose a simple, rapid, and inexpensive method of predicting pathogenic phenotypes on the basis of the presence or absence of short homologous DNA segments in an isolate. Our method compares completely sequenced genomes without the necessity of genome alignments in order to identify the presence or absence of the segments to produce an automatic alignment of the binary string that describes each genome. Analysis of the segment alignment allows identification of those segments whose presence strongly predicts a phenotype. Clinical application of the method requires nothing more than PCR amplification of each of the set of predictive segments. Here we apply the method to identifying EHEC strains of E. coli and to distinguishing E. coli from Shigella. We show in silico that with as few as 8 predictive sequences, if even three of those predictive sequences are amplified the probability of being EHEC or Shigella is >0.99. The method is thus very robust to the occasional amplification failure for spurious reasons. Experimentally, we apply the method to screening a set of 98 isolates to distinguish E. coli from Shigella, and EHEC from non-EHEC E. coli strains, and show that all isolates are correctly identified. PMID:23935901
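
    The presence/absence analysis described here can be sketched as follows. This is a simplified illustration rather than the authors' procedure: it assumes each genome is already encoded as a binary list (1 = segment present) and scores each segment by the fraction of its carriers that show the phenotype. Genome names and values are invented.

```python
from typing import Dict, List


def segment_predictiveness(
    presence: Dict[str, List[int]],
    phenotype: Dict[str, int],
) -> List[float]:
    """For each segment, estimate P(phenotype = 1 | segment present).

    `presence` maps a genome name to its binary presence/absence string;
    `phenotype` maps the same names to 0 or 1. Segments carried by no
    genome get a score of 0.0.
    """
    n_segments = len(next(iter(presence.values())))
    scores = []
    for j in range(n_segments):
        carriers = [name for name, bits in presence.items() if bits[j] == 1]
        if not carriers:
            scores.append(0.0)
            continue
        positives = sum(phenotype[name] for name in carriers)
        scores.append(positives / len(carriers))
    return scores


# Toy data: four genomes, three candidate segments.
presence = {
    "genome_A": [1, 1, 0],
    "genome_B": [1, 0, 0],
    "genome_C": [0, 1, 1],
    "genome_D": [0, 1, 0],
}
phenotype = {"genome_A": 1, "genome_B": 1, "genome_C": 0, "genome_D": 0}
scores = segment_predictiveness(presence, phenotype)
```

    In this toy set only the first segment is carried exclusively by genomes with the phenotype, so it would be the one worth targeting by PCR.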

  9. Learning to Predict miRNA-mRNA Interactions from AGO CLIP Sequencing and CLASH Data

    PubMed Central

    Lu, Yuheng; Leslie, Christina S.

    2016-01-01

    Recent technologies like AGO CLIP sequencing and CLASH enable direct transcriptome-wide identification of AGO binding and miRNA target sites, but the most widely used miRNA target prediction algorithms do not exploit these data. Here we use discriminative learning on AGO CLIP and CLASH interactions to train a novel miRNA target prediction model. Our method combines two SVM classifiers, one to predict miRNA-mRNA duplexes and a second to learn a binding model of AGO’s local UTR sequence preferences and positional bias in 3’UTR isoforms. The duplex SVM model enables the prediction of non-canonical target sites and more accurately resolves miRNA interactions from AGO CLIP data than previous methods. The binding model is trained using a multi-task strategy to learn context-specific and common AGO sequence preferences. The duplex and common AGO binding models together outperform existing miRNA target prediction algorithms on held-out binding data. Open source code is available at https://bitbucket.org/leslielab/chimiric. PMID:27438777

  10. Conservation of Shannon's redundancy for proteins. [information theory applied to amino acid sequences]

    NASA Technical Reports Server (NTRS)

    Gatlin, L. L.

    1974-01-01

    Concepts of information theory are applied to examine various proteins in terms of their redundancy in natural originators such as animals and plants. The Monte Carlo method is used to derive information parameters for random protein sequences. Real protein sequence parameters are compared with the standard parameters of protein sequences having a specific length. The tendency of a chain to contain some amino acids more frequently than others and the tendency of a chain to contain certain amino acid pairs more frequently than other pairs are used as randomness measures of individual protein sequences. Non-periodic proteins are generally found to have random Shannon redundancies except in cases of constraints due to short chain length and genetic codes. Redundant characteristics of highly periodic proteins are discussed. A degree of periodicity parameter is derived.
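
    The first of the randomness measures described above, the tendency of a chain to use some amino acids more often than others, can be illustrated with single-residue Shannon entropy. This is a minimal sketch under the assumption of a 20-letter alphabet; it ignores the pair-frequency term and the Monte Carlo baseline that the record also mentions.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Single-residue Shannon entropy of a sequence, in bits per residue."""
    if not seq:
        raise ValueError("sequence must not be empty")
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())


def redundancy(seq: str, alphabet_size: int = 20) -> float:
    """Redundancy R = 1 - H / H_max, with H_max = log2(alphabet_size).

    R is 0 when all residues are equally frequent and 1 when the chain
    consists of a single repeated residue.
    """
    return 1.0 - shannon_entropy(seq) / math.log2(alphabet_size)


uniform = "ACDEFGHIKLMNPQRSTVWY"   # each of the 20 amino acids once
homopolymer = "AAAA"
```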

  11. Site-directed gene mutation at mixed sequence targets by psoralen-conjugated pseudo-complementary peptide nucleic acids.

    PubMed

    Kim, Ki-Hyun; Nielsen, Peter E; Glazer, Peter M

    2007-01-01

    Sequence-specific DNA-binding molecules such as triple helix-forming oligonucleotides (TFOs) provide a means for inducing site-specific mutagenesis and recombination at chromosomal sites in mammalian cells. However, the utility of TFOs is limited by the requirement for homopurine stretches in the target duplex DNA. Here, we report the use of pseudo-complementary peptide nucleic acids (pcPNAs) for intracellular gene targeting at mixed sequence sites. Due to steric hindrance, pcPNAs are unable to form pcPNA-pcPNA duplexes but can bind to complementary DNA sequences by Watson-Crick pairing via double duplex-invasion complex formation. We show that psoralen-conjugated pcPNAs can deliver site-specific photoadducts and mediate targeted gene modification within both episomal and chromosomal DNA in mammalian cells without detectable off-target effects. Most of the induced psoralen-pcPNA mutations were single-base substitutions and deletions at the predicted pcPNA-binding sites. The pcPNA-directed mutagenesis was found to be dependent on PNA concentration and UVA dose and required matched pairs of pcPNAs. Neither of the individual pcPNAs alone had any effect nor did complementary PNA pairs of the same sequence. These results identify pcPNAs as new tools for site-specific gene modification in mammalian cells without purine sequence restriction, thereby providing a general strategy for designing gene targeting molecules. PMID:17977869

  12. Partner-Aware Prediction of Interacting Residues in Protein-Protein Complexes from Sequence Data

    PubMed Central

    Ahmad, Shandar; Mizuguchi, Kenji

    2011-01-01

    Computational prediction of residues that participate in protein-protein interactions is a difficult task, and state-of-the-art methods have shown only limited success in this arena. One possible problem with these methods is that they try to predict interacting residues without incorporating information about the partner protein, although it is unclear how much partner information could enhance prediction performance. To address this issue, the following two comparisons are of crucial significance: (a) comparison between the predictability of inter-protein residue pairs, i.e., predicting exactly which residue pairs interact with each other given two protein sequences; this can be achieved by either combining conventional single-protein predictions or making predictions using a new model trained directly on the residue pairs, and the performance of these two approaches may be compared; (b) comparison between the predictability of the interacting residues in a single protein (irrespective of the partner residue or protein) from conventional methods and predictions converted from the pair-wise trained model. Using these two streams of training and validation procedures and employing similar two-stage neural networks, we showed that the models trained on pair-wise contacts outperformed the partner-unaware models in predicting both interacting pairs and interacting single-protein residues. Prediction performance decreased with the size of the conformational change upon complex formation; this trend is similar to docking, even though no structural information was used in our prediction. An example application that predicts two partner-specific interfaces of a protein was shown to be effective, highlighting the potential of the proposed approach. Finally, a preliminary attempt was made to score docking decoy poses using prediction of interacting residue pairs; this analysis produced an encouraging result. PMID:22194998

  13. DFLpred: High-throughput prediction of disordered flexible linker regions in protein sequences

    PubMed Central

    Meng, Fanchi; Kurgan, Lukasz

    2016-01-01

    Motivation: Disordered flexible linkers (DFLs) are disordered regions that serve as flexible linkers/spacers in multi-domain proteins or between structured constituents in domains. They are different from flexible linkers/residues because they are disordered and longer. Availability of experimentally annotated DFLs provides an opportunity to build high-throughput computational predictors of these regions from protein sequences. To date, there are no computational methods that directly predict DFLs and they can be found only indirectly by filtering predicted flexible residues with predictions of disorder. Results: We conceptualized, developed and empirically assessed a first-of-its-kind sequence-based predictor of DFLs, DFLpred. This method outputs propensity to form DFLs for each residue in the input sequence. DFLpred uses a small set of empirically selected features that quantify propensities to form certain secondary structures, disordered regions and structured regions, which are processed by a fast linear model. Our high-throughput predictor can be used on the whole-proteome scale; it needs <1 h to predict an entire proteome on a single CPU. When assessed on an independent test dataset with low sequence-identity proteins, it achieves an area under the receiver operating characteristic curve of 0.715 and outperforms existing alternatives that include methods for the prediction of flexible linkers, flexible residues, intrinsically disordered residues and various combinations of these methods. Prediction on the complete human proteome reveals that about 10% of proteins have a large content of over 30% DFL residues. We also estimate that about 6000 DFL regions are long with ≥30 consecutive residues. Availability and implementation: http://biomine.ece.ualberta.ca/DFLpred/. Contact: lkurgan@vcu.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:27307636
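
    The area under the ROC curve quoted for per-residue propensities can be computed directly from its rank interpretation: the probability that a randomly chosen positive residue scores higher than a randomly chosen negative one. The sketch below is a generic illustration with invented labels and scores, not DFLpred output.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical per-residue annotations (1 = linker) and propensities.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
auc = roc_auc(labels, scores)
```

    The pairwise loop is quadratic in the number of residues; for proteome-scale data a rank-based formulation would be preferable.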

  14. Improved sequence-based prediction of disordered regions with multilayer fusion of multiple information sources

    PubMed Central

    Mizianty, Marcin J.; Stach, Wojciech; Chen, Ke; Kedarisetti, Kanaka Durga; Disfani, Fatemeh Miri; Kurgan, Lukasz

    2010-01-01

    Motivation: Intrinsically disordered proteins play a crucial role in numerous regulatory processes. Their abundance and ubiquity combined with a relatively low quantity of their annotations motivate research toward the development of computational models that predict disordered regions from protein sequences. Although the prediction quality of these methods continues to rise, novel and improved predictors are urgently needed. Results: We propose a novel method, named MFDp (Multilayered Fusion-based Disorder predictor), that aims to improve over the current disorder predictors. MFDp is an ensemble of 3 Support Vector Machines specialized for the prediction of short, long and generic disordered regions. It combines three complementary disorder predictors, sequence, sequence profiles, predicted secondary structure, solvent accessibility, backbone dihedral torsion angles, residue flexibility and B-factors. Our method utilizes a custom-designed set of features that are based on raw predictions and aggregated raw values and recognizes various types of disorder. The MFDp is compared at the residue level on two datasets against eight recent disorder predictors and top-performing methods from the most recent CASP8 experiment. In spite of using training chains with ≤25% similarity to the test sequences, our method consistently and significantly outperforms the other methods based on the MCC index. The MFDp outperforms modern disorder predictors for the binary disorder assignment and provides competitive real-valued predictions. The MFDp's outputs are also shown to outperform the other methods in the identification of proteins with long disordered regions. Availability: http://biomine.ece.ualberta.ca/MFDp.html Supplementary information: Supplementary data are available at Bioinformatics online. Contact: lkurgan@ece.ualberta.ca PMID:20823312

  15. Conversion of amino-acid sequence in proteins to classical music: search for auditory patterns

    PubMed Central

    2007-01-01

    We have converted genome-encoded protein sequences into musical notes to reveal auditory patterns without compromising musicality. We derived a reduced range of 13 base notes by pairing similar amino acids and distinguishing them using variations of three-note chords and codon distribution to dictate rhythm. The conversion will help make genomic coding sequences more approachable for the general public, young children, and vision-impaired scientists. PMID:17477882

  16. Ab initio detection of fuzzy amino acid tandem repeats in protein sequences

    PubMed Central

    2012-01-01

    Background Tandem repetitions within protein amino acid sequences often correspond to regular secondary structures and form multi-repeat 3D assemblies of varied size and function. Developing internal repetitions is one of the evolutionary mechanisms that proteins employ to adapt their structure and function under evolutionary pressure. While there is keen interest in understanding such phenomena, detection of repeating structures based only on sequence analysis is considered an arduous task, since structure and function are often preserved even under considerable sequence divergence (fuzzy tandem repeats). Results In this paper we present PTRStalker, a new algorithm for ab-initio detection of fuzzy tandem repeats in protein amino acid sequences. In the reported results we show that by feeding PTRStalker with amino acid sequences from the UniProtKB/Swiss-Prot database we detect novel tandemly repeated structures not captured by other state-of-the-art tools. Experiments with membrane proteins indicate that PTRStalker can detect global symmetries in the primary structure which are then reflected in the tertiary structure. Conclusions PTRStalker is able to detect fuzzy tandem repeating structures in protein sequences, with performance beyond the current state of the art. Such a tool may be a valuable support to investigating protein structural properties when tertiary X-ray data is not available. PMID:22536906

  17. Multimodal phylogeny for taxonomy: integrating information from nucleotide and amino acid sequences.

    PubMed

    Bicego, Manuele; Dellaglio, Franco; Felis, Giovanna E

    2007-10-01

    The crucial role played by the analysis of microbial diversity in biotechnology-based innovations has increased the interest in the microbial taxonomy research area. Phylogenetic sequence analyses have contributed significantly to the advances in this field, also in view of the large amount of sequence data collected in recent years. Phylogenetic analyses can be based on protein-encoding nucleotide sequences or on the encoded amino acid sequences: these two approaches have different characteristics, although they start from two alternative representations of the same information. This complementarity could be exploited to achieve a multimodal phylogenetic scheme that is able to integrate gene and protein information in order to produce a single final tree. This aspect has been poorly addressed in the literature. In this paper, we propose to integrate the two phylogenetic analyses using basic schemes derived from the multimodality fusion theory (or multiclassifier systems theory), a well-founded and rigorous field whose power has already been demonstrated in other pattern recognition contexts. The proposed approach could be applied to distance matrix-based phylogenetic techniques (like neighbor joining), resulting in a smart and fast method. The proposed methodology has been tested in a real case involving sequences of some species of lactic acid bacteria. With this dataset, both nucleotide sequence- and amino acid sequence-based phylogenetic analyses present some drawbacks, which are overcome with the multimodal analysis. PMID:17933011
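
    One basic fusion scheme for distance matrix-based methods is to rescale the nucleotide-based and amino acid-based distance matrices to a common range and average them before building a single tree. The sketch below illustrates that idea only; the abstract does not say which fusion rules the authors used, so the mean rule, the max-scaling step, the function names and the toy matrices are all assumptions.

```python
def normalize(matrix):
    """Scale a distance matrix so its largest entry is 1 (unchanged if all zero)."""
    peak = max(max(row) for row in matrix)
    if peak == 0:
        return [row[:] for row in matrix]
    return [[d / peak for d in row] for row in matrix]


def fuse_mean(dna_dist, protein_dist):
    """Average two same-sized distance matrices after rescaling each one."""
    if len(dna_dist) != len(protein_dist):
        raise ValueError("matrices must cover the same taxa")
    a, b = normalize(dna_dist), normalize(protein_dist)
    n = len(a)
    return [[(a[i][j] + b[i][j]) / 2 for j in range(n)] for i in range(n)]


# Toy pairwise distances for three taxa, made up for illustration only.
dna = [[0, 2, 4], [2, 0, 4], [4, 4, 0]]
protein = [[0, 1, 1], [1, 0, 2], [1, 2, 0]]
fused = fuse_mean(dna, protein)
```

    The fused matrix can then be passed to any distance-based tree builder such as neighbor joining, so a single final tree reflects both sources.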

  18. The amino-acid sequence of leghemoglobin component a from Phaseolus vulgaris (kidney bean).

    PubMed

    Lehtovaara, P; Ellfolk, N

    1975-06-01

    1. Leghemoglobin component a from Phaseolus vulgaris (kidney bean) was digested with trypsin; 15 tryptic peptides and free lysine were purified and the amino acid sequences of the peptides determined. 2. The internal order of the tryptic peptides was determined by the bridge peptides obtained from the thermolytic digest and the dilute acid hydrolyzate of kidney bean leghemoglobin a; 12 thermolytic peptides and two acid hydrolysis peptides were purified and the sequences were partially or completely determined. 3. The complete amino acid sequence of kidney bean leghemoglobin a is compared to that of leghemoglobin a from soybean (Glycine max) and to some animal globins. As regards sequence, the kidney bean globin has 79% identity with the soybean globin and 21% identity with human hemoglobin gamma-chain. Seven of the 14 amino acid residues common to most globins are found in the kidney bean globin. Trp-15 and Tyr-145 are evolutionarily conserved in this globin, which confirms the concept of a common origin of animal and plant globins. PMID:809270

  19. NMRDSP: an accurate prediction of protein shape strings from NMR chemical shifts and sequence data.

    PubMed

    Mao, Wusong; Cong, Peisheng; Wang, Zhiheng; Lu, Longjian; Zhu, Zhongliang; Li, Tonghua

    2013-01-01

    Shape string is structural sequence and is an extremely important structure representation of protein backbone conformations. Nuclear magnetic resonance chemical shifts give a strong correlation with the local protein structure, and are exploited to predict protein structures in conjunction with computational approaches. Here we demonstrate a novel approach, NMRDSP, which can accurately predict the protein shape string based on nuclear magnetic resonance chemical shifts and structural profiles obtained from sequence data. The NMRDSP uses six chemical shifts (HA, H, N, CA, CB and C) and eight elements of structure profiles as features, a non-redundant set (1,003 entries) as the training set, and a conditional random field as a classification algorithm. For an independent testing set (203 entries), we achieved an accuracy of 75.8% for S8 (the eight states accuracy) and 87.8% for S3 (the three states accuracy). This is higher than only using chemical shifts or sequence data, and confirms that the chemical shift and the structure profile are significant features for shape string prediction and their combination prominently improves the accuracy of the predictor. We have constructed the NMRDSP web server and believe it could be employed to provide a solid platform to predict other protein structures and functions. The NMRDSP web server is freely available at http://cal.tongji.edu.cn/NMRDSP/index.jsp. PMID:24376713

  20. NMRDSP: An Accurate Prediction of Protein Shape Strings from NMR Chemical Shifts and Sequence Data

    PubMed Central

    Mao, Wusong; Cong, Peisheng; Wang, Zhiheng; Lu, Longjian; Zhu, Zhongliang; Li, Tonghua

    2013-01-01

    Shape string is structural sequence and is an extremely important structure representation of protein backbone conformations. Nuclear magnetic resonance chemical shifts give a strong correlation with the local protein structure, and are exploited to predict protein structures in conjunction with computational approaches. Here we demonstrate a novel approach, NMRDSP, which can accurately predict the protein shape string based on nuclear magnetic resonance chemical shifts and structural profiles obtained from sequence data. The NMRDSP uses six chemical shifts (HA, H, N, CA, CB and C) and eight elements of structure profiles as features, a non-redundant set (1,003 entries) as the training set, and a conditional random field as a classification algorithm. For an independent testing set (203 entries), we achieved an accuracy of 75.8% for S8 (the eight states accuracy) and 87.8% for S3 (the three states accuracy). This is higher than only using chemical shifts or sequence data, and confirms that the chemical shift and the structure profile are significant features for shape string prediction and their combination prominently improves the accuracy of the predictor. We have constructed the NMRDSP web server and believe it could be employed to provide a solid platform to predict other protein structures and functions. The NMRDSP web server is freely available at http://cal.tongji.edu.cn/NMRDSP/index.jsp. PMID:24376713

  1. PDP-CON: prediction of domain/linker residues in protein sequences using a consensus approach.

    PubMed

    Chatterjee, Piyali; Basu, Subhadip; Zubek, Julian; Kundu, Mahantapas; Nasipuri, Mita; Plewczynski, Dariusz

    2016-04-01

    The prediction of domain/linker residues in protein sequences is a crucial task in the functional classification of proteins, homology-based protein structure prediction, and high-throughput structural genomics. In this work, a novel consensus-based machine-learning technique was applied for residue-level prediction of the domain/linker annotations in protein sequences using ordered/disordered regions along protein chains and a set of physicochemical properties. Six different classifiers (decision tree, Gaussian naïve Bayes, linear discriminant analysis, support vector machine, random forest, and multilayer perceptron) were exhaustively explored for the residue-level prediction of domain/linker regions. The protein sequences from the curated CATH database were used for training and cross-validation experiments. Test results obtained by applying the developed PDP-CON tool to the mutually exclusive, independent proteins of the CASP-8, CASP-9, and CASP-10 databases are reported. An n-star quality consensus approach was used to combine the results yielded by different classifiers. The average PDP-CON accuracy and F-measure values for the CASP targets were found to be 0.86 and 0.91, respectively. The dataset, source code, and all supplementary materials for this work are available at https://cmaterju.org/cmaterbioinfo/ for noncommercial use. PMID:26969678
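
    A consensus over several residue-level classifiers can be sketched as a vote threshold: a residue receives the positive label when at least n classifiers agree. Reading "n-star" as "at least n agreeing classifiers" is an assumption made here for illustration; the abstract does not define the rule, and the function name and toy votes below are hypothetical rather than part of PDP-CON.

```python
def consensus(votes_per_classifier, n):
    """Label residue i as 1 when at least n classifiers predict 1 for it."""
    length = len(votes_per_classifier[0])
    if any(len(votes) != length for votes in votes_per_classifier):
        raise ValueError("all classifiers must label the same residues")
    return [
        1 if sum(votes[i] for votes in votes_per_classifier) >= n else 0
        for i in range(length)
    ]


# Toy per-residue votes from three classifiers (1 = linker, 0 = domain),
# made up for illustration only.
votes = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
]
two_star = consensus(votes, 2)
three_star = consensus(votes, 3)
```

    Raising n trades recall for precision: fewer residues are called positive, but each call is backed by more classifiers.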

  2. A classification of glycosyl hydrolases based on amino acid sequence similarities.

    PubMed Central

    Henrissat, B

    1991-01-01

    The amino acid sequences of 301 glycosyl hydrolases and related enzymes have been compared. A total of 291 sequences corresponding to 39 EC entries could be classified into 35 families. Only ten sequences (less than 5% of the sample) could not be assigned to any family. With the sequences available for this analysis, 18 families were found to be monospecific (containing only one EC number) and 17 were found to be polyspecific (containing at least two EC numbers). Implications on the folding characteristics and mechanism of action of these enzymes and on the evolution of carbohydrate metabolism are discussed. With the steady increase in sequence and structural data, it is suggested that the enzyme classification system should perhaps be revised. PMID:1747104

  3. New families in the classification of glycosyl hydrolases based on amino acid sequence similarities.

    PubMed Central

    Henrissat, B; Bairoch, A

    1993-01-01

    301 glycosyl hydrolases and related enzymes corresponding to 39 EC entries of the I.U.B. classification system have been classified into 35 families on the basis of amino-acid-sequence similarities [Henrissat (1991) Biochem. J. 280, 309-316]. Approximately half of the families were found to be monospecific (containing only one EC number), whereas the other half were found to be polyspecific (containing at least two EC numbers). A > 60% increase in sequence data for glycosyl hydrolases (181 additional enzymes or enzyme domains sequences have since become available) allowed us to update the classification not only by the addition of more members to already identified families, but also by the finding of ten new families. On the basis of a comparison of 482 sequences corresponding to 52 EC entries, 45 families, out of which 22 are polyspecific, can now be defined. This classification has been implemented in the SWISS-PROT protein sequence data bank. PMID:8352747

  4. Sequence-specific purification of nucleic acids by PNA-controlled hybrid selection.

    PubMed

    Orum, H; Nielsen, P E; Jørgensen, M; Larsson, C; Stanley, C; Koch, T

    1995-09-01

    Using an oligohistidine peptide nucleic acids (oligohistidine-PNA) chimera, we have developed a rapid hybrid selection method that allows efficient, sequence-specific purification of a target nucleic acid. The method exploits two fundamental features of PNA. First, that PNA binds with high affinity and specificity to its complementary nucleic acid. Second, that amino acids are easily attached to the PNA oligomer during synthesis. We show that a (His)6-PNA chimera exhibits strong binding to chelated Ni2+ ions without compromising its native PNA hybridization properties. We further show that these characteristics allow the (His)6-PNA/DNA complex to be purified by the well-established method of metal ion affinity chromatography using a Ni2+-NTA (nitrilotriacetic acid) resin. Specificity and efficiency are the touchstones of any nucleic acid purification scheme. We show that the specificity of the (His)6-PNA selection approach is such that oligonucleotides differing by only a single nucleotide can be selectively purified. We also show that large RNAs (2224 nucleotides) can be captured with high efficiency by using multiple (His)6-PNA probes. PNA can hybridize to nucleic acids in low-salt concentrations that destabilize native nucleic acid structures. We demonstrate that this property of PNA can be utilized to purify an oligonucleotide in which the target sequence forms part of an intramolecular stem/loop structure. PMID:7495562

  5. Structural protein descriptors in 1-dimension and their sequence-based predictions.

    PubMed

    Kurgan, Lukasz; Disfani, Fatemeh Miri

    2011-09-01

    The last few decades have seen an increasing interest in development and application of 1-dimensional (1D) descriptors of protein structure. These descriptors project 3D structural features onto 1D strings of residue-wise structural assignments. They cover a wide range of structural aspects including conformation of the backbone, burying depth/solvent exposure and flexibility of residues, and inter-chain residue-residue contacts. We perform a first-of-its-kind comprehensive comparative review of the existing 1D structural descriptors. We define, review and categorize ten structural descriptors and we also describe, summarize and contrast over eighty computational models that are used to predict these descriptors from the protein sequences. We show that the majority of the recent sequence-based predictors utilize machine learning models, with the most popular being neural networks, support vector machines, hidden Markov models, and support vector and linear regressions. These methods provide high-throughput predictions and most of them are accessible to a non-expert user via web servers and/or stand-alone software packages. We empirically evaluate several recent sequence-based predictors of secondary structure, disorder, and solvent accessibility descriptors using a benchmark set based on CASP8 targets. Our analysis shows that the secondary structure can be predicted with over 80% accuracy and segment overlap (SOV), disorder with over 0.9 AUC, 0.6 Matthews Correlation Coefficient (MCC), and 75% SOV, and relative solvent accessibility with PCC of 0.7 and MCC of 0.6 (0.86 when homology is used). We demonstrate that the secondary structure predicted from sequence without the use of homology modeling is as good as the structure extracted from the 3D folds predicted by top-performing template-based methods. PMID:21787299

  6. In silico comparative analysis of DNA and amino acid sequences for prion protein gene.

    PubMed

    Kim, Y; Lee, J; Lee, C

    2008-01-01

    Genetic variability might contribute to species specificity of prion diseases in various organisms. In this study, structures of the prion protein gene (PRNP) and its amino acids were compared among species for which sequence data were available. Comparisons of PRNP DNA sequences among 12 species including human, chimpanzee, monkey, bovine, ovine, dog, mouse, rat, wallaby, opossum, chicken and zebrafish allowed us to identify candidate regulatory regions in intron 1 and 3'-untranslated region (UTR) in addition to the coding region. Highly conserved putative binding sites for transcription factors, such as heat shock factor 2 (HSF2) and myocyte enhancer factor 2 (MEF2), were discovered in intron 1. In the 3'-UTR, the functional sequence (ATTAAA) for nucleus-specific polyadenylation was found in all the analysed species. The functional sequence (TTTTTAT) for maturation-specific polyadenylation was observed without mismatch only in ovine, and with one or two nucleotide mismatches in the other species. A comparison of the amino acid sequences in 53 species revealed a large sequence identity. In particular, the octapeptide repeat region was observed in all the species but frog and zebrafish. Functional changes and susceptibility to prion diseases with various isoforms of prion protein could be caused by numeric variability and conformational changes discovered in the repeat sequences. PMID:18397498
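
    Searching a 3'-UTR for a polyadenylation signal such as ATTAAA or TTTTTAT, while tolerating one or two nucleotide mismatches, can be sketched as a sliding-window scan scored by Hamming distance. The function names and the toy UTR string below are illustrative assumptions; the abstract does not describe the software the authors used.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_motif(sequence, motif, max_mismatches=0):
    """Return (position, window, mismatches) for every window within the limit."""
    k = len(motif)
    hits = []
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        distance = hamming(window, motif)
        if distance <= max_mismatches:
            hits.append((i, window, distance))
    return hits


# Toy sequence, made up for illustration only (not a real PRNP 3'-UTR).
utr = "GCATTAAAGGTTTTCATCC"
exact_hits = find_motif(utr, "ATTAAA")
fuzzy_hits = find_motif(utr, "TTTTTAT", max_mismatches=1)
```

    Positions are 0-based offsets into the sequence. Running the same scan per species with max_mismatches set to 0, 1 and 2 gives the kind of exact-versus-mismatched tally the abstract reports.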

  7. The Intolerance of Regulatory Sequence to Genetic Variation Predicts Gene Dosage Sensitivity.

    PubMed

    Petrovski, Slavé; Gussow, Ayal B; Wang, Quanli; Halvorsen, Matt; Han, Yujun; Weir, William H; Allen, Andrew S; Goldstein, David B

    2015-09-01

    Noncoding sequence contains pathogenic mutations. Yet, compared with mutations in protein-coding sequence, pathogenic regulatory mutations are notoriously difficult to recognize. Most fundamentally, we are not yet adept at recognizing the sequence stretches in the human genome that are most important in regulating the expression of genes. For this reason, it is difficult to apply to the regulatory regions the same kinds of analytical paradigms that are being successfully applied to identify mutations among protein-coding regions that influence risk. To determine whether dosage sensitive genes have distinct patterns among their noncoding sequence, we present two primary approaches that focus solely on a gene's proximal noncoding regulatory sequence. The first approach is a regulatory sequence analogue of the recently introduced residual variation intolerance score (RVIS), termed noncoding RVIS, or ncRVIS. The ncRVIS compares observed and predicted levels of standing variation in the regulatory sequence of human genes. The second approach, termed ncGERP, reflects the phylogenetic conservation of a gene's regulatory sequence using GERP++. We assess how well these two approaches correlate with four gene lists that use different ways to identify genes known or likely to cause disease through changes in expression: 1) genes that are known to cause disease through haploinsufficiency, 2) genes curated as dosage sensitive in ClinGen's Genome Dosage Map, 3) genes judged likely to be under purifying selection for mutations that change expression levels because they are statistically depleted of loss-of-function variants in the general population, and 4) genes judged unlikely to cause disease based on the presence of copy number variants in the general population. We find that both noncoding scores are highly predictive of dosage sensitivity using any of these criteria. In a similar way to ncGERP, we assess two ensemble-based predictors of regional noncoding importance, nc
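
    A score that compares observed with predicted levels of standing variation can be sketched as the standardized residual of a regression. The code below fits ordinary least squares of an observed count on a predictor and rescales the residuals; the choice of predictor, the population standard deviation used for scaling, the function name and the toy numbers are all assumptions for illustration and are not the published ncRVIS procedure.

```python
import statistics


def residual_scores(predictor, observed):
    """Standardized residuals of an ordinary least-squares fit observed ~ a + b * predictor."""
    mean_x = statistics.fmean(predictor)
    mean_y = statistics.fmean(observed)
    sxx = sum((x - mean_x) ** 2 for x in predictor)
    if sxx == 0:
        raise ValueError("predictor values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(predictor, observed))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(predictor, observed)]
    spread = statistics.pstdev(residuals)
    if spread == 0:
        return [0.0 for _ in residuals]
    return [r / spread for r in residuals]


# Toy per-gene counts, made up for illustration only.
total_variants = [10, 20, 30, 40]   # hypothetical predictor
common_variants = [5, 9, 16, 20]    # hypothetical observed value
scores = residual_scores(total_variants, common_variants)
```

    Under this reading, a negative score marks a gene whose regulatory sequence carries less variation than the fit predicts, which is the direction associated with intolerance.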

  8. The Intolerance of Regulatory Sequence to Genetic Variation Predicts Gene Dosage Sensitivity

    PubMed Central

    Wang, Quanli; Halvorsen, Matt; Han, Yujun; Weir, William H.; Allen, Andrew S.; Goldstein, David B.

    2015-01-01

    Noncoding sequence contains pathogenic mutations. Yet, compared with mutations in protein-coding sequence, pathogenic regulatory mutations are notoriously difficult to recognize. Most fundamentally, we are not yet adept at recognizing the sequence stretches in the human genome that are most important in regulating the expression of genes. For this reason, it is difficult to apply to the regulatory regions the same kinds of analytical paradigms that are being successfully applied to identify mutations among protein-coding regions that influence risk. To determine whether dosage sensitive genes have distinct patterns among their noncoding sequence, we present two primary approaches that focus solely on a gene’s proximal noncoding regulatory sequence. The first approach is a regulatory sequence analogue of the recently introduced residual variation intolerance score (RVIS), termed noncoding RVIS, or ncRVIS. The ncRVIS compares observed and predicted levels of standing variation in the regulatory sequence of human genes. The second approach, termed ncGERP, reflects the phylogenetic conservation of a gene’s regulatory sequence using GERP++. We assess how well these two approaches correlate with four gene lists that use different ways to identify genes known or likely to cause disease through changes in expression: 1) genes that are known to cause disease through haploinsufficiency, 2) genes curated as dosage sensitive in ClinGen’s Genome Dosage Map, 3) genes judged likely to be under purifying selection for mutations that change expression levels because they are statistically depleted of loss-of-function variants in the general population, and 4) genes judged unlikely to cause disease based on the presence of copy number variants in the general population. We find that both noncoding scores are highly predictive of dosage sensitivity using any of these criteria. In a similar way to ncGERP, we assess two ensemble-based predictors of regional noncoding importance

  9. Antibody-specific model of amino acid substitution for immunological inferences from alignments of antibody sequences.

    PubMed

    Mirsky, Alexander; Kazandjian, Linda; Anisimova, Maria

    2015-03-01

    Antibodies are glycoproteins produced by the immune system as a dynamically adaptive line of defense against invading pathogens. Very elegant and specific mutational mechanisms allow B lymphocytes to produce a large and diversified repertoire of antibodies, which is modified and enhanced throughout all adulthood. One of these mechanisms is somatic hypermutation, which stochastically mutates nucleotides in the antibody genes, forming new sequences with different properties and, eventually, higher affinity and selectivity to the pathogenic target. As somatic hypermutation involves fast mutation of antibody sequences, this process can be described using a Markov substitution model of molecular evolution. Here, using large sets of antibody sequences from mice and humans, we infer an empirical amino acid substitution model AB, which is specific to antibody sequences. Compared with existing general amino acid models, we show that the AB model provides significantly better description for the somatic evolution of mice and human antibody sequences, as demonstrated on large next generation sequencing (NGS) antibody data. General amino acid models are reflective of conservation at the protein level due to functional constraints, with most frequent amino acids exchanges taking place between residues with the same or similar physicochemical properties. In contrast, within the variable part of antibody sequences we observed an elevated frequency of exchanges between amino acids with distinct physicochemical properties. This is indicative of a sui generis mutational mechanism, specific to antibody somatic hypermutation. We illustrate this property of antibody sequences by a comparative analysis of the network modularity implied by the AB model and general amino acid substitution models. We recommend using the new model for computational studies of antibody sequence maturation, including inference of alignments and phylogenetic trees describing antibody somatic hypermutation in

  10. Noise occlusion in discrete tone sequences as a tool towards auditory predictive processing?

    PubMed

    Bendixen, Alexandra; Duwe, Susann; Reiche, Martin

    2015-11-11

    The notion of predictive coding is a common feature of many theories of auditory information processing. Experimental demonstrations of predictive auditory processing often rest on omitting predictable input in order to uncover the prediction made by the brain. Findings show that auditory cortical activity elicited by the omission of a predictable tone resembles the activity elicited by the actual tone. Here we attempted to extend this approach towards using noises instead of omissions in order to capture a more prevalent case of degraded sensory input. By applying a subtraction approach to remove ERP effects of the noise itself, auditory cortical activity elicited "behind" the noise was uncovered. We hypothesized that ERPs elicited behind noise stimuli covering predictable tones should be more similar to ERPs elicited by the actual tones than when the same comparison is made for unpredictable tones. ERP results during passive listening partly confirm this hypothesis, but also point towards some methodological caveats in this particular approach towards studying neural correlates of predictive auditory processing due to contributions from predictability-unrelated factors. A follow-up active listening condition indicated that participants were not more likely to perceive the tone sequence as continuous when a predictable tone was covered with noise than when this pertained to an unpredictable tone. Overall, the noise-based paradigm in its present form was not shown to be successful in revealing predictive processing in perceptual judgments or early neural correlates of sound processing. We discuss these findings in the contexts of predictive processing and illusory auditory continuity. This article is part of a Special Issue entitled SI: Prediction and Attention. PMID:26187755

  11. Complete amino acid sequence of human plasma Zn-α2-glycoprotein and its homology to histocompatibility antigens

    SciTech Connect

    Araki, T.; Gejyo, F.; Takagaki, K.; Haupt, H.; Schwick, H.G.; Buergi, W.; Marti, T.; Schaller, J.; Rickli, E.; Brossmer, R.

    1988-02-01

    In the present study the complete amino acid sequence of human plasma Zn-α2-glycoprotein was determined. This protein, whose biological function is unknown, consists of a single polypeptide chain of 276 amino acid residues including 8 tryptophan residues and has a pyroglutamyl residue at the amino terminus. The location of the two disulfide bonds in the polypeptide chain was also established. The three glycans, whose structure was elucidated with the aid of 500 MHz 1H NMR spectroscopy, were sialylated N-biantennas. The molecular weight calculated from the polypeptide and carbohydrate structure is 38,478, which is close to the reported value of approximately 41,000 based on physicochemical measurements. The predicted secondary structure appeared to comprise 23% α-helix, 27% β-sheet, and 22% β-turns. The three N-glycans were found to be located in β-turn regions. An unexpected finding was made by computer analysis of the sequence data; this revealed that Zn-α2-glycoprotein is closely related to antigens of the major histocompatibility complex in amino acid sequence and in domain structure. There was an unusually high degree of sequence homology with the α chains of class I histocompatibility antigens. Moreover, this plasma protein was shown to be a member of the immunoglobulin gene superfamily. Zn-α2-glycoprotein appears to be a truncated secretory major histocompatibility complex-related molecule, and it may have a role in the expression of the immune response.

  12. Functional Divergence in the Genus Oenococcus as Predicted by Genome Sequencing of the Newly-Described Species, Oenococcus kitaharae

    PubMed Central

    Borneman, Anthony R.; McCarthy, Jane M.; Chambers, Paul J.; Bartowsky, Eveline J.

    2012-01-01

    Oenococcus kitaharae is only the second member of the genus Oenococcus to be identified and is the closest relative of the industrially important wine bacterium Oenococcus oeni. To provide insight into this new species, the genome of the type strain of O. kitaharae, DSM 17330, was sequenced. Comparison of the sequenced genomes of both species shows that the genome of O. kitaharae DSM 17330 contains many genes with predicted functions in cellular defence (bacteriocins, antimicrobials, restriction-modification systems and a CRISPR locus) which are lacking in O. oeni. The two genomes also appear to differentially encode several metabolic pathways that are associated with amino acid biosynthesis and carbohydrate utilization and that have direct phenotypic consequences. This would indicate that the two species have evolved different survival techniques to suit their particular environmental niches. O. oeni has adapted to survive in the harsh, but predictable, environment of wine that provides very few competitive species. However, O. kitaharae appears to have adapted to a growth environment in which biological competition provides a significant selective pressure by accumulating biological defence molecules, such as bacteriocins and restriction-modification systems, throughout its genome. PMID:22235313

  13. Analyses of mitochondrial amino acid sequence datasets support the proposal that specimens of Hypodontus macropi from three species of macropodid hosts represent distinct species

    PubMed Central

    2013-01-01

    Background Hypodontus macropi is a common intestinal nematode of a range of kangaroos and wallabies (macropodid marsupials). Based on previous multilocus enzyme electrophoresis (MEE) and nuclear ribosomal DNA sequence data sets, H. macropi has been proposed to be a complex of species. To test this proposal using independent molecular data, we sequenced the whole mitochondrial (mt) genomes of individuals of H. macropi from three different species of hosts (Macropus robustus robustus, Thylogale billardierii and Macropus [Wallabia] bicolor) as well as that of Macropicola ocydromi (a related nematode), and undertook a comparative analysis of the amino acid sequence datasets derived from these genomes. Results The mt genomes sequenced by next-generation (454) technology from H. macropi from the three host species varied from 13,634 bp to 13,699 bp in size. Pairwise comparisons of the amino acid sequences predicted from these three mt genomes revealed differences of 5.8% to 18%. Phylogenetic analysis of the amino acid sequence data sets using Bayesian Inference (BI) showed that H. macropi from the three different host species formed distinct, well-supported clades. In addition, sliding window analysis of the mt genomes defined variable regions for future population genetic studies of H. macropi in different macropodid hosts and geographical regions around Australia. Conclusions The present analyses of inferred mt protein sequence datasets clearly supported the hypothesis that H. macropi from M. robustus robustus, M. bicolor and T. billardierii represent distinct species. PMID:24261823
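
    A pairwise difference such as the 5.8% to 18% reported above is typically the share of aligned positions at which two sequences carry different residues. The sketch below assumes the sequences are already aligned to equal length and skips any column containing a gap; that gap convention, the function name and the toy sequences are assumptions, since the abstract does not state how the authors counted.

```python
def percent_difference(seq_a, seq_b, gap="-"):
    """Percentage of ungapped aligned columns whose residues differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # assumption: columns with a gap are not counted
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * differing / compared


# Toy aligned fragments, made up for illustration only.
diff_short = percent_difference("MKTL", "MKSL")
diff_gapped = percent_difference("MKTLLV-A", "MKSLLVGA")
```

    Percent identity is simply 100 minus this value under the same column-counting convention.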

  14. Structure-Templated Predictions of Novel Protein Interactions from Sequence Information

    PubMed Central

    Betel, Doron; Breitkreuz, Kevin E; Isserlin, Ruth; Dewar-Darch, Danielle; Tyers, Mike; Hogue, Christopher W. V

    2007-01-01

    The multitude of functions performed in the cell are largely controlled by a set of carefully orchestrated protein interactions often facilitated by specific binding of conserved domains in the interacting proteins. Interacting domains commonly exhibit distinct binding specificity to short and conserved recognition peptides called binding profiles. Although many conserved domains are known in nature, only a few have well-characterized binding profiles. Here, we describe a novel predictive method known as domain–motif interactions from structural topology (D-MIST) for elucidating the binding profiles of interacting domains. A set of domains and their corresponding binding profiles were derived from extant protein structures and protein interaction data and then used to predict novel protein interactions in yeast. A number of the predicted interactions were verified experimentally, including new interactions of the mitotic exit network, RNA polymerases, nucleotide metabolism enzymes, and the chaperone complex. These results demonstrate that new protein interactions can be predicted exclusively from sequence information. PMID:17892321

  15. Amino acid sequence of a vitamin K-dependent Ca2+-binding peptide from bovine prothrombin.

    PubMed

    Howard, J B; Fausch, M D

    1975-08-10

    The amino acid sequence of a 31-residue peptide from bovine prothrombin has been determined. This peptide has been shown to contain the vitamin K-dependent modification required for Ca2+ binding (Nelsestuen, G. L., and Suttie, J. W. (1973) Proc. Natl. Acad. Sci. U. S. A. 70, 3366-3370) and the modified amino acid, gamma-carboxyglutamic acid (Nelsestuen, G. L., Zytkovicz, T., and Howard, J. B. (1974) J. Biol. Chem. 249, 6347-6350). The peptide was shown to correspond to residues 12 to 42 of prothrombin. PMID:807581

  16. Amino acid sequences around the cysteine residues of rabbit muscle triose phosphate isomerase

    PubMed Central

    Miller, Janet C.; Waley, S. G.

    1971-01-01

    1. The nature of the subunits in rabbit muscle triose phosphate isomerase has been investigated. 2. Amino acid analyses show that there are five cysteine residues and two methionine residues/subunit. 3. The amino acid sequences around the cysteine residues have been determined; these account for about 75 residues. 4. Cleavage at the methionine residues with cyanogen bromide gave three fragments. 5. These results show that the subunits correspond to polypeptide chains, containing about 230 amino acid residues. The chains in triose phosphate isomerase seem to be shorter than those of other glycolytic enzymes. PMID:5165707

  17. Linguistic and spatial skills predict early arithmetic development via counting sequence knowledge.

    PubMed

    Zhang, Xiao; Koponen, Tuire; Räsänen, Pekka; Aunola, Kaisa; Lerkkanen, Marja-Kristiina; Nurmi, Jari-Erik

    2014-01-01

    Utilizing a longitudinal sample of Finnish children (ages 6-10), two studies examined how early linguistic (spoken vs. written) and spatial skills predict later development of arithmetic, and whether counting sequence knowledge mediates these associations. In Study 1 (N = 1,880), letter knowledge and spatial visualization, measured in kindergarten, predicted the level of arithmetic in first grade, and later growth through third grade. Study 2 (n = 378) further showed that these associations were mediated by counting sequence knowledge measured in first grade. These studies add to the literature by demonstrating the importance of written language for arithmetic development. The findings are consistent with the hypothesis that linguistic and spatial skills can improve arithmetic development by enhancing children's number-related knowledge. PMID:24148144

  18. Complete amino acid sequence of the Mu heavy chain of a human IgM immunoglobulin.

    PubMed

    Putnam, F W; Florent, G; Paul, C; Shinoda, T; Shimizu, A

    1973-10-19

    The amino acid sequence of the mu chain of a human IgM immunoglobulin, including the location of all disulfide bridges and oligosaccharides, has been determined. The homology of the constant regions of immunoglobulin mu, gamma, alpha, and epsilon heavy chains reveals evolutionary relationships and suggests that two genes code for each heavy chain. PMID:4742735

  19. Draft Genome Sequence of the Butyric Acid Producer Clostridium tyrobutyricum Strain CIP I-776 (IFP923)

    PubMed Central

    Clément, Benjamin; Lopes Ferreira, Nicolas

    2016-01-01

    Here, we report the draft genome sequence of Clostridium tyrobutyricum CIP I-776 (IFP923), an efficient producer of butyric acid. The genome consists of a single chromosome of 3.19 Mb and provides useful data concerning the metabolic capacities of the strain. PMID:26941139

  20. Draft Genome Sequence of Perfluorooctane Acid-Degrading Bacterium Pseudomonas parafulva YAB-1

    PubMed Central

    Tang, Chongjian; Peng, Qingjing; Peng, Qingzhong

    2015-01-01

    Pseudomonas parafulva YAB-1, isolated from perfluorinated compound-contaminated soil, has the ability to degrade perfluorooctane acid (PFOA) compound. Here, we report the draft genome sequence and annotation of the PFOA-degrading bacterium P. parafulva YAB-1. The data provide the basis to investigate the molecular mechanism of PFOA metabolism. PMID:26337877

  1. Cloning, sequence analysis and three-dimensional structure prediction of DNA pol I from thermophilic Geobacillus sp. MKK isolated from an Iranian hot spring.

    PubMed

    Khalaj-Kondori, Mohammad; Sadeghizadeh, Majid; Khajeh, Khosro; Naderi-Manesh, Hossein; Ahadi, Ali Mohammad; Emamzadeh, Abdorahman

    2007-08-01

    Molecular phylogenetic analysis of a novel thermophilic eubacterium isolated from an Iranian hot spring, based on its 16S rDNA sequence, showed that the new isolate belongs to the genus Geobacillus. The DNA pol I gene from this isolate was amplified, cloned, and sequenced, and the three-dimensional (3D) structure of the deduced amino acid sequence was predicted. Sequence analysis revealed that the gene is 2,631 bp long, encodes a protein of 876 amino acids with a calculated molecular mass of 99 kDa, and belongs to the family A DNA polymerases. Comparison of the 3'-5' exonuclease domain of Klenow fragment (KF) with the corresponding region of the newly identified DNA pol I (MF), the large fragment of Bacillus stearothermophilus DNA pol I (BF) and Klentaq1 revealed not only deletions in three regions compared to KF, but also that three of the four critical metal-binding residues in KF (Asp355, Glu357, Asp424, and Asp501) are altered in MF. The predicted 3D structure and sequence alignments between MF and BF showed that all critical residues in the polymerase active site are conserved. PMID:18025581
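
    The relationship between the reported gene length and protein length can be checked with a short sketch. It assumes the 2,631 bp figure includes the stop codon, and it uses the common rule of thumb of about 110 Da per residue, which gives only a rough mass estimate and is not the calculation used by the authors. The function names are hypothetical.

```python
AVERAGE_RESIDUE_MASS_DA = 110.0  # rule-of-thumb average, an assumption


def protein_length_from_orf(orf_bp: int, includes_stop: bool = True) -> int:
    """Number of amino acids encoded by an open reading frame."""
    if orf_bp % 3 != 0:
        raise ValueError("ORF length must be a multiple of 3")
    codons = orf_bp // 3
    # The stop codon does not encode an amino acid.
    return codons - 1 if includes_stop else codons


def approx_mass_kda(n_residues: int) -> float:
    """Rough molecular mass in kDa from the residue count."""
    return n_residues * AVERAGE_RESIDUE_MASS_DA / 1000.0


pol_length = protein_length_from_orf(2631)
pol_mass_kda = approx_mass_kda(pol_length)
```

    Under these assumptions the sketch gives 876 residues, matching the abstract, and a mass in the mid-90s of kDa, in the same range as the 99 kDa calculated from the actual sequence.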

  2. The amino acid sequence of cytochrome c-555 from the methane-oxidizing bacterium Methylococcus capsulatus.

    PubMed Central

    Ambler, R P; Dalton, H; Meyer, T E; Bartsch, R G; Kamen, M D

    1986-01-01

    The amino acid sequence of the cytochrome c-555 from the obligate methanotroph Methylococcus capsulatus strain Bath (N.C.I.B. 11132) was determined. It is a single polypeptide chain of 96 residues, binding a haem group through the cysteine residues at positions 19 and 22, and the only methionine residue is at position 59. The sequence does not closely resemble that of any other cytochrome c that has yet been characterized. Detailed evidence for the amino acid sequence of the protein has been deposited as Supplementary Publication SUP 50131 (12 pages) at the British Library Lending Division, Boston Spa, West Yorkshire LS23 7BQ, U.K., from whom copies are available on prepayment. PMID:3006666

  3. Sequence-based prediction of protein-peptide binding sites using support vector machine.

    PubMed

    Taherzadeh, Ghazaleh; Yang, Yuedong; Zhang, Tuo; Liew, Alan Wee-Chung; Zhou, Yaoqi

    2016-05-15

    Protein-peptide interactions are essential for all cellular processes including DNA repair, replication, gene expression, and metabolism. As most protein-peptide interactions are uncharacterized, it is cost-effective to investigate them computationally as the first step. All existing approaches for predicting protein-peptide binding sites, however, are based on protein structures despite the fact that the structures for most proteins are not yet solved. This article proposes the first machine-learning method, called SPRINT, to make Sequence-based prediction of Protein-peptide Residue-level Interactions. SPRINT yields robust and consistent performance in 10-fold cross-validation and on an independent test set. The most important feature is evolution-generated sequence profiles. For the test set (1056 binding and non-binding residues), it yields a Matthews' Correlation Coefficient of 0.326 with a sensitivity of 64% and a specificity of 68%. This sequence-based technique is comparable to or more accurate than structure-based methods for peptide-binding site prediction. SPRINT is available as an online server at: http://sparks-lab.org/. © 2016 Wiley Periodicals, Inc. PMID:26833816
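
    For readers unfamiliar with the metrics quoted in this abstract, the following minimal sketch shows how sensitivity, specificity and the Matthews correlation coefficient are computed from a binary confusion matrix. The counts are hypothetical, chosen only to give 64% sensitivity and 68% specificity; they are not the paper's data.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int):
    """Return (sensitivity, specificity, MCC) for a binary classifier."""
    sensitivity = tp / (tp + fn)  # fraction of true binders recovered
    specificity = tn / (tn + fp)  # fraction of non-binders rejected
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc


# Hypothetical counts (100 binding and 100 non-binding residues).
sens, spec, mcc = binary_metrics(tp=64, fp=32, tn=68, fn=36)
```

    Unlike sensitivity and specificity, the MCC depends on the class balance, so the value obtained for these balanced toy counts should not be read as a reproduction of the reported 0.326.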

  4. Protein inter-domain linker prediction using Random Forest and amino acid physiochemical properties

    PubMed Central

    2014-01-01

    Background Protein chains are generally long and consist of multiple domains. Domains are distinct structural units of a protein that can evolve and function independently. The accurate prediction of protein domain linkers and boundaries is often regarded as the initial step of protein tertiary structure and function predictions. Such information not only enhances protein-targeted drug development but also reduces the experimental cost of protein analysis by allowing researchers to work on a set of smaller and independent units. In this study, we propose a novel and accurate domain-linker prediction approach based on protein primary structure information only. We utilize a nature-inspired machine-learning model called Random Forest along with a novel domain-linker profile that contains physiochemical and domain-linker information of amino acid sequences. Results The proposed approach was tested on two well-known benchmark protein datasets and achieved 68% sensitivity and 99% precision, which is better than any existing protein domain-linker predictor. Without applying any data balancing technique such as class weighting and data re-sampling, the proposed approach is able to accurately classify inter-domain linkers from highly imbalanced datasets. Conclusion Our experimental results prove that the proposed approach is useful for domain-linker identification in highly imbalanced single- and multi-domain proteins. PMID:25521329

  5. Allelic polymorphism in arabian camel ribonuclease and the amino acid sequence of bactrian camel ribonuclease.

    PubMed

    Welling, G W; Mulder, H; Beintema, J J

    1976-04-01

    Pancreatic ribonucleases from several species (whitetail deer, roe deer, guinea pig, and arabian camel) exhibit more than one amino acid at particular positions in their amino acid sequences. Since these enzymes were isolated from pooled pancreas, the origin of this heterogeneity is not clear. The pancreatic ribonucleases from 11 individual arabian camels (Camelus dromedarius) have been investigated with respect to the lysine-glutamine heterogeneity at position 103 (Welling et al., 1975). Six ribonucleases showed only one basic band and five showed two bands after polyacrylamide gel electrophoresis, suggesting a gene frequency of about 0.75 for the Lys gene and about 0.25 for the Gln gene. The amino acid sequence of bactrian camel (Camelus bactrianus) ribonuclease isolated from individual pancreatic tissue was determined and compared with that of arabian camel ribonuclease. The only difference was observed at position 103. In the ribonucleases from two unrelated bactrian camels, only glutamine was observed at that position. PMID:962846
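
    The gene frequencies quoted in this abstract follow from simple allele counting. The sketch below assumes that the six single-band animals are Lys/Lys homozygotes and the five two-band animals are Lys/Gln heterozygotes; that genotype interpretation and the function name are assumptions made for illustration.

```python
def allele_frequencies(n_hom_a: int, n_het: int, n_hom_b: int):
    """Frequencies of alleles A and B from diploid genotype counts."""
    total_alleles = 2 * (n_hom_a + n_het + n_hom_b)
    if total_alleles == 0:
        raise ValueError("at least one individual is required")
    freq_a = (2 * n_hom_a + n_het) / total_alleles
    freq_b = (2 * n_hom_b + n_het) / total_alleles
    return freq_a, freq_b


# 6 presumed Lys/Lys, 5 presumed Lys/Gln, 0 Gln/Gln among 11 camels.
lys_freq, gln_freq = allele_frequencies(n_hom_a=6, n_het=5, n_hom_b=0)
```

    With these counts the Lys allele accounts for 17 of 22 alleles (about 0.77) and the Gln allele for 5 of 22 (about 0.23), consistent with the rounded values of about 0.75 and 0.25 given above.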

  6. Many amino acid substitution variants identified in DNA repair genes during human population screenings are predicted to impact protein function

    SciTech Connect

    Xi, T; Jones, I M; Mohrenweiser, H W

    2003-11-03

    Over 520 different amino acid substitution variants have been previously identified in the systematic screening of 91 human DNA repair genes for sequence variation. Two algorithms were employed to predict the impact of these amino acid substitutions on protein activity. Sorting Intolerant From Tolerant (SIFT) classified 226 of 508 variants (44%) as "Intolerant". Polymorphism Phenotyping (PolyPhen) classed 165 of 489 amino acid substitutions (34%) as "Probably or Possibly Damaging". Another 9-15% of the variants were classed as "Potentially Intolerant or Damaging". The results from the two algorithms are highly associated, with concordance in predicted impact observed for approximately 62% of the variants. Twenty-one to thirty-one percent of the variant proteins are predicted to exhibit reduced activity by both algorithms. These variants occur at slightly lower individual allele frequency than do the variants classified as "Tolerant" or "Benign". Both algorithms correctly predicted the impact of 26 functionally characterized amino acid substitutions in the APE1 protein on biochemical activity, with one exception. It is concluded that a substantial fraction of the missense variants observed in the general human population are functionally relevant. These variants are expected to be the molecular genetic and biochemical basis for the associations of reduced DNA repair capacity phenotypes with elevated cancer risk.

  7. Quantitative analysis and prediction of G-quadruplex forming sequences in double-stranded DNA

    PubMed Central

    Kim, Minji; Kreig, Alex; Lee, Chun-Ying; Rube, H. Tomas; Calvert, Jacob; Song, Jun S.; Myong, Sua

    2016-01-01

    G-quadruplex (GQ) is a four-stranded DNA structure that can be formed in guanine-rich sequences. GQ structures have been proposed to regulate diverse biological processes including transcription, replication, translation and telomere maintenance. Recent studies have demonstrated the existence of GQ DNA in live mammalian cells and a significant number of potential GQ forming sequences in the human genome. We present a systematic and quantitative analysis of GQ folding propensity on a large set of 438 GQ forming sequences in double-stranded DNA by integrating fluorescence measurement, single-molecule imaging and computational modeling. We find that short minimum loop length and the thymine base are two main factors that lead to high GQ folding propensity. Linear and Gaussian process regression models further validate that the GQ folding potential can be predicted with high accuracy based on the loop length distribution and the nucleotide content of the loop sequences. Our study provides important new parameters that can inform the evaluation and classification of putative GQ sequences in the human genome. PMID:27095201
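
    The two predictors highlighted in this abstract, loop length and thymine content, can be extracted from a candidate sequence with a short sketch. The regular expression below is the widely used consensus pattern for putative quadruplexes (four runs of three or more guanines separated by loops of 1 to 7 nucleotides); it is an assumption for illustration and is not stated to be the definition used by the authors. The function name is hypothetical.

```python
import re

# Four G-runs (>= 3 G) separated by three loops of 1-7 nt (assumed motif).
GQ_PATTERN = re.compile(
    r"G{3,}([ACGT]{1,7}?)G{3,}([ACGT]{1,7}?)G{3,}([ACGT]{1,7}?)G{3,}"
)


def loop_features(sequence: str):
    """Return loop sequences, minimum loop length and loop thymine fraction.

    Returns None when no putative quadruplex motif is found.
    """
    match = GQ_PATTERN.search(sequence.upper())
    if match is None:
        return None
    loops = list(match.groups())
    total_loop_nt = sum(len(loop) for loop in loops)
    thymines = sum(loop.count("T") for loop in loops)
    return {
        "loops": loops,
        "min_loop_length": min(len(loop) for loop in loops),
        "thymine_fraction": thymines / total_loop_nt,
    }


# Human telomeric repeat as a familiar test case.
telomere = loop_features("GGGTTAGGGTTAGGGTTAGGG")
```

    Features of this kind could serve as inputs to the linear or Gaussian process regression models mentioned above, although the exact feature encoding used in the study is not given in the abstract.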

  8. Quantitative analysis and prediction of G-quadruplex forming sequences in double-stranded DNA.

    PubMed

    Kim, Minji; Kreig, Alex; Lee, Chun-Ying; Rube, H Tomas; Calvert, Jacob; Song, Jun S; Myong, Sua

    2016-06-01

    G-quadruplex (GQ) is a four-stranded DNA structure that can be formed in guanine-rich sequences. GQ structures have been proposed to regulate diverse biological processes including transcription, replication, translation and telomere maintenance. Recent studies have demonstrated the existence of GQ DNA in live mammalian cells and a significant number of potential GQ forming sequences in the human genome. We present a systematic and quantitative analysis of GQ folding propensity on a large set of 438 GQ forming sequences in double-stranded DNA by integrating fluorescence measurement, single-molecule imaging and computational modeling. We find that short minimum loop length and the thymine base are two main factors that lead to high GQ folding propensity. Linear and Gaussian process regression models further validate that the GQ folding potential can be predicted with high accuracy based on the loop length distribution and the nucleotide content of the loop sequences. Our study provides important new parameters that can inform the evaluation and classification of putative GQ sequences in the human genome. PMID:27095201

  9. Species specific amino acid sequence-protein local structure relationships: An analysis in the light of a structural alphabet.

    PubMed

    de Brevern, Alexandre G; Joseph, Agnel Praveen

    2011-05-01

    Protein structure analysis and prediction methods are based on non-redundant data extracted from the available protein structures, regardless of the species from which the protein originates. Hence, these datasets represent the global knowledge on protein folds, which constitutes a generic distribution of amino acid sequence-protein structure (AAS-PS) relationships. In this study, we try to elucidate whether the AAS-PS relationship could possess specificities depending on the species. For this purpose, we have chosen three different species: Saccharomyces cerevisiae, Plasmodium falciparum and Arabidopsis thaliana. We analyzed the AAS-PS behaviors of the proteins from these three species and compared them to the "expected" distribution of a classical non-redundant databank. With the classical secondary structure description, only slight differences in amino acid preferences could be observed. With a more precise description of local protein structures (Protein Blocks), significant changes could be highlighted. S. cerevisiae's AAS-PS relationship is close to the general distribution, while striking differences are observed in the case of A. thaliana. P. falciparum is the most distant one. This study presents some interesting viewpoints on the AAS-PS relationship. Certain species exhibit unique preferences for amino acids to be associated with protein local structural elements. Thus, AAS-PS relationships are species dependent. These results can give useful insights for improving prediction methodologies which take the species-specific information into account. PMID:21333657

  10. Software scripts for quality checking of high-throughput nucleic acid sequencers.

    PubMed

    Lazo, G R; Tong, J; Miller, R; Hsia, C; Rausch, C; Kang, Y; Anderson, O D

    2001-06-01

    We have developed a graphical interface to allow the researcher to view and assess the quality of sequencing results using a series of program scripts developed to process data generated by automated sequencers. The scripts are written in Perl programming language and are executable under the cgibin directory of a Web server environment. The scripts direct nucleic acid sequencing trace file data output from automated sequencers to be analyzed by the phred molecular biology program and are displayed as graphical hypertext mark-up language (HTML) pages. The scripts are mainly designed to handle 96-well microtiter dish samples, but the scripts are also able to read data from 384-well microtiter dishes 96 samples at a time. The scripts may be customized for different laboratory environments and computer configurations. Web links to the sources and discussion page are provided. PMID:11414222

  11. The four ingredients of single-sequence RNA secondary structure prediction. A unifying perspective

    PubMed Central

    Rivas, Elena

    2013-01-01

    Any method for RNA secondary structure prediction is determined by four ingredients. The architecture is the choice of features implemented by the model (such as stacked basepairs, loop length distributions, etc.). The architecture determines the number of parameters in the model. The scoring scheme is the nature of those parameters (whether thermodynamic, probabilistic, or weights). The parameterization stands for the specific values assigned to the parameters. These three ingredients are referred to as “the model.” The fourth ingredient is the folding algorithms used to predict plausible secondary structures given the model and the sequence of a structural RNA. Here, I make several unifying observations drawn from looking at more than 40 years of methods for RNA secondary structure prediction in the light of this classification. As a final observation, there seems to be a performance ceiling that affects all methods with complex architectures, a ceiling that impacts all scoring schemes with remarkable similarity. This suggests that modeling RNA secondary structure by using intrinsic sequence-based plausible “foldability” will require the incorporation of other forms of information in order to constrain the folding space and to improve prediction accuracy. This could give an advantage to probabilistic scoring systems since a probabilistic framework is a natural platform to incorporate different sources of information into one single inference problem. PMID:23695796
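
    The four ingredients named in this abstract can be made concrete with the simplest classical example, a Nussinov-style base-pair maximization. In this sketch the architecture is "nested base pairs only", the scoring scheme is a plain weight, the parameterization is +1 per allowed pair, and the folding algorithm is dynamic programming. The allowed pair set and the minimum hairpin loop of 3 are assumptions chosen for illustration; the abstract does not describe this particular model.

```python
# Allowed pairs: Watson-Crick plus G-U wobble (assumed for this sketch).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def nussinov_max_pairs(seq: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs in seq (score +1 per pair)."""
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] = best score for the subsequence seq[i..j], inclusive.
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case: position j is unpaired
            # case: j pairs with k, leaving at least min_loop bases between
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]


hairpin_pairs = nussinov_max_pairs("GGGAAAUCC")
```

    Replacing the +1 weight with thermodynamic or probabilistic parameters changes the scoring scheme and parameterization while leaving the dynamic-programming folding algorithm largely intact, which is the kind of separation the classification above draws.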

  12. Using machine learning to predict gene expression and discover sequence motifs

    NASA Astrophysics Data System (ADS)

    Li, Xuejing

    Recently, large amounts of experimental data for complex biological systems have become available. We use tools and algorithms from machine learning to build data-driven predictive models. We first present a novel algorithm to discover gene sequence motifs associated with temporal expression patterns of genes. Our algorithm, which is based on partial least squares (PLS) regression, is able to directly model the flow of information, from gene sequence to gene expression, to learn cis regulatory motifs and characterize associated gene expression patterns. Our algorithm outperforms traditional computational methods e.g. clustering in motif discovery. We then present a study of extending a machine learning model for transcriptional regulation predictive of genetic regulatory response to Caenorhabditis elegans. We show meaningful results both in terms of prediction accuracy on the test experiments and biological information extracted from the regulatory program. The model discovers DNA binding sites ab initio. We also present a case study where we detect a signal of lineage-specific regulation. Finally we present a comparative study on learning predictive models for motif discovery, based on different boosting algorithms: Adaptive Boosting (AdaBoost), Linear Programming Boosting (LPBoost) and Totally Corrective Boosting (TotalBoost). We evaluate and compare the performance of the three boosting algorithms via both statistical and biological validation, for hypoxia response in Saccharomyces cerevisiae.

  13. Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies

    PubMed Central

    Dong, Chengliang; Wei, Peng; Jian, Xueqiu; Gibbs, Richard; Boerwinkle, Eric; Wang, Kai; Liu, Xiaoming

    2015-01-01

    Accurate deleteriousness prediction for nonsynonymous variants is crucial for distinguishing pathogenic mutations from background polymorphisms in whole exome sequencing (WES) studies. Although many deleteriousness prediction methods have been developed, their prediction results are sometimes inconsistent with each other and their relative merits are still unclear in practical applications. To address these issues, we comprehensively evaluated the predictive performance of 18 current deleteriousness-scoring methods, including 11 function prediction scores (PolyPhen-2, SIFT, MutationTaster, Mutation Assessor, FATHMM, LRT, PANTHER, PhD-SNP, SNAP, SNPs&GO and MutPred), 3 conservation scores (GERP++, SiPhy and PhyloP) and 4 ensemble scores (CADD, PON-P, KGGSeq and CONDEL). We found that FATHMM and KGGSeq had the highest discriminative power among independent scores and ensemble scores, respectively. Moreover, to ensure unbiased performance evaluation of these prediction scores, we manually collected three distinct testing datasets, on which no current prediction scores were tuned. In addition, we developed two new ensemble scores that integrate nine independent scores and allele frequency. Our scores achieved the highest discriminative power compared with all the deleteriousness prediction scores tested and showed low false-positive prediction rate for benign yet rare nonsynonymous variants, which demonstrated the value of combining information from multiple orthologous approaches. Finally, to facilitate variant prioritization in WES studies, we have pre-computed our ensemble scores for 87 347 044 possible variants in the whole-exome and made them publicly available through the ANNOVAR software and the dbNSFP database. PMID:25552646

  14. Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies.

    PubMed

    Dong, Chengliang; Wei, Peng; Jian, Xueqiu; Gibbs, Richard; Boerwinkle, Eric; Wang, Kai; Liu, Xiaoming

    2015-04-15

    Accurate deleteriousness prediction for nonsynonymous variants is crucial for distinguishing pathogenic mutations from background polymorphisms in whole exome sequencing (WES) studies. Although many deleteriousness prediction methods have been developed, their prediction results are sometimes inconsistent with each other and their relative merits are still unclear in practical applications. To address these issues, we comprehensively evaluated the predictive performance of 18 current deleteriousness-scoring methods, including 11 function prediction scores (PolyPhen-2, SIFT, MutationTaster, Mutation Assessor, FATHMM, LRT, PANTHER, PhD-SNP, SNAP, SNPs&GO and MutPred), 3 conservation scores (GERP++, SiPhy and PhyloP) and 4 ensemble scores (CADD, PON-P, KGGSeq and CONDEL). We found that FATHMM and KGGSeq had the highest discriminative power among independent scores and ensemble scores, respectively. Moreover, to ensure unbiased performance evaluation of these prediction scores, we manually collected three distinct testing datasets, on which no current prediction scores were tuned. In addition, we developed two new ensemble scores that integrate nine independent scores and allele frequency. Our scores achieved the highest discriminative power compared with all the deleteriousness prediction scores tested and showed low false-positive prediction rate for benign yet rare nonsynonymous variants, which demonstrated the value of combining information from multiple orthologous approaches. Finally, to facilitate variant prioritization in WES studies, we have pre-computed our ensemble scores for 87 347 044 possible variants in the whole-exome and made them publicly available through the ANNOVAR software and the dbNSFP database. PMID:25552646

  15. Efficient Nucleic Acid Extraction and 16S rRNA Gene Sequencing for Bacterial Community Characterization.

    PubMed

    Anahtar, Melis N; Bowman, Brittany A; Kwon, Douglas S

    2016-01-01

    There is a growing appreciation for the role of microbial communities as critical modulators of human health and disease. High throughput sequencing technologies have allowed for the rapid and efficient characterization of bacterial communities using 16S rRNA gene sequencing from a variety of sources. Although readily available tools for 16S rRNA sequence analysis have standardized computational workflows, sample processing for DNA extraction remains a continued source of variability across studies. Here we describe an efficient, robust, and cost effective method for extracting nucleic acid from swabs. We also delineate downstream methods for 16S rRNA gene sequencing, including generation of sequencing libraries, data quality control, and sequence analysis. The workflow can accommodate multiple samples types, including stool and swabs collected from a variety of anatomical locations and host species. Additionally, recovered DNA and RNA can be separated and used for other applications, including whole genome sequencing or RNA-seq. The method described allows for a common processing approach for multiple sample types and accommodates downstream analysis of genomic, metagenomic and transcriptional information. PMID:27168460

  16. Efficient Nucleic Acid Extraction and 16S rRNA Gene Sequencing for Bacterial Community Characterization

    PubMed Central

    Anahtar, Melis N.; Bowman, Brittany A.; Kwon, Douglas S.

    2016-01-01

    There is a growing appreciation for the role of microbial communities as critical modulators of human health and disease. High throughput sequencing technologies have allowed for the rapid and efficient characterization of bacterial communities using 16S rRNA gene sequencing from a variety of sources. Although readily available tools for 16S rRNA sequence analysis have standardized computational workflows, sample processing for DNA extraction remains a continued source of variability across studies. Here we describe an efficient, robust, and cost effective method for extracting nucleic acid from swabs. We also delineate downstream methods for 16S rRNA gene sequencing, including generation of sequencing libraries, data quality control, and sequence analysis. The workflow can accommodate multiple samples types, including stool and swabs collected from a variety of anatomical locations and host species. Additionally, recovered DNA and RNA can be separated and used for other applications, including whole genome sequencing or RNA-seq. The method described allows for a common processing approach for multiple sample types and accommodates downstream analysis of genomic, metagenomic and transcriptional information. PMID:27168460

  17. Predicting and improving the protein sequence alignment quality by support vector regression

    PubMed Central

    Lee, Minho; Jeong, Chan-seok; Kim, Dongsup

    2007-01-01

    Background For successful protein structure prediction by comparative modeling, in addition to identifying a good template protein with known structure, obtaining an accurate sequence alignment between a query protein and a template protein is critical. It has been known that the alignment accuracy can vary significantly depending on our choice of various alignment parameters such as gap opening penalty and gap extension penalty. Because the accuracy of sequence alignment is typically measured by comparing it with its corresponding structure alignment, there is no good way of evaluating alignment accuracy without knowing the structure of a query protein, which is obviously not available at the time of structure prediction. Moreover, there is no universal alignment parameter option that would always yield the optimal alignment. Results In this work, we develop a method to predict the quality of the alignment between a query and a template. We train support vector regression (SVR) models to predict the MaxSub scores as a measure of alignment quality. The alignment between a query protein and a template of length n is transformed into an (n + 1)-dimensional feature vector, which is then used as an input to predict the alignment quality by the trained SVR model. Performance of our work is evaluated by various measures including the Pearson correlation coefficient between the observed and predicted MaxSub scores. The results show a high correlation coefficient of 0.945. For a pair of query and template, 48 alignments are generated by changing alignment options. Trained SVR models are then applied to predict the MaxSub scores of those and to select the best alignment option, which is chosen specifically for the query-template pair. This adaptive selection procedure results in a 7.4% improvement of MaxSub scores, compared to those when the single best parameter option is used for all query-template pairs. Conclusion The present work demonstrates that the alignment quality can be

  18. Preparation of Nucleic Acid Libraries for Personalized Sequencing Systems Using an Integrated Microfluidic Hub Technology (Seventh Annual Sequencing, Finishing, Analysis in the Future (SFAF) Meeting 2012)

    ScienceCinema

    Patel, Kamlesh D [Ken]; SNL,

    2013-01-25

    Kamlesh (Ken) Patel from Sandia National Laboratories (Livermore, California) presents "Preparation of Nucleic Acid Libraries for Personalized Sequencing Systems Using an Integrated Microfluidic Hub Technology" at the 7th Annual Sequencing, Finishing, Analysis in the Future (SFAF) Meeting held in June 2012 in Santa Fe, NM.

  19. The amino acid sequence of ribonuclease U2 from Ustilago sphaerogena.

    PubMed Central

    Sato, S; Uchida, T

    1975-01-01

    1. RNAase (ribonuclease) U2, a purine-specific RNAase, was reduced, aminoethylated and hydrolysed with trypsin, chymotrypsin and thermolysin. On the basis of the analyses of the resulting peptides, the complete amino acid sequence of RNAase U2 was determined. 2. When the sequence was compared with the amino acid sequence of RNAase T1 (EC 3.1.4.8), the following regions were found to be similar in the two enzymes: Tyr-Pro-His-Gln-Tyr (38-42) in RNAase U2 and Tyr-Pro-His-Lys-Tyr (38-42) in RNAase T1, Glu-Phe-Pro-Leu-Val (61-65) in RNAase U2 and Glu-Trp-Pro-Ile-Leu (58-62) in RNAase T1, Asp-Arg-Val-Ile-Tyr-Gln (83-88) in RNAase U2 and Asp-Arg-Val-Phe-Asn (76-81) in RNAase T1 and Val-Thr-His-Thr-Gly-Ala (98-103) in RNAase U2 and Ile-Thr-His-Thr-Gly-Ala (90-95) in RNAase T1. All of the amino acid residues, histidine-40, glutamate-58, arginine-77 and histidine-92, which were found to play a crucial role in the biological activity of RNAase T1, were included in the regions cited here. 3. Detailed evidence for the amino acid sequence of the protein has been deposited as Supplementary Publication SUP 50041 (33 pages) at the British Library (Lending Division) (formerly the National Lending Library for Science and Technology), Boston Spa, Yorks. LS23 7BQ, U.K., from whom copies can be obtained on the terms indicated in Biochem. J. (1975), 145, 5. PMID:1156364

  20. Prediction of Type II Toxin-Antitoxin Loci in Klebsiella pneumoniae Genome Sequences.

    PubMed

    Wei, Yi-Qing; Bi, De-Xi; Wei, Dong-Qing; Ou, Hong-Yu

    2016-06-01

    Klebsiella pneumoniae is an increasingly important bacterial pathogen in humans. This Gram-negative bacterial species has become a serious concern due to the dramatic increase in its resistance to multiple antibiotics, particularly carbapenems. The toxin-antitoxin (TA) system has recently been reported to be involved in the formation of drug-tolerant persister cells. The type II TA system is composed of a stable toxin protein and a relatively unstable antitoxin protein that is able to inhibit the toxin. Here, we examine the type II TA locus distribution and compare the TA diversity throughout ten completely sequenced K. pneumoniae genomes by using bioinformatics approaches. Two hundred and twelve putative type II TA loci were identified in 30 replicons of these K. pneumoniae strains. The amino acid sequence similarity-based grouping shows that these loci are distributed differently not only among different K. pneumoniae strains isolated from diverse sources, but also between their chromosomes and plasmids. PMID:26662948

  1. Complete nucleic acid sequence of Penaeus stylirostris densovirus (PstDNV) from India.

    PubMed

    Rai, Praveen; Safeena, Muhammed P; Karunasagar, Iddya; Karunasagar, Indrani

    2011-06-01

    Infectious hypodermal and hematopoietic necrosis virus (IHHNV) of shrimp has recently been classified as Penaeus stylirostris densovirus (PstDNV). The complete nucleic acid sequence of PstDNV from India was obtained by cloning and sequencing different DNA fragments of the virus. The genome organisation of PstDNV revealed that there were three major coding domains: a left ORF (NS1) of 2001 bp, a mid ORF (NS2) of 1092 bp and a right ORF (VP) of 990 bp. The complete genome and amino acid sequences of three proteins, viz. NS1, NS2 and VP, were compared with the genomes of the virus reported from Hawaii, China and Mexico and with partial sequences available from isolates from different regions. The phylogenetic analysis of shrimp, insect and vertebrate parvovirus sequences showed that the Indian PstDNV isolate is phylogenetically more closely related to one of the three isolates from Taiwan (AY355307), and two isolates (AY362547 and AY102034) from Thailand. PMID:21402111

  2. Human liver type pyruvate kinase: complete amino acid sequence and the expression in mammalian cells.

    PubMed Central

    Tani, K; Fujii, H; Nagata, S; Miwa, S

    1988-01-01

    Pyruvate kinase (PK) has four isozymes (L, R, M1, M2) that are encoded by two different genes. Among these isozymes, abnormalities of liver (L)-type PK are considered to be associated with hereditary nonspherocytic hemolytic anemia in humans. We isolated and determined the full-length sequence of human L-type PK cDNA. The cDNA contains 1629 base pairs encoding 543 amino acids, 68 base pairs of 5'-noncoding sequence, and 734 base pairs of 3'-noncoding sequence. The similarity between human and rat L-type PK was 86.9% at the nucleotide sequence level and 92.4% at the amino acid sequence level. The full-length L-type PK cDNA was placed under the promoter of simian virus 40 and introduced into monkey COS cells. Human L-type PK activity was detected in the extract of COS cells by the classical PK electrophoresis method. PMID:3126495

  3. Human liver type pyruvate kinase: Complete amino acid sequence and the expression in mammalian cells

    SciTech Connect

    Tani, Kenzaburo; Nagata, Shigekazu; Fujii, Hisaichi; Miwa, Shiro

    1988-03-01

    Pyruvate kinase (PK) has four isozymes (L, R, M1, M2) that are encoded by two different genes. Among these isozymes, abnormalities of liver (L)-type PK are considered to be associated with hereditary nonspherocytic hemolytic anemia in humans. The authors isolated and determined the full-length sequence of human L-type PK cDNA. The cDNA contains 1,629 base pairs encoding 543 amino acids, 68 base pairs of 5'-noncoding sequence, and 734 base pairs of 3'-noncoding sequence. The similarity between human and rat L-type PK was 86.9% at the nucleotide sequence level and 92.4% at the amino acid sequence level. The full-length L-type PK cDNA was placed under the promoter of simian virus 40 and introduced into monkey COS cells. Human L-type PK activity was detected in the extract of COS cells by the classical PK electrophoresis method.

  4. The Recipe for Protein Sequence-Based Function Prediction and Its Implementation in the ANNOTATOR Software Environment.

    PubMed

    Eisenhaber, Birgit; Kuchibhatla, Durga; Sherman, Westley; Sirota, Fernanda L; Berezovsky, Igor N; Wong, Wing-Cheong; Eisenhaber, Frank

    2016-01-01

    As biomolecular sequencing is becoming the main technique in life sciences, functional interpretation of sequences in terms of biomolecular mechanisms with in silico approaches is getting increasingly significant. Function prediction tools are most powerful for protein-coding sequences; yet, the concepts and technologies used for this purpose are not well reflected in bioinformatics textbooks. Notably, protein sequences typically consist of globular domains and non-globular segments. The two types of regions require cardinally different approaches for function prediction. Whereas the former are classic targets for homology-inspired function transfer based on remnant, yet statistically significant sequence similarity to other, characterized sequences, the latter type of regions are characterized by compositional bias or simple, repetitive patterns and require lexical analysis and/or empirical sequence pattern-function correlations. The recipe for function prediction recommends first to find all types of non-globular segments and, then, to subject the remaining query sequence to sequence similarity searches. We provide an updated description of the ANNOTATOR software environment as an advanced example of a software platform that facilitates protein sequence-based function prediction. PMID:27115649

  5. Molecular cytogenetics by polymerase catalyzed amplification or in situ labelling of specific nucleic acid sequences

    SciTech Connect

    Bolund, L.; Brandt, C.; Hindkjaer, J.; Koch, J.; Koelvraa, S.; Pedersen, S.

    1993-01-01

    The Polymerase Chain Reaction (PCR) can be performed on isolated cells or chromosomes and the product can be analyzed by DNA technology or by FISH to test metaphases. The authors have good experiences analyzing aberrant chromosomes by FACS sorting, PCR with degenerated primers and painting of test metaphases with the PCR product. They also utilize polymerases for PRimed IN Situ labelling (PRINS) of specific nucleic acid sequences. In PRINS oligonucleotides are hybridized to their target sequences and labeled nucleotides are incorporated at the site of hybridization with the oligonucleotide as primer. PRINS may eventually allow the study of individual genes, gene expression and even somatic mutations (in mRNA) in single cells.

  6. DNA Cloning of Plasmodium falciparum Circumsporozoite Gene: Amino Acid Sequence of Repetitive Epitope

    NASA Astrophysics Data System (ADS)

    Enea, Vincenzo; Ellis, Joan; Zavala, Fidel; Arnot, David E.; Asavanich, Achara; Masuda, Aoi; Quakyi, Isabella; Nussenzweig, Ruth S.

    1984-08-01

    A clone of complementary DNA encoding the circumsporozoite (CS) protein of the human malaria parasite Plasmodium falciparum has been isolated by screening an Escherichia coli complementary DNA library with a monoclonal antibody to the CS protein. The DNA sequence of the complementary DNA insert encodes a four-amino acid sequence, proline-asparagine-alanine-asparagine, tandemly repeated 23 times. The CS β-lactamase fusion protein specifically binds monoclonal antibodies to the CS protein and inhibits the binding of these antibodies to native Plasmodium falciparum CS protein. These findings provide a basis for the development of a vaccine against Plasmodium falciparum malaria.

  7. Method for high-volume sequencing of nucleic acids: random and directed priming with libraries of oligonucleotides

    DOEpatents

    Studier, F.W.

    1995-04-18

    Random and directed priming methods for determining nucleotide sequences by enzymatic sequencing techniques, using libraries of primers of lengths 8, 9 or 10 bases, are disclosed. These methods permit direct sequencing of nucleic acids as large as 45,000 base pairs or larger without the necessity for subcloning. Individual primers are used repeatedly to prime sequence reactions in many different nucleic acid molecules. Libraries containing as few as 10,000 octamers, 14,200 nonamers, or 44,000 decamers would have the capacity to determine the sequence of almost any cosmid DNA. Random priming with a fixed set of primers from a smaller library can also be used to initiate the sequencing of individual nucleic acid molecules, with the sequence being completed by directed priming with primers from the library. In contrast to random cloning techniques, a combined random and directed priming strategy is far more efficient. 2 figs.
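
    To put the quoted library sizes in context, the following minimal sketch counts every possible primer of each length (4^k sequences over A, C, G and T) and reports what fraction of that space each library would cover. The function names are illustrative and do not come from the patent; only the three library sizes are taken from the abstract.

```python
# Library sizes quoted in the abstract: primer length -> number of primers.
LIBRARY_SIZES = {8: 10_000, 9: 14_200, 10: 44_000}


def possible_primers(k: int) -> int:
    """Number of distinct DNA sequences of length k over A, C, G, T."""
    return 4 ** k


def library_fraction(k: int, library_size: int) -> float:
    """Fraction of all possible k-mers covered by a library of the given size."""
    return library_size / possible_primers(k)


if __name__ == "__main__":
    for k, size in LIBRARY_SIZES.items():
        total = possible_primers(k)
        share = library_fraction(k, size)
        print(f"k={k}: {total:,} possible primers; library of {size:,} covers {share:.1%}")
```

    Under this simple count, each quoted library is a small subset of all possible primers of its length (roughly 15% of octamers, and about 5% or less of nonamers and decamers).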

  8. Method for high-volume sequencing of nucleic acids: random and directed priming with libraries of oligonucleotides

    DOEpatents

    Studier, F. William

    1995-04-18

    Random and directed priming methods for determining nucleotide sequences by enzymatic sequencing techniques, using libraries of primers of lengths 8, 9 or 10 bases, are disclosed. These methods permit direct sequencing of nucleic acids as large as 45,000 base pairs or larger without the necessity for subcloning. Individual primers are used repeatedly to prime sequence reactions in many different nucleic acid molecules. Libraries containing as few as 10,000 octamers, 14,200 nonamers, or 44,000 decamers would have the capacity to determine the sequence of almost any cosmid DNA. Random priming with a fixed set of primers from a smaller library can also be used to initiate the sequencing of individual nucleic acid molecules, with the sequence being completed by directed priming with primers from the library. In contrast to random cloning techniques, a combined random and directed priming strategy is far more efficient.

  9. Predicting candidate genomic sequences that correspond to synthetic functional RNA motifs

    PubMed Central

    Laserson, Uri; Gan, Hin Hark; Schlick, Tamar

    2005-01-01

    Riboswitches and RNA interference are important emerging mechanisms found in many organisms to control gene expression. To enhance our understanding of such RNA roles, finding small regulatory motifs in genomes presents a challenge on a wide scale. Many simple functional RNA motifs have been found by in vitro selection experiments, which produce synthetic target-binding aptamers as well as catalytic RNAs, including the hammerhead ribozyme. Motivated by the prediction of Piganeau and Schroeder [(2003) Chem. Biol., 10, 103–104] that synthetic RNAs may have natural counterparts, we develop and apply an efficient computational protocol for identifying aptamer-like motifs in genomes. We define motifs from the sequence and structural information of synthetic aptamers, search for sequences in genomes that will produce motif matches, and then evaluate the structural stability and statistical significance of the potential hits. Our application to aptamers for streptomycin, chloramphenicol, neomycin B and ATP identifies 37 candidate sequences (in coding and non-coding regions) that fold to the target aptamer structures in bacterial and archaeal genomes. Further energetic screening reveals that several candidates exhibit energetic properties and sequence conservation patterns that are characteristic of functional motifs. Besides providing candidates for experimental testing, our computational protocol offers an avenue for expanding natural RNA's functional repertoire. PMID:16254081

  10. Partial amino acid sequence of apolipoprotein(a) shows that it is homologous to plasminogen

    SciTech Connect

    Eaton, D.L.; Fless, G.M.; Kohr, W.J.; McLean, J.W.; Xu, Q.T.; Miller, C.G.; Lawn, R.M.; Scanu, A.M.

    1987-05-01

    Apolipoprotein(a) (apo(a)) is a glycoprotein with Mr approx. 280,000 that is disulfide linked to apolipoprotein B in lipoprotein(a) particles. Elevated plasma levels of lipoprotein(a) are correlated with atherosclerosis. Partial amino acid sequence of apo(a) shows that it has striking homology to plasminogen. Plasminogen is a plasma serine protease zymogen that consists of five homologous and tandemly repeated domains called kringles and a trypsin-like protease domain. The amino-terminal sequence obtained for apo(a) is homologous to the beginning of kringle 4 but not the amino terminus of plasminogen. Apo(a) was subjected to limited proteolysis by trypsin or V8 protease, and fragments generated were isolated and sequenced. Sequences obtained from several of these fragments are highly (77-100%) homologous to plasminogen residues 391-421, which reside within kringle 4. Analysis of these internal apo(a) sequences revealed that apo(a) may contain at least two kringle 4-like domains. A sequence obtained from another tryptic fragment also shows homology to the end of kringle 4 and the beginning of kringle 5. Sequence data obtained from the two tryptic fragments shows homology with the protease domain of plasminogen. One of these sequences is homologous to the sequences surrounding the activation site of plasminogen. Plasminogen is activated by the cleavage of a specific arginine residue by urokinase and tissue plasminogen activator; however, the corresponding site in apo(a) is a serine that would not be cleaved by tissue plasminogen activator or urokinase. Using a plasmin-specific assay, no proteolytic activity could be demonstrated for lipoprotein(a) particles. These results suggest that apo(a) contains kringle-like domains and an inactive protease domain.

  11. Integrating sequence stratigraphy and rock-physics to interpret seismic amplitudes and predict reservoir quality

    NASA Astrophysics Data System (ADS)

    Dutta, Tanima

    This dissertation focuses on the link between seismic amplitudes and reservoir properties. Prediction of reservoir properties, such as sorting, sand/shale ratio, and cement volume, from seismic amplitudes improves by integrating knowledge from multiple disciplines. The key contribution of this dissertation is to improve the prediction of reservoir properties by integrating sequence stratigraphy and rock physics. Sequence stratigraphy has been successfully used for qualitative interpretation of seismic amplitudes to predict reservoir properties. Rock physics modeling allows quantitative interpretation of seismic amplitudes. However, there is often uncertainty about selecting a geologically appropriate rock physics model and its input parameters away from the wells. In the present dissertation, we exploit the predictive power of sequence stratigraphy to extract the spatial trends of sedimentological parameters that control seismic amplitudes. These spatial trends of sedimentological parameters can serve as valuable constraints in rock physics modeling, especially away from the wells. Consequently, rock physics modeling, integrated with the trends from sequence stratigraphy, becomes useful for interpreting observed seismic amplitudes away from the wells in terms of underlying sedimentological parameters. We illustrate this methodology using a comprehensive dataset from channelized turbidite systems, deposited in minibasin settings offshore Equatorial Guinea, West Africa. First, we present a practical recipe for using closed-form expressions of effective medium models to predict seismic velocities in unconsolidated sandstones. We use an effective medium model that combines perfectly rough and smooth grains (the extended Walton model), and use that model to derive coordination number, porosity, and pressure relations for P and S wave velocities from experimental data. Our recipe provides reasonable fits to other experimental and borehole data, and specifically

  12. A Novel Method for Accurate Operon Predictions in All Sequenced Prokaryotes

    SciTech Connect

    Price, Morgan N.; Huang, Katherine H.; Alm, Eric J.; Arkin, Adam P.

    2004-12-01

    We combine comparative genomic measures and the distance separating adjacent genes to predict operons in 124 completely sequenced prokaryotic genomes. Our method automatically tailors itself to each genome using sequence information alone, and thus can be applied to any prokaryote. For Escherichia coli K12 and Bacillus subtilis, our method is 85 and 83% accurate, respectively, which is similar to the accuracy of methods that use the same features but are trained on experimentally characterized transcripts. In Halobacterium NRC-1 and in Helicobacter pylori, our method correctly infers that genes in operons are separated by shorter distances than they are in E. coli, and its predictions using distance alone are more accurate than distance-only predictions trained on a database of E. coli transcripts. We use microarray data from six phylogenetically diverse prokaryotes to show that combining intergenic distance with comparative genomic measures further improves accuracy and that our method is broadly effective. Finally, we survey operon structure across 124 genomes, and find several surprises: H. pylori has many operons, contrary to previous reports; Bacillus anthracis has an unusual number of pseudogenes within conserved operons; and Synechocystis PCC6803 has many operons even though it has unusually wide spacings between conserved adjacent genes.
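
    As a rough illustration of the distance feature only, the sketch below flags adjacent same-strand gene pairs whose intergenic gap falls under a cutoff. This is not the authors' method: the abstract describes a model that tailors itself to each genome and also uses comparative genomic measures, whereas the fixed `max_gap` of 50 bp, the `Gene` record and the function name here are all hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    name: str
    start: int   # 1-based start coordinate
    end: int     # 1-based inclusive end coordinate
    strand: str  # "+" or "-"


def predict_operon_pairs(genes, max_gap=50):
    """Return (left, right) names of adjacent same-strand genes whose gap is <= max_gap.

    max_gap is a hypothetical fixed cutoff used only for illustration.
    """
    ordered = sorted(genes, key=lambda g: g.start)
    pairs = []
    for left, right in zip(ordered, ordered[1:]):
        gap = right.start - left.end - 1  # negative when the genes overlap
        if left.strand == right.strand and gap <= max_gap:
            pairs.append((left.name, right.name))
    return pairs


if __name__ == "__main__":
    toy_genome = [
        Gene("a", 1, 100, "+"),
        Gene("b", 120, 300, "+"),
        Gene("c", 600, 900, "+"),
        Gene("d", 910, 1000, "-"),
    ]
    print(predict_operon_pairs(toy_genome))
```

    A per-genome approach would replace the fixed cutoff with a distance distribution estimated from the genome itself, which is the behavior the abstract attributes to the published method.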

  13. Accurate single-sequence prediction of solvent accessible surface area using local and global features

    PubMed Central

    Faraggi, Eshel; Zhou, Yaoqi; Kloczkowski, Andrzej

    2014-01-01

    We present a new approach for predicting the Accessible Surface Area (ASA) using a General Neural Network (GENN). The novelty of the new approach lies in not using residue mutation profiles generated by multiple sequence alignments as descriptive inputs. Instead we use solely sequential window information and global features such as single-residue and two-residue compositions of the chain. The resulting predictor is both much more efficient than sequence alignment-based predictors and of comparable accuracy to them. Introduction of the global inputs significantly helps achieve this comparable accuracy. The predictor, termed ASAquick, is tested on predicting the ASA of globular proteins and found to perform similarly well for so-called easy and hard cases, indicating generalizability and possible usability for de novo protein structure prediction. The source code and Linux executables for GENN and ASAquick are available from Research and Information Systems at http://mamiris.com, from the SPARKS Lab at http://sparks-lab.org, and from the Battelle Center for Mathematical Medicine at http://mathmed.org. PMID:25204636
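
    The global inputs named in the abstract, single-residue and two-residue compositions of the chain, can be sketched as plain frequency vectors. The code below is an assumption about what such features might look like (20 single-residue frequencies and 400 adjacent-pair frequencies); it is not taken from the ASAquick source, and the function names are illustrative.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def single_residue_composition(seq: str) -> list:
    """Frequency of each of the 20 residues in the chain (length-20 vector)."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]


def two_residue_composition(seq: str) -> list:
    """Frequency of each ordered pair of adjacent residues (length-400 vector)."""
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two residues")
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = len(seq) - 1
    return [pairs[a + b] / total for a, b in product(AMINO_ACIDS, repeat=2)]


if __name__ == "__main__":
    chain = "ACAC"
    print(single_residue_composition(chain)[:2])
    print(two_residue_composition(chain)[1])
```

    Because these vectors depend only on the chain itself, they can be computed without a multiple sequence alignment, which is consistent with the efficiency argument in the abstract.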

  14. Accurate single-sequence prediction of solvent accessible surface area using local and global features.

    PubMed

    Faraggi, Eshel; Zhou, Yaoqi; Kloczkowski, Andrzej

    2014-11-01

    We present a new approach for predicting the Accessible Surface Area (ASA) using a General Neural Network (GENN). The novelty of the new approach lies in not using residue mutation profiles generated by multiple sequence alignments as descriptive inputs. Instead we use solely sequential window information and global features such as single-residue and two-residue compositions of the chain. The resulting predictor is both much more efficient than sequence alignment-based predictors and of comparable accuracy to them. Introduction of the global inputs significantly helps achieve this comparable accuracy. The predictor, termed ASAquick, is tested on predicting the ASA of globular proteins and found to perform similarly well for so-called easy and hard cases, indicating generalizability and possible usability for de novo protein structure prediction. The source code and Linux executables for GENN and ASAquick are available from Research and Information Systems at http://mamiris.com, from the SPARKS Lab at http://sparks-lab.org, and from the Battelle Center for Mathematical Medicine at http://mathmed.org. PMID:25204636

  15. Depositional sequence analysis and sedimentologic modeling for improved prediction of Pennsylvanian reservoirs (Annex 1)

    SciTech Connect

    Watney, W.L.

    1992-01-01

    Interdisciplinary studies of the Upper Pennsylvanian Lansing and Kansas City groups have been undertaken in order to improve the geologic characterization of petroleum reservoirs and to develop a quantitative understanding of the processes responsible for formation of associated depositional sequences. To this end, concepts and methods of sequence stratigraphy are being used to define and interpret the three-dimensional depositional framework of the Kansas City Group. The investigation includes characterization of reservoir rocks in oil fields in western Kansas, description of analog equivalents in near-surface and surface sites in southeastern Kansas, and construction of regional structural and stratigraphic framework to link the site specific studies. Geologic inverse and simulation models are being developed to integrate quantitative estimates of controls on sedimentation to produce reconstructions of reservoir-bearing strata in an attempt to enhance our ability to predict reservoir characteristics.

  16. In Vitro and In Vivo Activities of Antimicrobial Peptides Developed Using an Amino Acid-Based Activity Prediction Method

    PubMed Central

    Wu, Xiaozhe; Wang, Zhenling; Li, Xiaolu; Fan, Yingzi; He, Gu; Wan, Yang; Yu, Chaoheng; Tang, Jianying; Li, Meng; Zhang, Xian; Zhang, Hailong; Xiang, Rong; Pan, Ying; Liu, Yan; Lu, Lian

    2014-01-01

    To design and discover new antimicrobial peptides (AMPs) with high levels of antimicrobial activity, a number of machine-learning methods and prediction methods have been developed. Here, we present a new prediction method that can identify novel AMPs that are highly similar in sequence to known peptides but offer improved antimicrobial activity along with lower host cytotoxicity. Using previously generated AMP amino acid substitution data, we developed an amino acid activity contribution matrix that contained an activity contribution value for each amino acid in each position of the model peptide. A series of AMPs were designed with this method. After evaluating the antimicrobial activities of these novel AMPs against both Gram-positive and Gram-negative bacterial strains, DP7 was chosen for further analysis. Compared to the parent peptide HH2, this novel AMP showed broad-spectrum, improved antimicrobial activity, and in a cytotoxicity assay it showed lower toxicity against human cells. The in vivo antimicrobial activity of DP7 was tested in a Staphylococcus aureus infection murine model. When inoculated and treated via intraperitoneal injection, DP7 reduced the bacterial load in the peritoneal lavage solution. Electron microscope imaging and the results indicated disruption of the S. aureus outer membrane by DP7. Our new prediction method can therefore be employed to identify AMPs possessing minor amino acid differences with improved antimicrobial activities, potentially increasing the therapeutic agents available to combat multidrug-resistant infections. PMID:24982064
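
    One way to read the "activity contribution matrix" is as a position-specific lookup table whose values are summed over a candidate peptide. The sketch below assumes that additive reading, which the abstract does not confirm; the matrix layout (one dictionary per position), the function names and all numeric values in the usage example are made up for illustration.

```python
def score_peptide(peptide: str, matrix: list) -> float:
    """Sum the per-position contribution of each residue in the peptide.

    matrix[i] maps a residue letter to its (hypothetical) contribution at position i;
    residues missing from a position's dictionary contribute 0.0.
    """
    if len(peptide) != len(matrix):
        raise ValueError("peptide length must match the number of matrix positions")
    return sum(matrix[i].get(residue, 0.0) for i, residue in enumerate(peptide))


def best_single_substitution(parent: str, matrix: list) -> tuple:
    """Return the parent or single-residue variant with the highest summed score."""
    best = (parent, score_peptide(parent, matrix))
    for i, column in enumerate(matrix):
        for residue in column:
            variant = parent[:i] + residue + parent[i + 1:]
            score = score_peptide(variant, matrix)
            if score > best[1]:
                best = (variant, score)
    return best


if __name__ == "__main__":
    # Made-up two-position matrix; real values would come from substitution data.
    toy_matrix = [{"K": 0.5, "R": 0.8}, {"W": 0.2, "A": -0.1}]
    print(best_single_substitution("KW", toy_matrix))
```

    Restricting the search to single substitutions keeps candidates highly similar in sequence to the parent peptide, in line with the design goal stated in the abstract.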

  17. The Complete Genome Sequence of the Lactic Acid Bacterium Lactococcus lactis ssp. lactis IL1403

    PubMed Central

    Bolotin, Alexander; Wincker, Patrick; Mauger, Stéphane; Jaillon, Olivier; Malarme, Karine; Weissenbach, Jean; Ehrlich, S. Dusko; Sorokin, Alexei

    2001-01-01

    Lactococcus lactis is a nonpathogenic AT-rich gram-positive bacterium closely related to the genus Streptococcus and is the most commonly used cheese starter. It is also the best-characterized lactic acid bacterium. We sequenced the genome of the laboratory strain IL1403, using a novel two-step strategy that comprises diagnostic sequencing of the entire genome and a shotgun polishing step. The genome contains 2,365,589 base pairs and encodes 2310 proteins, including 293 protein-coding genes belonging to six prophages and 43 insertion sequence (IS) elements. Nonrandom distribution of IS elements indicates that the chromosome of the sequenced strain may be a product of recent recombination between two closely related genomes. A complete set of late competence genes is present, indicating the ability of L. lactis to undergo DNA transformation. Genomic sequence revealed new possibilities for fermentation pathways and for aerobic respiration. It also indicated a horizontal transfer of genetic information from Lactococcus to gram-negative enteric bacteria of Salmonella-Escherichia group. [The sequence data described in this paper has been submitted to the GenBank data library under accession no. AE005176.] PMID:11337471

  18. On human disease-causing amino acid variants: statistical study of sequence and structural patterns

    PubMed Central

    Alexov, Emil

    2015-01-01

    Statistical analysis was carried out on a large set of naturally occurring human amino acid variations and it was demonstrated that there is a preference for some amino acid substitutions to be associated with diseases. At an amino acid sequence level, it was shown that the disease-causing variants frequently involve drastic changes of amino acid physico-chemical properties of proteins such as charge, hydrophobicity and geometry. Structural analysis of variants involved in diseases and of variants frequently observed in the human population showed similar trends: disease-causing variants tend to cause more changes of the hydrogen bond network and salt bridges as compared with harmless amino acid mutations. Analysis of thermodynamics data reported in the literature, both experimental and computational, indicated that disease-causing variants tend to destabilize proteins and their interactions, which prompted us to investigate the effects of amino acid mutations on large databases of experimentally measured energy changes in unrelated proteins. Although the experimental datasets were neither linked to diseases nor exclusive to human proteins, the observed trends were the same: amino acid mutations tend to destabilize proteins and their interactions. Bearing in mind that structural and thermodynamic properties are interrelated, it is pointed out that any large change in any of them is anticipated to cause a disease. PMID:25689729

  19. Self-sequencing of amino acids and origins of polyfunctional protocells.

    PubMed

    Fox, S W

    1984-01-01

    The primal role of the origins of proteins in molecular evolution is discussed. On the basis of this premise, the significance of the experimentally established self-sequencing of amino acids under simulated geological conditions is explained as due to the fact that the products are highly nonrandom and accordingly contain many kinds of information. When such thermal proteins are aggregated into laboratory protocells, an action that occurs readily, the resultant protocells also contain many kinds of information. Residue-by-residue order, enzymic activities, and lipid quality accordingly occur within each preparation of proteinoid (thermal protein). In this paper are reviewed briefly the phenomenon of self-sequencing of amino acids, its relationship to evolutionary processes, other significance of such self-ordering, and the experimental evidence for original polyfunctional protocells. PMID:6462684

  20. Self-Sequencing of Amino Acids and Origins of Polyfunctional Protocells

    NASA Astrophysics Data System (ADS)

    Fox, Sidney W.

    1984-12-01

    The primal role of the origins of proteins in molecular evolution is discussed. On the basis of this premise, the significance of the experimentally established self-sequencing of amino acids under simulated geological conditions is explained as due to the fact that the products are highly nonrandom and accordingly contain many kinds of information. When such thermal proteins are aggregated into laboratory protocells, an action that occurs readily, the resultant protocells also contain many kinds of information. Residue-by-residue order, enzymic activities, and lipid quality accordingly occur within each preparation of proteinoid (thermal protein). In this paper are reviewed briefly the phenomenon of self-sequencing of amino acids, its relationship to evolutionary processes, other significance of such self-ordering, and the experimental evidence for original polyfunctional protocells.

  1. Computer analysis between nucleotide and amino acid sequences of bean golden mosaic virus and those of maize streak, wheat dwarf, chloris striate mosaic, and beet curly top viruses.

    PubMed

    Ikegami, M

    1989-01-01

    Bean golden mosaic virus (BGMV) DNA 1 and 2 have little sequence homology with maize streak virus (MSV), wheat dwarf virus (WDV), and chloris striate mosaic virus (CSMV) DNAs. BGMV DNA 1 and beet curly top virus (BCTV) DNA are closely related, whereas BGMV DNA 2 and BCTV DNA are not related. Direct amino acid homologies of predicted proteins between BGMV ORFs and MSV ORFs, WDV ORFs or CSMV ORFs were 40-50%. BGMV 1L1 and BCTV L1, and BGMV 1L3 and BCTV L4 were highly conserved. The sequence TAATATTAC was detected in the loops of hairpin structures of 5 geminiviruses. PMID:2615677

  2. Prediction of enzyme function based on 3D templates of evolutionarily important amino acids

    PubMed Central

    Kristensen, David M; Ward, R Matthew; Lisewski, Andreas Martin; Erdin, Serkan; Chen, Brian Y; Fofanov, Viacheslav Y; Kimmel, Marek; Kavraki, Lydia E; Lichtarge, Olivier

    2008-01-01

    Background Structural genomics projects such as the Protein Structure Initiative (PSI) yield many new structures, but often these have no known molecular functions. One approach to recover this information is to use 3D templates – structure-function motifs that consist of a few functionally critical amino acids and may suggest functional similarity when geometrically matched to other structures. Since experimentally determined functional sites are not common enough to define 3D templates on a large scale, this work tests a computational strategy to select relevant residues for 3D templates. Results Based on evolutionary information and heuristics, an Evolutionary Trace Annotation (ETA) pipeline built templates for 98 enzymes, half taken from the PSI, and sought matches in a non-redundant structure database. On average each template matched 2.7 distinct proteins, of which 2.0 share the first three Enzyme Commission digits as the template's enzyme of origin. In many cases (61%) a single most likely function could be predicted as the annotation with the most matches, and in these cases such a plurality vote identified the correct function with 87% accuracy. ETA was also found to be complementary to sequence homology-based annotations. When matches are required to both geometrically match the 3D template and to be sequence homologs found by BLAST or PSI-BLAST, the annotation accuracy is greater than either method alone, especially in the region of lower sequence identity where homology-based annotations are least reliable. Conclusion These data suggest that knowledge of evolutionarily important residues improves functional annotation among distant enzyme homologs. Since, unlike other 3D template approaches, the ETA method bypasses the need for experimental knowledge of the catalytic mechanism, it should prove a useful, large scale, and general adjunct to combine with other methods to decipher protein function in the structural proteome. PMID:18190718
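
    The plurality vote mentioned in the Results can be sketched in a few lines: given the annotations of all structures matched by a template, return the one with the most matches, and return nothing when no single annotation leads. The function name and the tie-handling rule are assumptions made for illustration; the abstract only states that a single most likely function could be predicted in 61% of cases.

```python
from collections import Counter


def plurality_vote(match_annotations):
    """Return the annotation with strictly the most matches, or None if there is no unique winner.

    match_annotations is a list of labels, for example the first three
    Enzyme Commission digits of each matched protein.
    """
    ranked = Counter(match_annotations).most_common()
    if not ranked:
        return None  # the template matched nothing
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie for first place: no single most likely function
    return ranked[0][0]


if __name__ == "__main__":
    print(plurality_vote(["3.4.21", "3.4.21", "1.1.1"]))
    print(plurality_vote(["3.4.21", "1.1.1"]))
```

    Returning None on ties mirrors the idea that a prediction is made only when one annotation clearly dominates the matches.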

  3. Chemical genomic profiling via barcode sequencing to predict compound mode of action

    PubMed Central

    Piotrowski, Jeff S.; Simpkins, Scott W.; Li, Sheena C.; Deshpande, Raamesh; McIlwain, Sean; Ong, Irene; Myers, Chad L.; Boone, Charlie; Andersen, Raymond J.

    2015-01-01

    Summary Chemical genomics is an unbiased, whole-cell approach to characterizing novel compounds to determine mode of action and cellular target. Our version of this technique is built upon barcoded deletion mutants of Saccharomyces cerevisiae and has been adapted to a high-throughput methodology using next-generation sequencing. Here we describe the steps to generate a chemical genomic profile from a compound of interest, and how to use this information to predict molecular mechanism and targets of bioactive compounds. PMID:25618354

  4. Sequence of morphological transitions in two-dimensional pattern growth from aqueous ascorbic Acid solutions.

    PubMed

    Paranjpe, A S

    2002-08-12

    A sequence of morphological transitions in two-dimensional dehydration patterns of aqueous solutions of ascorbic acid is observed with humidity as a control parameter. Change in morphology occurs due to humidity induced variation in the concentration of the metastable supersaturated solution phase formed after initial solvent evaporation. As percent humidity is varied from 40 to 80, patterns change from compact circular --> radial --> density modulated radial (a new morphology) --> density modulated circular --> density modulated dendritic (a new morphology) --> dense branching. PMID:12190528

  5. Self-sequencing of amino acids and origins of polyfunctional protocells

    NASA Technical Reports Server (NTRS)

    Fox, S. W.

    1984-01-01

    The role of proteins in the origin of living things is discussed. It has been experimentally established that amino acids can sequence themselves under simulated geological conditions with highly nonrandom products which accordingly contain diverse information. Multiple copies of each type of macromolecule are formed, resulting in greater power for any protoenzymic molecule than would accrue from a single copy of each type. Thermal proteins are readily incorporated into laboratory protocells. The experimental evidence for original polyfunctional protocells is discussed.

  6. Snake venom. The amino acid sequence of protein A from Dendroaspis polylepis polylepis (black mamba) venom.

    PubMed

    Joubert, F J; Strydom, D J

    1980-12-01

    Protein A from Dendroaspis polylepis polylepis venom comprises 81 amino acids, including ten half-cystine residues. The complete primary structures of protein A and its variant A' were elucidated. The sequences of proteins A and A', which differ in a single position, show no homology with various neurotoxins and non-neurotoxic proteins and represent a new type of elapid venom protein. PMID:7461607

  7. Characterization of the microbial acid mine drainage microbial community using culturing and direct sequencing techniques.

    PubMed

    Auld, Ryan R; Myre, Maxine; Mykytczuk, Nadia C S; Leduc, Leo G; Merritt, Thomas J S

    2013-05-01

    We characterized the bacterial community from an AMD tailings pond using both classical culturing and modern direct sequencing techniques and compared the two methods. Acid mine drainage (AMD) is produced by the environmental and microbial oxidation of minerals dissolved from mining waste. Surprisingly, we know little about the microbial communities associated with AMD, despite the fundamental ecological roles of these organisms and large-scale economic impact of these waste sites. AMD microbial communities have classically been characterized by laboratory culturing-based techniques and more recently by direct sequencing of marker gene sequences, primarily the 16S rRNA gene. In our comparison of the techniques, we find that their results are complementary, overall indicating very similar community structure with similar dominant species, but with each method identifying some species that were missed by the other. We were able to culture the majority of species that our direct sequencing results indicated were present, primarily species within the Acidithiobacillus and Acidiphilium genera, although estimates of relative species abundance were only obtained from direct sequencing. Interestingly, our culture-based methods recovered four species that had been overlooked from our sequencing results because of the rarity of the marker gene sequences, likely members of the rare biosphere. Further, direct sequencing indicated that a single genus, completely missed in our culture-based study, Legionella, was a dominant member of the microbial community. Our results suggest that while either method does a reasonable job of identifying the dominant members of the AMD microbial community, together the methods combine to give a more complete picture of the true diversity of this environment. PMID:23485423

  8. 37 CFR 1.822 - Symbols and format to be used for nucleotide and/or amino acid sequence data.

    Code of Federal Regulations, 2011 CFR

    2011-07-01

    ... approved by the Director of the Federal Register in accordance with 5 U.S.C. 552(a) and 1 CFR part 51... base or modified or unusual amino acid may be presented in a given sequence as the corresponding unmodified base or amino acid if the modified base or modified or unusual amino acid is one of those...

  9. 37 CFR 1.822 - Symbols and format to be used for nucleotide and/or amino acid sequence data.

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... approved by the Director of the Federal Register in accordance with 5 U.S.C. 552(a) and 1 CFR part 51... base or modified or unusual amino acid may be presented in a given sequence as the corresponding unmodified base or amino acid if the modified base or modified or unusual amino acid is one of those...

  10. Nanopore Analysis of Nucleic Acids: Single-Molecule Studies of Molecular Dynamics, Structure, and Base Sequence

    NASA Astrophysics Data System (ADS)

    Olasagasti, Felix; Deamer, David W.

    Nucleic acids are linear polynucleotides in which each base is covalently linked to a pentose sugar and a phosphate group carrying a negative charge. If a pore having roughly the cross-sectional diameter of a single-stranded nucleic acid is embedded in a thin membrane and a voltage of 100 mV or more is applied, individual nucleic acids in solution can be captured by the electrical field in the pore and translocated through by single-molecule electrophoresis. The dimensions of the pore cannot accommodate anything larger than a single strand, so each base in the molecule passes through the pore in strict linear sequence. The nucleic acid strand occupies a large fraction of the pore's volume during translocation and therefore produces a transient blockade of the ionic current created by the applied voltage. If it could be demonstrated that each nucleotide in the polymer produced a characteristic modulation of the ionic current during its passage through the nanopore, the sequence of current modulations would reflect the sequence of bases in the polymer. According to this basic concept, nanopores are analogous to a Coulter counter that detects nanoscopic molecules rather than microscopic [1,2]. However, the advantage of nanopores is that individual macromolecules can be characterized because different chemical and physical properties affect their passage through the pore. Because macromolecules can be captured in the pore as well as translocated, the nanopore can be used to detect individual functional complexes that form between a nucleic acid and an enzyme. No other technique has this capability.

  11. Accurate ab initio prediction of NMR chemical shifts of nucleic acids and nucleic acids/protein complexes

    PubMed Central

    Victora, Andrea; Möller, Heiko M.; Exner, Thomas E.

    2014-01-01

    NMR chemical shift predictions based on empirical methods are nowadays indispensable tools during resonance assignment and 3D structure calculation of proteins. However, owing to the very limited statistical data basis, such methods are still in their infancy in the field of nucleic acids, especially when non-canonical structures and nucleic acid complexes are considered. Here, we present an ab initio approach for predicting proton chemical shifts of arbitrary nucleic acid structures based on state-of-the-art fragment-based quantum chemical calculations. We tested our prediction method on a diverse set of nucleic acid structures including double-stranded DNA, hairpins, DNA/protein complexes and chemically-modified DNA. Overall, our quantum chemical calculations yield highly/very accurate predictions with mean absolute deviations of 0.3–0.6 ppm and correlation coefficients (r²) usually above 0.9. This will allow for identifying misassignments and validating 3D structures. Furthermore, our calculations reveal that chemical shifts of protons involved in hydrogen bonding are predicted significantly less accurately. This is in part caused by insufficient inclusion of solvation effects. However, it also points toward shortcomings of current force fields used for structure determination of nucleic acids. Our quantum chemical calculations could therefore provide input for force field optimization. PMID:25404135
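
    The two accuracy measures quoted in this abstract, mean absolute deviation and r², can be computed as in the sketch below. The shift values are invented for illustration, and taking r² to be the squared Pearson correlation between predicted and reference shifts is an assumption; the abstract does not define it further.

```python
def mean_absolute_deviation(predicted, reference):
    """Average of |predicted - reference| over all protons, in ppm."""
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)


def r_squared(predicted, reference):
    """Squared Pearson correlation coefficient between the two series."""
    n = len(reference)
    mean_p = sum(predicted) / n
    mean_r = sum(reference) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    return cov * cov / (var_p * var_r)


# Made-up proton chemical shifts in ppm (not data from the paper).
predicted_ppm = [1.0, 2.0, 3.0]
reference_ppm = [1.5, 2.5, 3.5]
print(mean_absolute_deviation(predicted_ppm, reference_ppm))
print(r_squared(predicted_ppm, reference_ppm))
```

    The toy data show why both numbers are reported together: a constant offset of 0.5 ppm leaves the correlation perfect while the mean absolute deviation stays at 0.5 ppm.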

  12. Accurate ab initio prediction of NMR chemical shifts of nucleic acids and nucleic acids/protein complexes.

    PubMed

    Victora, Andrea; Möller, Heiko M; Exner, Thomas E

    2014-12-16

    NMR chemical shift predictions based on empirical methods are nowadays indispensable tools during resonance assignment and 3D structure calculation of proteins. However, owing to the very limited statistical data basis, such methods are still in their infancy in the field of nucleic acids, especially when non-canonical structures and nucleic acid complexes are considered. Here, we present an ab initio approach for predicting proton chemical shifts of arbitrary nucleic acid structures based on state-of-the-art fragment-based quantum chemical calculations. We tested our prediction method on a diverse set of nucleic acid structures including double-stranded DNA, hairpins, DNA/protein complexes and chemically-modified DNA. Overall, our quantum chemical calculations yield highly/very accurate predictions with mean absolute deviations of 0.3-0.6 ppm and correlation coefficients (r²) usually above 0.9. This will allow for identifying misassignments and validating 3D structures. Furthermore, our calculations reveal that chemical shifts of protons involved in hydrogen bonding are predicted significantly less accurately. This is in part caused by insufficient inclusion of solvation effects. However, it also points toward shortcomings of current force fields used for structure determination of nucleic acids. Our quantum chemical calculations could therefore provide input for force field optimization. PMID:25404135

  13. Extremely Acidophilic Protists from Acid Mine Drainage Host Rickettsiales-Lineage Endosymbionts That Have Intervening Sequences in Their 16S rRNA Genes

    PubMed Central

    Baker, Brett J.; Hugenholtz, Philip; Dawson, Scott C.; Banfield, Jillian F.

    2003-01-01

    During a molecular phylogenetic survey of extremely acidic (pH < 1), metal-rich acid mine drainage habitats in the Richmond Mine at Iron Mountain, Calif., we detected 16S rRNA gene sequences of a novel bacterial group belonging to the order Rickettsiales in the Alphaproteobacteria. The closest known relatives of this group (92% 16S rRNA gene sequence identity) are endosymbionts of the protist Acanthamoeba. Oligonucleotide 16S rRNA probes were designed and used to observe members of this group within acidophilic protists. To improve visualization of eukaryotic populations in the acid mine drainage samples, broad-specificity probes for eukaryotes were redesigned and combined to highlight this component of the acid mine drainage community. Approximately 4% of protists in the acid mine drainage samples contained endosymbionts. Measurements of internal pH of the protists showed that their cytosol is close to neutral, indicating that the endosymbionts may be neutrophilic. The endosymbionts had a conserved 273-nucleotide intervening sequence (IVS) in variable region V1 of their 16S rRNA genes. The IVS does not match any sequence in current databases, but the predicted secondary structure forms well-defined stem loops. IVSs are uncommon in rRNA genes and appear to be confined to bacteria living in close association with eukaryotes. Based on the phylogenetic novelty of the endosymbiont sequences and initial culture-independent characterization, we propose the name “Candidatus Captivus acidiprotistae.” To our knowledge, this is the first report of an endosymbiotic relationship in an extremely acidic habitat. PMID:12957940

  14. An integrative approach to predicting the functional effects of non-coding and coding sequence variation

    PubMed Central

    Shihab, Hashem A.; Rogers, Mark F.; Gough, Julian; Mort, Matthew; Cooper, David N.; Day, Ian N. M.; Gaunt, Tom R.; Campbell, Colin

    2015-01-01

    Motivation: Technological advances have enabled the identification of an increasingly large spectrum of single nucleotide variants within the human genome, many of which may be associated with monogenic disease or complex traits. Here, we propose an integrative approach, named FATHMM-MKL, to predict the functional consequences of both coding and non-coding sequence variants. Our method utilizes various genomic annotations, which have recently become available, and learns to weight the significance of each component annotation source. Results: We show that our method outperforms current state-of-the-art algorithms, CADD and GWAVA, when predicting the functional consequences of non-coding variants. In addition, FATHMM-MKL is comparable to the best of these algorithms when predicting the impact of coding variants. The method includes a confidence measure to rank order predictions. Availability and implementation: The FATHMM-MKL webserver is available at: http://fathmm.biocompute.org.uk Contact: H.Shihab@bristol.ac.uk or Mark.Rogers@bristol.ac.uk or C.Campbell@bristol.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online. PMID:25583119

  15. Ligand Similarity Complements Sequence, Physical Interaction, and Co-Expression for Gene Function Prediction

    PubMed Central

    Shoichet, Brian K.; Gillis, Jesse

    2016-01-01

    The expansion of protein-ligand annotation databases has enabled large-scale networking of proteins by ligand similarity. These ligand-based protein networks, which implicitly predict the ability of neighboring proteins to bind related ligands, may complement biologically-oriented gene networks, which are used to predict functional or disease relevance. To quantify the degree to which such ligand-based protein associations might complement functional genomic associations, including sequence similarity, physical protein-protein interactions, co-expression, and disease gene annotations, we calculated a network based on the Similarity Ensemble Approach (SEA: sea.docking.org), where protein neighbors reflect the similarity of their ligands. We also measured the similarity with functional genomic networks over a common set of 1,131 genes, and found that the networks had only small overlaps, which were significant only due to the large scale of the data. Consistent with the view that the networks contain different information, combining them substantially improved Molecular Function prediction within GO (from AUROC~0.63–0.75 for the individual data modalities to AUROC~0.8 in the aggregate). We investigated the boost in guilt-by-association gene function prediction when the networks are combined and describe underlying properties that can be further exploited. PMID:27467773

  16. The amino acid sequence of Lady Amherst's pheasant (Chrysolophus amherstiae) and golden pheasant (Chrysolophus pictus) egg-white lysozymes.

    PubMed

    Araki, T; Kuramoto, M; Torikata, T

    1990-09-01

    The amino acids of Lady Amherst's pheasant and golden pheasant egg-white lysozymes have been sequenced. The carboxymethylated lysozymes were digested with trypsin followed by sequencing of the tryptic peptides. Lady Amherst's pheasant lysozyme proved to consist of 129 amino acid residues, and a relative molecular mass of 14,423 Da was calculated. This lysozyme had 6 amino acid substitutions when compared with hen egg-white lysozyme: Phe3 to Tyr, His15 to Leu, Gln41 to His, Asn77 to His, Gln121 to Asn, and a newly found substitution of Ile124 to Thr. The amino acid sequence of golden pheasant lysozyme was identical to that of Lady Amherst's pheasant lysozyme. The phylogenetic tree constructed by the comparison of amino acid sequences of phasianoid bird lysozymes revealed a minimum genetic distance between these pheasants and the turkey-peafowl group. PMID:1368578

  17. Complete Genome Sequence of a thermotolerant sporogenic lactic acid bacterium, Bacillus coagulans strain 36D1

    PubMed Central

    Rhee, Mun Su; Moritz, Brélan E.; Xie, Gary; Glavina del Rio, T.; Dalin, E.; Tice, H.; Bruce, D.; Goodwin, L.; Chertkov, O.; Brettin, T.; Han, C.; Detter, C.; Pitluck, S.; Land, Miriam L.; Patel, Milind; Ou, Mark; Harbrucker, Roberta; Ingram, Lonnie O.; Shanmugam, K. T.

    2011-01-01

    Bacillus coagulans is a ubiquitous soil bacterium that grows at 50-55 °C and pH 5.0 and ferments various sugars that constitute plant biomass to L(+)-lactic acid. The ability of this sporogenic lactic acid bacterium to grow at 50-55 °C and pH 5.0 makes this organism an attractive microbial biocatalyst for production of optically pure lactic acid at industrial scale not only from glucose derived from cellulose but also from xylose, a major constituent of hemicellulose. This bacterium is also considered a potential probiotic. The complete genome sequence of a representative strain, B. coagulans strain 36D1, is presented and discussed. PMID:22675583

  18. Complete amino acid sequence of globin chains and biological activity of fragmented crocodile hemoglobin (Crocodylus siamensis).

    PubMed

    Srihongthong, Saowaluck; Pakdeesuwan, Anawat; Daduang, Sakda; Araki, Tomohiro; Dhiravisit, Apisak; Thammasirirak, Sompong

    2012-08-01

    Hemoglobin, α-chain, β-chain and fragmented hemoglobin of Crocodylus siamensis demonstrated both antibacterial and antioxidant activities. Antibacterial and antioxidant properties of the hemoglobin did not depend on the heme structure but could result from the compositions of amino acid residues and structures present in their primary structure. Furthermore, thirteen purified active peptides were obtained by RP-HPLC analyses, corresponding to fragments in the α-globin chain and the β-globin chain which are mostly located at the N-terminal and C-terminal parts. These active peptides operate on the bacterial cell membrane. The globin chains of Crocodylus siamensis showed amino acid sequences similar to those of Crocodylus niloticus. The novel amino acid substitutions of α-chain and β-chain are not associated with the heme binding site or the bicarbonate ion binding site, but could be important through their interactions with membranes of bacteria. PMID:22648692

  19. [Partial sequence homology of FtsZ in phylogenetics analysis of lactic acid bacteria].

    PubMed

    Zhang, Bin; Dong, Xiu-zhu

    2005-10-01

    FtsZ is a structurally conserved protein, which is universal among the prokaryotes. It plays a key role in prokaryote cell division. A partial fragment of the ftsZ gene about 800 bp in length was amplified and sequenced, and a partial FtsZ protein phylogenetic tree for the lactic acid bacteria was constructed. By comparing the FtsZ phylogenetic tree with the 16S rDNA tree, it was shown that the two trees were similar in topology. Both trees revealed that Pediococcus spp. were closely related to the L. casei group of Lactobacillus spp., but less related to other lactic acid cocci such as Enterococcus and Streptococcus. The results also showed that the discriminative power of FtsZ was higher than that of 16S rDNA at both the inter-species and inter-genus levels, and that FtsZ could be a very useful tool in species identification of lactic acid bacteria. PMID:16342751

  20. Predicting most probable conformations of a given peptide sequence in the random coil state.

    PubMed

    Bayrak, Cigdem Sevim; Erman, Burak

    2012-11-01

    In this work, we present a computational scheme for finding high probability conformations of peptides. The scheme calculates the probability of a given conformation of the given peptide sequence using the probability distribution of torsion states. Dependence of the states of a residue on the states of its first neighbors along the chain is considered. Prior probabilities of torsion states are obtained from a coil library. Posterior probabilities are calculated by the matrix multiplication Rotational Isomeric States Model of polymer theory. The conformation of a peptide with highest probability is determined by using a hidden Markov model Viterbi algorithm. First, the probability distribution of the torsion states of the residues is obtained. Using the highest probability torsion state, one can generate, step by step, states with lower probabilities. To validate the method, the highest probability state of residues in a given sequence is calculated and compared with probabilities obtained from the Coil Databank. Predictions based on the method are 32% better than predictions based on the most probable states of residues. The ensemble of "n" high probability conformations of a given protein is also determined using the Viterbi algorithm with multistep backtracking. PMID:22955874
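
    The Viterbi step mentioned in this abstract can be illustrated with a small sketch. It assumes a toy chain with two hypothetical torsion states, "A" and "B", made-up probabilities, and one transition table shared by all residue pairs; it does not reproduce the authors' coil-library priors or their Rotational Isomeric States calculation.

```python
import math


def viterbi_path(start, trans, length):
    """Most probable state sequence of a first-order chain of `length` residues.

    start[s]    : probability that the first residue is in state s
    trans[p][s] : probability of state s given that the previous residue is in p
    All probabilities must be positive because the search works in log space.
    """
    states = list(start)
    # best[s] is the log-probability of the best path that ends in state s.
    best = {s: math.log(start[s]) for s in states}
    back = []
    for _ in range(1, length):
        new_best, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[p] + math.log(trans[p][s]))
            new_best[s] = best[prev] + math.log(trans[prev][s])
            pointers[s] = prev
        best = new_best
        back.append(pointers)
    # Backtrack from the best final state to recover the whole path.
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, math.exp(best[last])


# Illustrative numbers only.
start = {"A": 0.6, "B": 0.4}
trans = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
path, prob = viterbi_path(start, trans, 3)
print(path, prob)
```

    Because each residue's state depends only on its neighbor, the search cost grows linearly with chain length rather than exponentially, which is what makes enumerating high-probability conformations tractable.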

  1. The Ising model for prediction of disordered residues from protein sequence alone

    NASA Astrophysics Data System (ADS)

    Lobanov, Michail Yu; Galzitskaya, Oxana V.

    2011-06-01

    Intrinsically disordered regions serve as molecular recognition elements, which play an important role in the control of many cellular processes and signaling pathways. It is useful to be able to predict positions of disordered residues and disordered regions in protein chains using protein sequence alone. A new method (IsUnstruct) based on the Ising model for prediction of disordered residues from protein sequence alone has been developed. According to this model, each residue can be in one of two states: ordered or disordered. The model is an approximation of the Ising model in which the interaction term between neighbors has been replaced by a penalty for changing between states (the energy of border). The IsUnstruct has been compared with other available methods and found to perform well. The method correctly finds 77% of disordered residues as well as 87% of ordered residues in the CASP8 database, and 72% of disordered residues as well as 85% of ordered residues in the DisProt database.
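
    The two-state model with a border penalty described in this abstract can be sketched as a small dynamic program. This is an illustration under stated assumptions, not the IsUnstruct implementation: the per-residue energies are invented, the ordered state is taken as the zero of energy, and the sketch returns the single minimum-energy labelling, which may differ from how the published method derives its per-residue output.

```python
def min_energy_labels(disorder_energy, border_penalty):
    """Label each residue 'O' (ordered) or 'D' (disordered) with minimum total energy.

    disorder_energy[i] : hypothetical energy of residue i when disordered
                         (ordered residues contribute 0)
    border_penalty     : energy added at every O/D boundary along the chain
    """
    if not disorder_energy:
        return ""
    # cost[s] is the minimum energy of residues 0..i with residue i in state s.
    cost = {"O": 0.0, "D": disorder_energy[0]}
    back = []
    for energy in disorder_energy[1:]:
        new_cost, pointers = {}, {}
        for state, own in (("O", 0.0), ("D", energy)):
            other = "D" if state == "O" else "O"
            stay = cost[state]
            switch = cost[other] + border_penalty
            if stay <= switch:
                new_cost[state], pointers[state] = stay + own, state
            else:
                new_cost[state], pointers[state] = switch + own, other
        cost = new_cost
        back.append(pointers)
    # Backtrack from the cheaper final state.
    state = min(cost, key=cost.get)
    labels = [state]
    for pointers in reversed(back):
        state = pointers[state]
        labels.append(state)
    labels.reverse()
    return "".join(labels)


# Invented energies: negative values favor disorder.
print(min_energy_labels([-1.0, -1.0, 0.5, -1.0, 2.0, 2.0], border_penalty=1.0))
```

    The penalty is what suppresses isolated flips: the third residue slightly prefers order on its own, but paying for two borders would cost more than keeping it inside the disordered stretch.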

  2. Combining sequence-based prediction methods and circular dichroism and infrared spectroscopic data to improve protein secondary structure determinations

    PubMed Central

    Lees, Jonathan G; Janes, Robert W

    2008-01-01

    Background A number of sequence-based methods exist for protein secondary structure prediction. Protein secondary structures can also be determined experimentally from circular dichroism and infrared spectroscopic data using empirical analysis methods. It has been proposed that comparable accuracy can be obtained from sequence-based predictions as from these biophysical measurements. Here we have examined the secondary structure determination accuracies of sequence prediction methods with the empirically determined values from the spectroscopic data on datasets of proteins for which both crystal structures and spectroscopic data are available. Results In this study we show that the sequence prediction methods have accuracies nearly comparable to those of spectroscopic methods. However, we also demonstrate that combining the spectroscopic and sequence techniques produces significant overall improvements in secondary structure determinations. In addition, combining the extra information content available from synchrotron radiation circular dichroism data with sequence methods also shows improvements. Conclusion Combining sequence prediction with experimentally determined spectroscopic methods for protein secondary structure content significantly enhances the accuracy of the overall results obtained. PMID:18197968

  3. CoRAL: predicting non-coding RNAs from small RNA-sequencing data.

    PubMed

    Leung, Yuk Yee; Ryvkin, Paul; Ungar, Lyle H; Gregory, Brian D; Wang, Li-San

    2013-08-01

    The surprising observation that virtually the entire human genome is transcribed means we know little about the function of many emerging classes of RNAs, except their astounding diversities. Traditional RNA function prediction methods rely on sequence or alignment information, which are limited in their abilities to classify the various collections of non-coding RNAs (ncRNAs). To address this, we developed Classification of RNAs by Analysis of Length (CoRAL), a machine learning-based approach for classification of RNA molecules. CoRAL uses biologically interpretable features including fragment length and cleavage specificity to distinguish between different ncRNA populations. We evaluated CoRAL using genome-wide small RNA sequencing data sets from four human tissue types and were able to classify six different types of RNAs with ∼80% cross-validation accuracy. Analysis by CoRAL revealed that microRNAs, small nucleolar and transposon-derived RNAs are highly discernible and consistent across all human tissue types assessed, whereas long intergenic ncRNAs, small cytoplasmic RNAs and small nuclear RNAs show less consistent patterns. The ability to reliably annotate loci across tissue types demonstrates the potential of CoRAL to characterize ncRNAs using small RNA sequencing data in less well-characterized organisms. PMID:23700308

  4. Comparative characterization of random-sequence proteins consisting of 5, 12, and 20 kinds of amino acids.

    PubMed

    Tanaka, Junko; Doi, Nobuhide; Takashima, Hideaki; Yanagawa, Hiroshi

    2010-04-01

    Screening of functional proteins from a random-sequence library has been used to evolve novel proteins in the field of evolutionary protein engineering. However, random-sequence proteins consisting of the 20 natural amino acids tend to aggregate, and the occurrence rate of functional proteins in a random-sequence library is low. From the viewpoint of the origin of life, it has been proposed that primordial proteins consisted of a limited set of amino acids that could have been abundantly formed early during chemical evolution. We have previously found that members of a random-sequence protein library constructed with five primitive amino acids show high solubility (Doi et al., Protein Eng Des Sel 2005;18:279-284). Although such a library is expected to be appropriate for finding functional proteins, the functionality may be limited, because they have no positively charged amino acid. Here, we constructed three libraries of 120-amino acid, random-sequence proteins using alphabets of 5, 12, and 20 amino acids by preselection using mRNA display (to eliminate sequences containing stop codons and frameshifts) and characterized and compared the structural properties of random-sequence proteins arbitrarily chosen from these libraries. We found that random-sequence proteins constructed with the 12-member alphabet (including five primitive amino acids and positively charged amino acids) have higher solubility than those constructed with the 20-member alphabet, though other biophysical properties are very similar in the two libraries. Thus, a library of moderate complexity constructed from 12 amino acids may be a more appropriate resource for functional screening than one constructed from 20 amino acids. PMID:20162614

  5. Structure prediction and evolution of a halo-acid dehalogenase of Burkholderia mallei

    PubMed Central

    Rai, Alok R; Singh, Raghvendra Pratap; Srivastava, Alok Kumar; Dubey, Ramesh Chandra

    2012-01-01

    Environmental pollutants containing halogenated organic compounds, e.g. haloacids, can cause a plethora of health problems. The structural and functional analysis of the gene responsible for their degradation is an important aspect of environmental studies and is important to human well-being. It has been shown that some haloacids are toxic and mutagenic. Microorganisms capable of degrading these haloacids can be found in the natural environment. One of these, a soil-borne Burkholderia mallei, possesses the ability to grow on monobromoacetate (MBA). This bacterium produces a haloacid dehalogenase that allows the cell to grow on MBA, a highly toxic and mutagenic environmental pollutant. For the structural and functional analysis, a 346-amino-acid protein sequence of haloacid dehalogenase was retrieved from the NCBI database. Primary and secondary structure analysis suggested that the high percentage of helices in the structure makes the protein more flexible for folding, which might increase protein interactions. The consensus protein sub-cellular localization predictions suggest that the dehalogenase is a periplasmic protein; the 3D2GO server suggests that it is mainly employed in metabolic process, followed by hydrolase activity and catalytic activity. The tertiary structure of the protein was predicted by homology modeling. The result suggests that the protein is unstable, which is also an important characteristic of active enzymes, enabling them to bind various cofactors and substrates for proper functioning. Validation of the 3D structure was done using the Ramachandran plot, ProSA-web and RMSD score. This predicted information will help in better understanding the mechanism underlying the haloacid dehalogenase encoding protein and its evolutionary relationship. PMID:23251046

  6. N-Terminal Amino Acid Sequence Determination of Proteins by N-Terminal Dimethyl Labeling: Pitfalls and Advantages When Compared with Edman Degradation Sequence Analysis.

    PubMed

    Chang, Elizabeth; Pourmal, Sergei; Zhou, Chun; Kumar, Rupesh; Teplova, Marianna; Pavletich, Nikola P; Marians, Kenneth J; Erdjument-Bromage, Hediye

    2016-07-01

    In recent history, alternative approaches to Edman sequencing have been investigated, and to this end, the Association of Biomolecular Resource Facilities (ABRF) Protein Sequencing Research Group (PSRG) initiated studies in 2014 and 2015, looking into bottom-up and top-down N-terminal (Nt) dimethyl derivatization of standard quantities of intact proteins with the aim to determine Nt sequence information. We have expanded this initiative and used low picomole amounts of myoglobin to determine the efficiency of Nt-dimethylation. Application of this approach on protein domains, generated by limited proteolysis of overexpressed proteins, confirms that it is a universal labeling technique and is very sensitive when compared with Edman sequencing. Finally, we compared Edman sequencing and Nt-dimethylation of the same polypeptide fragments; results confirm that there is agreement in the identity of the Nt amino acid sequence between these 2 methods. PMID:27006647

  7. N-Terminal Amino Acid Sequence Determination of Proteins by N-Terminal Dimethyl Labeling: Pitfalls and Advantages When Compared with Edman Degradation Sequence Analysis

    PubMed Central

    Chang, Elizabeth; Pourmal, Sergei; Zhou, Chun; Kumar, Rupesh; Teplova, Marianna; Pavletich, Nikola P.; Marians, Kenneth J.

    2016-01-01

    In recent history, alternative approaches to Edman sequencing have been investigated, and to this end, the Association of Biomolecular Resource Facilities (ABRF) Protein Sequencing Research Group (PSRG) initiated studies in 2014 and 2015, looking into bottom-up and top-down N-terminal (Nt) dimethyl derivatization of standard quantities of intact proteins with the aim to determine Nt sequence information. We have expanded this initiative and used low picomole amounts of myoglobin to determine the efficiency of Nt-dimethylation. Application of this approach on protein domains, generated by limited proteolysis of overexpressed proteins, confirms that it is a universal labeling technique and is very sensitive when compared with Edman sequencing. Finally, we compared Edman sequencing and Nt-dimethylation of the same polypeptide fragments; results confirm that there is agreement in the identity of the Nt amino acid sequence between these 2 methods. PMID:27006647

  8. Partial amino acid sequence of fructose-1,6-bisphosphatase from the blue-green algae Synechococcus leopoliensis.

    PubMed

    Marcus, F; Latshaw, S P; Steup, M; Gerbling, K P

    1989-08-01

    Purified fructose-1,6-bisphosphatase from the cyanobacterium Synechococcus leopoliensis was S-carboxymethylated and cleaved with trypsin. The resulting peptides were purified by reversed-phase high performance liquid chromatography and the amino acid sequence of six of the purified peptides was determined by gas-phase microsequencing. The results revealed sequence homology with other fructose-1,6-bisphosphatases. The obtained sequence data provides information required for the design of oligonucleotide hybridization probes to screen existing libraries of cyanobacterial DNA. The determination of the amino acid sequence of cyanobacterial proteins may yield important information with respect to the endosymbiotic theory of evolution. PMID:2550924

  9. Protein sequence analysis by incorporating modified chaos game and physicochemical properties into Chou's general pseudo amino acid composition.

    PubMed

    Xu, Chunrui; Sun, Dandan; Liu, Shenghui; Zhang, Yusen

    2016-10-01

    In this contribution we introduced a novel graphical method to compare protein sequences. By mapping a protein sequence into 3D space based on codons and physicochemical properties of 20 amino acids, we are able to get a unique P-vector from the 3D curve. This approach is consistent with wobble theory of amino acids. We compute the distance between sequences by their P-vectors to measure similarities/dissimilarities among protein sequences. Finally, we use our method to analyze four datasets and get better results compared with previous approaches. PMID:27375218

  10. Improving amino-acid identification, fit and C(alpha) prediction using the Simplex method in automated model building.

    PubMed

    Romo, Tod D; Sacchettini, James C; Ioerger, Thomas R

    2006-11-01

    Automated methods for protein model building in X-ray crystallography typically use a two-phased approach that involves first modeling the protein backbone followed by building in the side chains. The latter phase requires the identification of the amino-acid side-chain type as well as fitting of the side-chain model into the observed electron density. While mistakes in identification of individual side chains are common for a number of reasons, sequence alignment can sometimes be used to correct errors by mapping fragments into the true (expected) amino-acid sequence and exploiting contiguity constraints among neighbors. However, side chains cannot always be confidently aligned; this depends on having sufficient accuracy in the initial calls. The recognition of amino-acid side-chains based on the surrounding pattern of electron density, whether by features, density correlation or free atoms, can be sensitive to inaccuracies in the coordinates of the predicted backbone C(alpha) atoms to which they are anchored. By incorporating a Nelder-Mead Simplex search into the side-chain identification and model-building routines of TEXTAL, it is demonstrated that this form of residue-by-residue rigid-body real-space refinement (in which the C(alpha) itself is allowed to shift) can improve the initial accuracy of side-chain selection by over 25% on average (from 25% average identity to 32% on a test set of five representative proteins, without corrections by sequence alignment). This improvement in amino-acid selection accuracy in TEXTAL is often sufficient to bring the pairwise amino-acid identity of chains in the model out of the so-called 'twilight zone' for sequence-alignment methods. When coupled with sequence alignment, use of the Simplex search yielded improvements in side-chain accuracy on average by over 13 percentage points (from 64 to 77%) and up to 38 percentage points (from 40 to 78%) in one case compared with using sequence alignment alone. PMID:17057345

  11. LncDisease: a sequence based bioinformatics tool for predicting lncRNA-disease associations

    PubMed Central

    Wang, Junyi; Ma, Ruixia; Ma, Wei; Chen, Ji; Yang, Jichun; Xi, Yaguang; Cui, Qinghua

    2016-01-01

    LncRNAs represent a large class of noncoding RNA molecules that have important functions and play key roles in a variety of human diseases. There is an urgent need to develop bioinformatics tools to gain insight into lncRNAs. This study developed a sequence-based bioinformatics method, LncDisease, to predict the lncRNA-disease associations based on the crosstalk between lncRNAs and miRNAs. Using LncDisease, we predicted the lncRNAs associated with breast cancer and hypertension. The breast-cancer-associated lncRNAs were studied in two breast tumor cell lines, MCF-7 and MDA-MB-231. The qRT-PCR results showed that 11 (91.7%) of the 12 predicted lncRNAs could be validated in both breast cancer cell lines. The hypertension-associated lncRNAs were further evaluated in human vascular smooth muscle cells (VSMCs) stimulated with angiotensin II (Ang II). The qRT-PCR results showed that 3 (75.0%) of the 4 predicted lncRNAs could be validated in Ang II-treated human VSMCs. In addition, we predicted 6 diseases associated with the lncRNA GAS5 and validated 4 (66.7%) of them by literature mining. These results greatly support the specificity and efficacy of LncDisease in the study of lncRNAs in human diseases. The LncDisease software is freely available on the Software Page: http://www.cuilab.cn/. PMID:26887819

  12. LncDisease: a sequence based bioinformatics tool for predicting lncRNA-disease associations.

    PubMed

    Wang, Junyi; Ma, Ruixia; Ma, Wei; Chen, Ji; Yang, Jichun; Xi, Yaguang; Cui, Qinghua

    2016-05-19

    LncRNAs represent a large class of noncoding RNA molecules that have important functions and play key roles in a variety of human diseases. There is an urgent need to develop bioinformatics tools to gain insight into lncRNAs. This study developed a sequence-based bioinformatics method, LncDisease, to predict the lncRNA-disease associations based on the crosstalk between lncRNAs and miRNAs. Using LncDisease, we predicted the lncRNAs associated with breast cancer and hypertension. The breast-cancer-associated lncRNAs were studied in two breast tumor cell lines, MCF-7 and MDA-MB-231. The qRT-PCR results showed that 11 (91.7%) of the 12 predicted lncRNAs could be validated in both breast cancer cell lines. The hypertension-associated lncRNAs were further evaluated in human vascular smooth muscle cells (VSMCs) stimulated with angiotensin II (Ang II). The qRT-PCR results showed that 3 (75.0%) of the 4 predicted lncRNAs could be validated in Ang II-treated human VSMCs. In addition, we predicted 6 diseases associated with the lncRNA GAS5 and validated 4 (66.7%) of them by literature mining. These results greatly support the specificity and efficacy of LncDisease in the study of lncRNAs in human diseases. The LncDisease software is freely available on the Software Page: http://www.cuilab.cn/. PMID:26887819

  13. JRC GMO-Amplicons: a collection of nucleic acid sequences related to genetically modified organisms.

    PubMed

    Petrillo, Mauro; Angers-Loustau, Alexandre; Henriksson, Peter; Bonfini, Laura; Patak, Alex; Kreysa, Joachim

    2015-01-01

    The DNA target sequence is the key element in designing detection methods for genetically modified organisms (GMOs). Unfortunately this information is frequently lacking, especially for unauthorized GMOs. In addition, patent sequences are generally poorly annotated, buried in complex and extensive documentation and hard to link to the corresponding GM event. Here, we present the JRC GMO-Amplicons, a database of amplicons collected by screening public nucleotide sequence databanks by in silico determination of PCR amplification with reference methods for GMO analysis. The European Union Reference Laboratory for Genetically Modified Food and Feed (EU-RL GMFF) provides these methods in the GMOMETHODS database to support enforcement of EU legislation and GM food/feed control. The JRC GMO-Amplicons database is composed of more than 240 000 amplicons, which can be easily accessed and screened through a web interface. To our knowledge, this is the first attempt at pooling and collecting publicly available sequences related to GMOs in food and feed. The JRC GMO-Amplicons supports control laboratories in the design and assessment of GMO methods, providing inter-alia in silico prediction of primers specificity and GM targets coverage. The new tool can assist the laboratories in the analysis of complex issues, such as the detection and identification of unauthorized GMOs. Notably, the JRC GMO-Amplicons database allows the retrieval and characterization of GMO-related sequences included in patents documentation. Finally, it can help annotating poorly described GM sequences and identifying new relevant GMO-related sequences in public databases. The JRC GMO-Amplicons is freely accessible through a web-based portal that is hosted on the EU-RL GMFF website. Database URL: http://gmo-crl.jrc.ec.europa.eu/jrcgmoamplicons/. PMID:26424080

  14. JRC GMO-Amplicons: a collection of nucleic acid sequences related to genetically modified organisms

    PubMed Central

    Petrillo, Mauro; Angers-Loustau, Alexandre; Henriksson, Peter; Bonfini, Laura; Patak, Alex; Kreysa, Joachim

    2015-01-01

    The DNA target sequence is the key element in designing detection methods for genetically modified organisms (GMOs). Unfortunately this information is frequently lacking, especially for unauthorized GMOs. In addition, patent sequences are generally poorly annotated, buried in complex and extensive documentation and hard to link to the corresponding GM event. Here, we present the JRC GMO-Amplicons, a database of amplicons collected by screening public nucleotide sequence databanks by in silico determination of PCR amplification with reference methods for GMO analysis. The European Union Reference Laboratory for Genetically Modified Food and Feed (EU-RL GMFF) provides these methods in the GMOMETHODS database to support enforcement of EU legislation and GM food/feed control. The JRC GMO-Amplicons database is composed of more than 240 000 amplicons, which can be easily accessed and screened through a web interface. To our knowledge, this is the first attempt at pooling and collecting publicly available sequences related to GMOs in food and feed. The JRC GMO-Amplicons supports control laboratories in the design and assessment of GMO methods, providing inter-alia in silico prediction of primers specificity and GM targets coverage. The new tool can assist the laboratories in the analysis of complex issues, such as the detection and identification of unauthorized GMOs. Notably, the JRC GMO-Amplicons database allows the retrieval and characterization of GMO-related sequences included in patents documentation. Finally, it can help annotating poorly described GM sequences and identifying new relevant GMO-related sequences in public databases. The JRC GMO-Amplicons is freely accessible through a web-based portal that is hosted on the EU-RL GMFF website. Database URL: http://gmo-crl.jrc.ec.europa.eu/jrcgmoamplicons/ PMID:26424080

  15. Nucleotide sequence of the phosphoglycerate kinase gene from the extreme thermophile Thermus thermophilus. Comparison of the deduced amino acid sequence with that of the mesophilic yeast phosphoglycerate kinase.

    PubMed Central

    Bowen, D; Littlechild, J A; Fothergill, J E; Watson, H C; Hall, L

    1988-01-01

    Using oligonucleotide probes derived from amino acid sequencing information, the structural gene for phosphoglycerate kinase from the extreme thermophile, Thermus thermophilus, was cloned in Escherichia coli and its complete nucleotide sequence determined. The gene consists of an open reading frame corresponding to a protein of 390 amino acid residues (calculated Mr 41,791) with an extreme bias for G or C (93.1%) in the codon third base position. Comparison of the deduced amino acid sequence with that of the corresponding mesophilic yeast enzyme indicated a number of significant differences. These are discussed in terms of the unusual codon bias and their possible role in enhanced protein thermal stability. PMID:3052437
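
    The third-position G/C bias reported above (93.1%) is a simple per-codon statistic. A minimal sketch of how such a figure could be computed is given below; it is an illustration only, not the authors' procedure. The function name `gc3_fraction` and the toy coding sequence are hypothetical, and the real Thermus thermophilus gene sequence is not reproduced here.

```python
def gc3_fraction(cds: str) -> float:
    """Return the fraction of complete codons whose third base is G or C.

    Assumes `cds` is an in-frame coding sequence; trailing bases that do
    not form a complete codon are ignored.
    """
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3  # drop any incomplete final codon
    if usable == 0:
        raise ValueError("need at least one complete codon")
    third_bases = seq[2:usable:3]  # every third base, starting at index 2
    gc_count = sum(base in "GC" for base in third_bases)
    return gc_count / len(third_bases)


# Hypothetical toy sequence (not from the paper): ATG GCC GAG CTG AAG TAA
toy_cds = "ATGGCCGAGCTGAAGTAA"
print(f"GC3 = {gc3_fraction(toy_cds):.3f}")
```

    Applied to a full open reading frame, a value near 0.93 would correspond to the bias described in the abstract.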

  16. A Novel Data Assimilation Methodology for Predicting Lithology Based on Sequence Labeling Algorithms

    NASA Astrophysics Data System (ADS)

    Park, E.; Jeong, J.; Han, W. S.; Kim, K. Y.

    2014-12-01

    A hidden Markov model (HMM) and a conditional random fields (CRFs) model for lithological predictions based on multiple geophysical well-logging data are derived for dealing with directional non-stationarity through bi-directional training and conditioning. The developed models were benchmarked against their conventional counterparts, and hypothetical boreholes with the corresponding synthetic geophysical data including artificial errors were employed. In the three test scenarios devised, the average fitness and unfitness values of the developed CRFs model and HMM are 0.84 and 0.071, and 0.81 and 0.084, respectively, while those of the conventional CRFs model and HMM are 0.78 and 0.091, and 0.77 and 0.099, respectively. Comparisons of their predictabilities show that the models designed for directional non-stationarity clearly perform better than the conventional models for all tested examples. Among them, the developed linear-chain CRFs model showed the best or close to the best performance with high predictability and a low training data requirement. Keywords: one-dimensional lithological characterization, sequence labeling algorithm, conditional random fields, hidden Markov model, borehole, geophysical well-logging data.

  17. Bacteria obtained from a sequencing batch reactor that are capable of growth on dehydroabietic acid.

    PubMed Central

    Mohn, W W

    1995-01-01

    Eleven isolates capable of growth on the resin acid dehydroabietic acid (DhA) were obtained from a sequencing batch reactor designed to treat a high-strength process stream from a paper mill. The isolates belonged to two groups, represented by strains DhA-33 and DhA-35, which were characterized. In the bioreactor, bacteria like DhA-35 were more abundant than those like DhA-33. The population in the bioreactor of organisms capable of growth on DhA was estimated to be 1.1 x 10(6) propagules per ml, based on a most-probable-number determination. Analysis of small-subunit rRNA partial sequences indicated that DhA-33 was most closely related to Sphingomonas yanoikuyae (Sab = 0.875) and that DhA-35 was most closely related to Zoogloea ramigera (Sab = 0.849). Both isolates additionally grew on other abietanes, i.e., abietic and palustric acids, but not on the pimaranes, pimaric and isopimaric acids. For DhA-33 and DhA-35 with DhA as the sole organic substrate, doubling times were 2.7 and 2.2 h, respectively, and growth yields were 0.30 and 0.25 g of protein per g of DhA, respectively. Glucose as a cosubstrate stimulated growth of DhA-33 on DhA and stimulated DhA degradation by the culture. Pyruvate as a cosubstrate did not stimulate growth of DhA-35 on DhA and reduced the specific rate of DhA degradation of the culture. DhA induced DhA and abietic acid degradation activities in both strains, and these activities were heat labile. Cell suspensions of both strains consumed DhA at a rate of 6 μmol mg of protein-1 h-1. (ABSTRACT TRUNCATED AT 250 WORDS) PMID:7793937

  18. Genomic-scale comparison of sequence- and structure-based methods of function prediction: Does structure provide additional insight?

    PubMed Central

    Fetrow, Jacquelyn S.; Siew, Naomi; Di Gennaro, Jeannine A.; Martinez-Yamout, Maria; Dyson, H. Jane; Skolnick, Jeffrey

    2001-01-01

    A function annotation method using the sequence-to-structure-to-function paradigm is applied to the identification of all disulfide oxidoreductases in the Saccharomyces cerevisiae genome. The method identifies 27 sequences as potential disulfide oxidoreductases. All previously known thioredoxins, glutaredoxins, and disulfide isomerases are correctly identified. Three of the 27 predictions are probable false-positives. Three novel predictions, which subsequently have been experimentally validated, are presented. Two additional novel predictions suggest a disulfide oxidoreductase regulatory mechanism for two subunits (OST3 and OST6) of the yeast oligosaccharyltransferase complex. Based on homology, this prediction can be extended to a potential tumor suppressor gene, N33, in humans, whose biochemical function was not previously known. Attempts to obtain a folded, active N33 construct to test the prediction were unsuccessful. The results show that structure prediction coupled with biochemically relevant structural motifs is a powerful method for the function annotation of genome sequences and can provide more detailed, robust predictions than function prediction methods that rely on sequence comparison alone. PMID:11316881

  19. Prediction of multi-drug resistance transporters using a novel sequence analysis method [version 2; referees: 2 approved]

    DOE PAGESBeta

    McDermott, Jason E.; Bruillard, Paul; Overall, Christopher C.; Gosink, Luke; Lindemann, Stephen R.

    2015-03-09

    There are many examples of groups of proteins that have similar function, but the determinants of functional specificity may be hidden by lack of sequence similarity, or by large groups of similar sequences with different functions. Transporters are one such protein group in that the general function, transport, can be easily inferred from the sequence, but the substrate specificity can be impossible to predict from sequence with current methods. In this paper we describe a linguistic-based approach to identify functional patterns from groups of unaligned protein sequences and its application to predict multi-drug resistance transporters (MDRs) from bacteria. We first show that our method can recreate known patterns from PROSITE for several motifs from unaligned sequences. We then show that the method, MDRpred, can predict MDRs with greater accuracy and positive predictive value than a collection of currently available family-based models from the Pfam database. Finally, we apply MDRpred to a large collection of protein sequences from an environmental microbiome study to make novel predictions about drug resistance in a potential environmental reservoir.

  20. Prediction of multi-drug resistance transporters using a novel sequence analysis method [version 2; referees: 2 approved]

    SciTech Connect

    McDermott, Jason E.; Bruillard, Paul; Overall, Christopher C.; Gosink, Luke; Lindemann, Stephen R.

    2015-03-09

    There are many examples of groups of proteins that have similar function, but the determinants of functional specificity may be hidden by lack of sequence similarity, or by large groups of similar sequences with different functions. Transporters are one such protein group in that the general function, transport, can be easily inferred from the sequence, but the substrate specificity can be impossible to predict from sequence with current methods. In this paper we describe a linguistic-based approach to identify functional patterns from groups of unaligned protein sequences and its application to predict multi-drug resistance transporters (MDRs) from bacteria. We first show that our method can recreate known patterns from PROSITE for several motifs from unaligned sequences. We then show that the method, MDRpred, can predict MDRs with greater accuracy and positive predictive value than a collection of currently available family-based models from the Pfam database. Finally, we apply MDRpred to a large collection of protein sequences from an environmental microbiome study to make novel predictions about drug resistance in a potential environmental reservoir.

  1. Predicting lipase types by improved Chou's pseudo-amino acid composition.

    PubMed

    Zhang, Guang-Ya; Li, Hong-Chun; Gao, Jia-Qiang; Fang, Bai-Shan

    2008-01-01

    By proposing an improved Chou's pseudo amino acid composition approach to extract the features of the sequences, a powerful predictor based on k-nearest neighbor was introduced to identify the types of lipases according to their sequences. To avoid redundancy and bias, demonstrations were performed on a dataset where none of the proteins has ≥25% sequence identity to any other. The overall success rate thus obtained by the 10-fold cross-validation test was over 90%, indicating that the improved Chou's pseudo amino acid composition might be a useful tool for extracting the features of protein sequences, or at least can play a complementary role to many of the other existing approaches. PMID:19075826

  2. Nucleic and amino acid sequences relating to a novel transketolase, and methods for the expression thereof

    DOEpatents

    Croteau, Rodney Bruce; Wildung, Mark Raymond; Lange, Bernd Markus; McCaskill, David G.

    2001-01-01

    cDNAs encoding 1-deoxyxylulose-5-phosphate synthase from peppermint (Mentha piperita) have been isolated and sequenced, and the corresponding amino acid sequences have been determined. Accordingly, isolated DNA sequences (SEQ ID NO:3, SEQ ID NO:5, SEQ ID NO:7) are provided which code for the expression of 1-deoxyxylulose-5-phosphate synthase from plants. In another aspect the present invention provides for isolated, recombinant DXPS proteins, such as the proteins having the sequences set forth in SEQ ID NO:4, SEQ ID NO:6 and SEQ ID NO:8. In other aspects, replicable recombinant cloning vehicles are provided which code for plant 1-deoxyxylulose-5-phosphate synthases, or for a base sequence sufficiently complementary to at least a portion of 1-deoxyxylulose-5-phosphate synthase DNA or RNA to enable hybridization therewith. In yet other aspects, modified host cells are provided that have been transformed, transfected, infected and/or injected with a recombinant cloning vehicle and/or DNA sequence encoding a plant 1-deoxyxylulose-5-phosphate synthase. Thus, systems and methods are provided for the recombinant expression of the aforementioned recombinant 1-deoxyxylulose-5-phosphate synthase that may be used to facilitate its production, isolation and purification in significant amounts. Recombinant 1-deoxyxylulose-5-phosphate synthase may be used to obtain expression or enhanced expression of 1-deoxyxylulose-5-phosphate synthase in plants in order to enhance the production of 1-deoxyxylulose-5-phosphate, or its derivatives such as isopentenyl diphosphate (IPP), or may be otherwise employed for the regulation or expression of 1-deoxyxylulose-5-phosphate synthase, or the production of its products.

  3. Novel method for PIK3CA mutation analysis: locked nucleic acid--PCR sequencing.

    PubMed

    Ang, Daphne; O'Gara, Rebecca; Schilling, Amy; Beadling, Carol; Warrick, Andrea; Troxell, Megan L; Corless, Christopher L

    2013-05-01

    Somatic mutations in PIK3CA are commonly seen in invasive breast cancer and several other carcinomas, occurring in three hotspots: codons 542 and 545 of exon 9 and in codon 1047 of exon 20. We designed a locked nucleic acid (LNA)-PCR sequencing assay to detect low levels of mutant PIK3CA DNA with attention to avoiding amplification of a pseudogene on chromosome 22 that has >95% homology to exon 9 of PIK3CA. We tested 60 FFPE breast DNA samples with known PIK3CA mutation status (48 cases had one or more PIK3CA mutations, and 12 were wild type) as identified by PCR-mass spectrometry. PIK3CA exons 9 and 20 were amplified in the presence or absence of LNA-oligonucleotides designed to bind to the wild-type sequences for codons 542, 545, and 1047, and partially suppress their amplification. LNA-PCR sequencing confirmed all 51 PIK3CA mutations; however, the mutation detection rate by standard Sanger sequencing was only 69% (35 of 51). Of the 12 PIK3CA wild-type cases, LNA-PCR sequencing detected three additional H1047R mutations in "normal" breast tissue and one E545K in usual ductal hyperplasia. Histopathological review of these three normal breast specimens showed columnar cell change in two (both with known H1047R mutations) and apocrine metaplasia in one. The novel LNA-PCR shows higher sensitivity than standard Sanger sequencing and did not amplify the known pseudogene. PMID:23541593

  4. Whole-Genome Sequencing Analysis Accurately Predicts Antimicrobial Resistance Phenotypes in Campylobacter spp.

    PubMed

    Zhao, S; Tyson, G H; Chen, Y; Li, C; Mukherjee, S; Young, S; Lam, C; Folster, J P; Whichard, J M; McDermott, P F

    2016-01-01

    The objectives of this study were to identify antimicrobial resistance genotypes for Campylobacter and to evaluate the correlation between resistance phenotypes and genotypes using in vitro antimicrobial susceptibility testing and whole-genome sequencing (WGS). A total of 114 Campylobacter species isolates (82 C. coli and 32 C. jejuni) obtained from 2000 to 2013 from humans, retail meats, and cecal samples from food production animals in the United States as part of the National Antimicrobial Resistance Monitoring System were selected for study. Resistance phenotypes were determined using broth microdilution of nine antimicrobials. Genomic DNA was sequenced using the Illumina MiSeq platform, and resistance genotypes were identified using assembled WGS sequences through blastx analysis. Eighteen resistance genes, including tet(O), blaOXA-61, catA, lnu(C), aph(2″)-Ib, aph(2″)-Ic, aph(2')-If, aph(2″)-Ig, aph(2″)-Ih, aac(6')-Ie-aph(2″)-Ia, aac(6')-Ie-aph(2″)-If, aac(6')-Im, aadE, sat4, ant(6'), aad9, aph(3')-Ic, and aph(3')-IIIa, and mutations in two housekeeping genes (gyrA and 23S rRNA) were identified. There was a high degree of correlation between phenotypic resistance to a given drug and the presence of one or more corresponding resistance genes. Phenotypic and genotypic correlation was 100% for tetracycline, ciprofloxacin/nalidixic acid, and erythromycin, and correlations ranged from 95.4% to 98.7% for gentamicin, azithromycin, clindamycin, and telithromycin. All isolates were susceptible to florfenicol, and no genes associated with florfenicol resistance were detected. There was a strong correlation (99.2%) between resistance genotypes and phenotypes, suggesting that WGS is a reliable indicator of resistance to the nine antimicrobial agents assayed in this study. WGS has the potential to be a powerful tool for antimicrobial resistance surveillance programs. PMID:26519386

  5. Using increment of diversity to predict mitochondrial proteins of malaria parasite: integrating pseudo-amino acid composition and structural alphabet.

    PubMed

    Chen, Ying-Li; Li, Qian-Zhong; Zhang, Li-Qing

    2012-04-01

    Due to the complexity of the Plasmodium falciparum (PF) genome, predicting mitochondrial proteins of PF is more difficult than for other species. In this study, using the n-peptide composition of reduced amino acid alphabet (RAAA) obtained from a structural alphabet named Protein Blocks as the feature parameter, the increment of diversity (ID) is first developed to predict mitochondrial proteins. By choosing the 1-peptide compositions on the N-terminal regions with 20 residues as the only input vector, the prediction performance achieves 86.86% accuracy with 0.69 Matthews correlation coefficient (MCC) by the jackknife test. Moreover, by combining with the hydropathy distribution along the protein sequence and several reduced amino acid alphabets, we achieved maximum MCC 0.82 with accuracy 92% in the jackknife test by using the developed ID model. When evaluated on an independent dataset, our method performs better than existing methods. The results indicate that the ID is a simple and efficient prediction method for mitochondrial proteins of malaria parasite. PMID:21191803
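
    The abstract above reports both accuracy and the Matthews correlation coefficient (MCC). The sketch below shows the standard definitions of the two measures for a binary classifier, computed from confusion-matrix counts. The counts used in the example are hypothetical and are not taken from the paper; the jackknife procedure and the ID model itself are not reproduced.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_cc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Hypothetical counts, for illustration only.
counts = dict(tp=40, tn=45, fp=5, fn=10)
print(f"accuracy = {accuracy(**counts):.3f}")
print(f"MCC      = {matthews_cc(**counts):.3f}")
```

    MCC is often reported alongside accuracy because it stays informative when the positive and negative classes differ greatly in size.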

  6. Bile acid sulfotransferase I from rat liver sulfates bile acids and 3-hydroxy steroids: purification, N-terminal amino acid sequence, and kinetic properties.

    PubMed

    Barnes, S; Buchina, E S; King, R J; McBurnett, T; Taylor, K B

    1989-04-01

    A bile acid:3'phosphoadenosine-5'phosphosulfate:sulfotransferase (BAST I) from adult female rat liver cytosol has been purified 157-fold by a two-step isolation procedure. The N-terminal amino acid sequence of the 30,000 subunit has been determined for the first 35 residues. The Vmax of purified BAST I is 18.7 nmol/min per mg protein with N-(3-hydroxy-5 beta-cholanoyl)glycine (glycolithocholic acid) as substrate, comparable to that of the corresponding purified human BAST (Chen, L-J., and I. H. Segel, 1985. Arch. Biochem. Biophys. 241: 371-379). BAST I activity has a broad pH optimum from 5.5-7.5. Although maximum activity occurs with 5 mM MgCl2, Mg2+ is not essential for BAST I activity. The greatest sulfotransferase activity and the highest substrate affinity is observed with bile acids or steroids that have a steroid nucleus containing a 3 beta-hydroxy group and a 5-6 double bond or a trans A-B ring junction. These substrates have normal hyperbolic initial velocity curves with substrate inhibition occurring above 5 microM. Of the saturated 5 beta-bile acids, those with a single 3-hydroxy group are the most active. The addition of a second hydroxy group at the 6- or 7-position eliminates more than 99% of the activity. In contrast, 3 alpha,12 alpha-dihydroxy-5 beta-cholan-24-oic acid (deoxycholic acid) is an excellent substrate. The initial velocity curves for glycolithocholic and deoxycholic acid conjugates are sigmoidal rather than hyperbolic, suggestive of an allosteric effect. Maximum activity is observed at 80 microM for glycolithocholic acid. All substrates, bile acids and steroids, are inhibited by the 5 beta-bile acid, 3-keto-5 beta-cholanoic acid. The data suggest that BAST I is the same protein as hydrosteroid sulfotransferase 2 (Marcus, C. J., et al. 1980. Anal. Biochem. 107: 296-304). PMID:2754334

  7. Complete Genome Sequence of the Amino Acid-Fermenting Clostridium propionicum X2 (DSM 1682)

    PubMed Central

    Poehlein, Anja; Schlien, Katja; Chowdhury, Nilanjan Pal; Gottschalk, Gerhard; Buckel, Wolfgang

    2016-01-01

    Clostridium propionicum is a strict anaerobic, Gram positive, rod-shaped bacterium that belongs to the clostridial cluster XIVb. The genome consists of one replicon (3.1 Mb) and harbors 2,936 predicted protein-encoding genes. The genome encodes all enzymes required for fermentation of the amino acids α-alanine, β-alanine, serine, threonine, and methionine. PMID:27081148

  8. Sequence-defined bioactive macrocycles via an acid-catalysed cascade reaction

    NASA Astrophysics Data System (ADS)

    Porel, Mintu; Thornlow, Dana N.; Phan, Ngoc N.; Alabi, Christopher A.

    2016-06-01

    Synthetic macrocycles derived from sequence-defined oligomers are a unique structural class whose ring size, sequence and structure can be tuned via precise organization of the primary sequence. Similar to peptides and other peptidomimetics, these well-defined synthetic macromolecules become pharmacologically relevant when bioactive side chains are incorporated into their primary sequence. In this article, we report the synthesis of oligothioetheramide (oligoTEA) macrocycles via a one-pot acid-catalysed cascade reaction. The versatility of the cyclization chemistry and modularity of the assembly process was demonstrated via the synthesis of >20 diverse oligoTEA macrocycles. Structural characterization via NMR spectroscopy revealed the presence of conformational isomers, which enabled the determination of local chain dynamics within the macromolecular structure. Finally, we demonstrate the biological activity of oligoTEA macrocycles designed to mimic facially amphiphilic antimicrobial peptides. The preliminary results indicate that macrocyclic oligoTEAs with just two-to-three cationic charge centres can elicit potent antibacterial activity against Gram-positive and Gram-negative bacteria.

  9. Unconventional amino acid sequence of the sun anemone (Stoichactis helianthus) polypeptide neurotoxin

    SciTech Connect

    Kem, W.; Dunn, B.; Parten, B.; Pennington, M.; Price, D.

    1986-05-01

    A 5000 dalton polypeptide neurotoxin (Sh-NI) purified by G50 Sephadex, P-cellulose, and SP-Sephadex chromatography was homogeneous by isoelectric focusing. Sh-NI was highly toxic to crayfish (LD50 0.6 μg/kg) but without effect upon mice at 15,000 μg/kg (i.p. injection). The reduced, ³H-carboxymethylated toxin and its fragments were subjected to automatic Edman degradation and the resulting PTH-amino acids were identified by HPLC, back hydrolysis, and scintillation counting. Peptides resulting from proteolytic (clostripain, staphylococcal protease) and chemical (tryptophan) cleavage were sequenced. The sequence is: AACKCDDEGPDIRTAPLTGTVDLGSCNAGWEKCASYYTIIADCCRKKK. This sequence differs considerably from the homologous Anemonia and Anthopleura toxins; many of the identical residues (6 half-cystines, G9, P10, R13, G19, G29, W30) are probably critical for folding rather than receptor recognition. However, the Sh-NI sequence closely resembles Radioanthus macrodactylus neurotoxin III and R. paumotensis II. The authors propose that Sh-NI and related Radioanthus toxins act upon a different site on the sodium channel.
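
    The residue positions quoted in this abstract can be checked directly against the printed sequence. The following is a minimal sketch, assuming 1-based numbering from the N-terminus; the names SH_NI, CONSERVED, residue_positions and check_residues are hypothetical helpers introduced here for illustration and are not part of the record.

```python
# Sequence copied verbatim from the abstract of record 9.
SH_NI = "AACKCDDEGPDIRTAPLTGTVDLGSCNAGWEKCASYYTIIADCCRKKK"

# Residues the abstract lists as identical to the homologous toxins
# (label, 1-based position). Numbering convention is an assumption.
CONSERVED = [("G", 9), ("P", 10), ("R", 13), ("G", 19), ("G", 29), ("W", 30)]


def residue_positions(sequence: str, residue: str) -> list[int]:
    """Return the 1-based positions at which `residue` occurs in `sequence`."""
    return [i for i, aa in enumerate(sequence, start=1) if aa == residue]


def check_residues(sequence: str, expected: list[tuple[str, int]]) -> dict[str, bool]:
    """Map labels such as 'G9' to True when the sequence has that residue there."""
    return {f"{aa}{pos}": sequence[pos - 1] == aa for aa, pos in expected}


if __name__ == "__main__":
    print("length:", len(SH_NI))
    print("half-cystine positions:", residue_positions(SH_NI, "C"))
    print("conserved residues:", check_residues(SH_NI, CONSERVED))
```

    Under that numbering the printed sequence has 48 residues, six cysteines (positions 3, 5, 26, 33, 43 and 44), and all six labelled residues at the stated positions, which agrees with the abstract.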

  10. Repeat sequence chromosome specific nucleic acid probes and methods of preparing and using

    DOEpatents

    Weier, H.U.G.; Gray, J.W.

    1995-06-27

    A primer directed DNA amplification method to isolate efficiently chromosome-specific repeated DNA wherein degenerate oligonucleotide primers are used is disclosed. The probes produced are a heterogeneous mixture that can be used with blocking DNA as a chromosome-specific staining reagent, and/or the elements of the mixture can be screened for high specificity, size and/or high degree of repetition among other parameters. The degenerate primers are sets of primers that vary in sequence but are substantially complementary to highly repeated nucleic acid sequences, preferably clustered within the template DNA, for example, pericentromeric alpha satellite repeat sequences. The template DNA is preferably chromosome-specific. Exemplary primers and probes are disclosed. The probes of this invention can be used to determine the number of chromosomes of a specific type in metaphase spreads, in germ line and/or somatic cell interphase nuclei, micronuclei and/or in tissue sections. Also provided is a method to select arbitrarily repeat sequence probes that can be screened for chromosome-specificity. 18 figs.

  11. Repeat sequence chromosome specific nucleic acid probes and methods of preparing and using

    DOEpatents

    Weier, Heinz-Ulrich G.; Gray, Joe W.

    1995-01-01

    A primer directed DNA amplification method to isolate efficiently chromosome-specific repeated DNA wherein degenerate oligonucleotide primers are used is disclosed. The probes produced are a heterogeneous mixture that can be used with blocking DNA as a chromosome-specific staining reagent, and/or the elements of the mixture can be screened for high specificity, size and/or high degree of repetition among other parameters. The degenerate primers are sets of primers that vary in sequence but are substantially complementary to highly repeated nucleic acid sequences, preferably clustered within the template DNA, for example, pericentromeric alpha satellite repeat sequences. The template DNA is preferably chromosome-specific. Exemplary primers and probes are disclosed. The probes of this invention can be used to determine the number of chromosomes of a specific type in metaphase spreads, in germ line and/or somatic cell interphase nuclei, micronuclei and/or in tissue sections. Also provided is a method to select arbitrarily repeat sequence probes that can be screened for chromosome-specificity.

  12. Detection of Nucleic Acids with Graphene Nanopores: Ab Initio Characterization of a Novel Sequencing Device

    NASA Astrophysics Data System (ADS)

    Nelson, Tammie; Zhang, Bo; Prezhdo, Oleg

    2010-03-01

    We report an ab initio study of the interaction of two nucleobases, cytosine and adenine, with a novel graphene nanopore device for detecting the base sequence of a single-stranded nucleic acid (ssDNA or RNA). The nucleobases were inserted into a pore in a graphene nanoribbon, and the electrical current and conductance spectra were calculated as functions of voltage applied across the nanoribbon. The conductance spectra and charge densities were analyzed in the presence of each nucleobase in the graphene nanopore. The results indicate that, due to significant differences in the conductance spectra, the proposed device has adequate sensitivity to discriminate between different nucleotides. Moreover, we show that the nucleotide conductance spectra are not affected by the orientation of the nucleotide inside the graphene nanopore. The proposed technique may be extremely useful for real applications in developing ultrafast, low cost DNA sequencing methods.

  13. SubMito-PSPCP: Predicting Protein Submitochondrial Locations by Hybridizing Positional Specific Physicochemical Properties with Pseudoamino Acid Compositions

    PubMed Central

    Yu, Yuan

    2013-01-01

    Knowing the submitochondrial location of a mitochondrial protein is an important step in understanding its function. We developed a new method for predicting protein submitochondrial locations by introducing a new concept: positional specific physicochemical properties. With the framework of general form pseudoamino acid compositions, our method used only about 100 features to represent protein sequences, which is much simpler than the existing methods. On the dataset of SubMito, our method achieved over 93% overall accuracy, with 98.60% for inner membrane, 93.90% for matrix, and 70.70% for outer membrane, which are comparable to all state-of-the-art methods. As our method can be used as a general method to upgrade all pseudoamino-acid-composition-based methods, it should be very useful in future studies. We implement our method as an online service: SubMito-PSPCP. PMID:24027753

  14. Electromyographic Patterns during Golf Swing: Activation Sequence Profiling and Prediction of Shot Effectiveness.

    PubMed

    Verikas, Antanas; Vaiciukynas, Evaldas; Gelzinis, Adas; Parker, James; Olsson, M Charlotte

    2016-01-01

    This study analyzes muscle activity, recorded in an eight-channel electromyographic (EMG) signal stream, during the golf swing using a 7-iron club and exploits information extracted from EMG dynamics to predict the success of the resulting shot. Muscles of the arm and shoulder on both the left and right sides, namely flexor carpi radialis, extensor digitorum communis, rhomboideus and trapezius, are considered for 15 golf players (∼5 shots each). The method using Gaussian filtering is outlined for EMG onset time estimation in each channel and activation sequence profiling. Shots of each player revealed a persistent pattern of muscle activation. Profiles were plotted and insights with respect to player effectiveness were provided. Inspection of EMG dynamics revealed a pair of highest peaks in each channel as the hallmark of golf swing, and a custom application of peak detection for automatic extraction of swing segment was introduced. Various EMG features, encompassing 22 feature sets, were constructed. Feature sets were used individually and also in decision-level fusion for the prediction of shot effectiveness. The prediction of the target attribute, such as club head speed or ball carry distance, was investigated using random forest as the learner in detection and regression tasks. Detection evaluates the personal effectiveness of a shot with respect to the player-specific average, whereas regression estimates the value of target attribute, using EMG features as predictors. Fusion after decision optimization provided the best results: the equal error rate in detection was 24.3% for the speed and 31.7% for the distance; the mean absolute percentage error in regression was 3.2% for the speed and 6.4% for the distance. Proposed EMG feature sets were found to be useful, especially when used in combination. Rankings of feature sets indicated statistics for muscle activity in both the left and right body sides, correlation-based analysis of EMG dynamics and features

  15. Electromyographic Patterns during Golf Swing: Activation Sequence Profiling and Prediction of Shot Effectiveness

    PubMed Central

    Verikas, Antanas; Vaiciukynas, Evaldas; Gelzinis, Adas; Parker, James; Olsson, M. Charlotte

    2016-01-01

    This study analyzes muscle activity, recorded in an eight-channel electromyographic (EMG) signal stream, during the golf swing using a 7-iron club and exploits information extracted from EMG dynamics to predict the success of the resulting shot. Muscles of the arm and shoulder on both the left and right sides, namely flexor carpi radialis, extensor digitorum communis, rhomboideus and trapezius, are considered for 15 golf players (∼5 shots each). The method using Gaussian filtering is outlined for EMG onset time estimation in each channel and activation sequence profiling. Shots of each player revealed a persistent pattern of muscle activation. Profiles were plotted and insights with respect to player effectiveness were provided. Inspection of EMG dynamics revealed a pair of highest peaks in each channel as the hallmark of golf swing, and a custom application of peak detection for automatic extraction of swing segment was introduced. Various EMG features, encompassing 22 feature sets, were constructed. Feature sets were used individually and also in decision-level fusion for the prediction of shot effectiveness. The prediction of the target attribute, such as club head speed or ball carry distance, was investigated using random forest as the learner in detection and regression tasks. Detection evaluates the personal effectiveness of a shot with respect to the player-specific average, whereas regression estimates the value of target attribute, using EMG features as predictors. Fusion after decision optimization provided the best results: the equal error rate in detection was 24.3% for the speed and 31.7% for the distance; the mean absolute percentage error in regression was 3.2% for the speed and 6.4% for the distance. Proposed EMG feature sets were found to be useful, especially when used in combination. Rankings of feature sets indicated statistics for muscle activity in both the left and right body sides, correlation-based analysis of EMG dynamics and features

  16. Morphological transformation of calcite crystal growth by prismatic "acidic" polypeptide sequences.

    SciTech Connect

    Kim, I; Giocondi, J L; Orme, C A; Collino, J; Evans, J S

    2007-02-13

    Many of the interesting mechanical and materials properties of the mollusk shell are thought to stem from the prismatic calcite crystal assemblies within this composite structure. It is now evident that proteins play a major role in the formation of these assemblies. Recently, a superfamily of 7 conserved prismatic layer-specific mollusk shell proteins, Asprich, were sequenced, and the 42 AA C-terminal sequence region of this protein superfamily was found to introduce surface voids or porosities on calcite crystals in vitro. Using AFM imaging techniques, we further investigate the effect that this 42 AA domain (Fragment-2) and its constituent subdomains, DEAD-17 and Acidic-2, have on the morphology and growth kinetics of calcite dislocation hillocks. We find that Fragment-2 adsorbs on terrace surfaces and pins acute steps, accelerates then decelerates the growth of obtuse steps, forms clusters and voids on terrace surfaces, and transforms calcite hillock morphology from a rhombohedral form to a rounded one. These results mirror yet are distinct from some of the earlier findings obtained for nacreous polypeptides. The subdomains Acidic-2 and DEAD-17 were found to accelerate then decelerate obtuse steps and induce oval rather than rounded hillock morphologies. Unlike DEAD-17, Acidic-2 does form clusters on terrace surfaces and exhibits stronger obtuse velocity inhibition effects than either DEAD-17 or Fragment-2. Interestingly, a 1:1 mixture of both subdomains induces an irregular polygonal morphology to hillocks, and exhibits the highest degree of acute step pinning and obtuse step velocity inhibition. This suggests that there is some interplay between subdomains within an intra (Fragment-2) or intermolecular (1:1 mixture) context, and sequence interplay phenomena may be employed by biomineralization proteins to exert net effects on crystal growth and morphology.

  17. Prediction of Protein Submitochondrial Locations by Incorporating Dipeptide Composition into Chou's General Pseudo Amino Acid Composition.

    PubMed

    Ahmad, Khurshid; Waris, Muhammad; Hayat, Maqsood

    2016-06-01

    The mitochondrion is the key organelle of the eukaryotic cell that provides energy for cellular activities. The submitochondrial locations of proteins play a crucial role in understanding biological processes such as energy metabolism, programmed cell death, and ionic homeostasis. Predicting submitochondrial locations through conventional methods is expensive and time consuming because of the large number of protein sequences generated in the last few decades. An automated model for identifying the submitochondrial locations of proteins is therefore highly desirable. In this regard, the current study was initiated to develop a fast, reliable, and accurate computational model. Various feature extraction methods such as dipeptide composition (DPC), Split Amino Acid Composition, and Composition and Translation were utilized. To overcome the issue of bias, the oversampling technique SMOTE was applied to balance the datasets. Several classification learners, including K-Nearest Neighbor, Probabilistic Neural Network, and support vector machine (SVM), were used. The jackknife test was applied to assess the performance of the classification algorithms on two benchmark datasets. Among the classification algorithms, SVM achieved the highest success rates in conjunction with the condensed feature space of DPC: 95.20% accuracy on dataset SML3-317 and 95.11% on dataset SML3-983. The empirical results revealed that the proposed model obtained the highest results reported so far in the literature. It is anticipated that the proposed model might be useful for future studies. PMID:26746980
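
    The abstract names dipeptide composition (DPC) as a feature but does not define it. The sketch below uses the commonly used definition, in which a protein is represented by the relative frequencies of the 400 ordered pairs of adjacent standard amino acids. Skipping pairs that contain non-standard letters and normalizing by the number of valid pairs are assumptions made here, and the names AMINO_ACIDS, DIPEPTIDES and dipeptide_composition are hypothetical, not taken from the study.

```python
from itertools import product

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# All 400 ordered pairs, in a fixed order so every protein maps to the
# same feature layout: AA, AC, AD, ..., YW, YY.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence: str) -> list[float]:
    """Return a 400-dimensional vector of adjacent-pair frequencies.

    Pairs containing a non-standard letter (e.g. 'X') are skipped, and
    counts are divided by the number of valid pairs (an assumption).
    """
    sequence = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        # Sequence too short or made only of non-standard letters.
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


if __name__ == "__main__":
    features = dipeptide_composition("ACAC")
    # Adjacent pairs are AC, CA, AC.
    print(features[DIPEPTIDES.index("AC")], features[DIPEPTIDES.index("CA")])
```

    A fixed-length vector of this kind is what allows sequences of different lengths to be passed to a learner such as an SVM.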

  18. Sequence features accurately predict genome-wide MeCP2 binding in vivo.

    PubMed

    Rube, H Tomas; Lee, Wooje; Hejna, Miroslav; Chen, Huaiyang; Yasui, Dag H; Hess, John F; LaSalle, Janine M; Song, Jun S; Gong, Qizhi

    2016-01-01

    Methyl-CpG binding protein 2 (MeCP2) is critical for proper brain development and expressed at near-histone levels in neurons, but the mechanism of its genomic localization remains poorly understood. Using high-resolution MeCP2-binding data, we show that DNA sequence features alone can predict binding with 88% accuracy. Integrating MeCP2 binding and DNA methylation in a probabilistic graphical model, we demonstrate that previously reported genome-wide association with methylation is in part due to MeCP2's affinity to GC-rich chromatin, a result replicated using published data. Furthermore, MeCP2 co-localizes with nucleosomes. Finally, MeCP2 binding downstream of promoters correlates with increased expression in Mecp2-deficient neurons. PMID:27008915

  19. Sequence features accurately predict genome-wide MeCP2 binding in vivo

    PubMed Central

    Rube, H. Tomas; Lee, Wooje; Hejna, Miroslav; Chen, Huaiyang; Yasui, Dag H.; Hess, John F.; LaSalle, Janine M.; Song, Jun S.; Gong, Qizhi

    2016-01-01

    Methyl-CpG binding protein 2 (MeCP2) is critical for proper brain development and expressed at near-histone levels in neurons, but the mechanism of its genomic localization remains poorly understood. Using high-resolution MeCP2-binding data, we show that DNA sequence features alone can predict binding with 88% accuracy. Integrating MeCP2 binding and DNA methylation in a probabilistic graphical model, we demonstrate that previously reported genome-wide association with methylation is in part due to MeCP2's affinity to GC-rich chromatin, a result replicated using published data. Furthermore, MeCP2 co-localizes with nucleosomes. Finally, MeCP2 binding downstream of promoters correlates with increased expression in Mecp2-deficient neurons. PMID:27008915

  20. Applications of the predictability of the Coherent Noise Model to aftershock sequences

    NASA Astrophysics Data System (ADS)

    Christopoulos, Stavros-Richard; Sarlis, Nicholas

    2014-05-01

    A study [1] of the coherent noise model [2-4] in natural time [5-7] has shown that it exhibits predictability. Interestingly, one of the predictors suggested [1] for the coherent noise model can be generalized and applied to the case of (real) aftershock sequences. The results obtained [8] so far are beyond chance. Here, we apply this approach to several aftershock sequences of strong earthquakes with magnitudes Mw ≥6.9 in Indonesia, California and Greece, including the Mw9.2 earthquake that occurred on 26 December 2004 in Sumatra. References. [1] N. V. Sarlis and S.-R. G. Christopoulos, Predictability of the coherent-noise model and its applications, Physical Review E, 85, 051136, 2012. [2] M.E.J. Newman, Self-organized criticality, evolution and the fossil extinction record, Proc. R. Soc. London B, 263, 1605-1610, 1996. [3] M. E. J. Newman and K. Sneppen, Avalanches, scaling, and coherent noise, Phys. Rev. E, 54, 6226-6231, 1996. [4] K. Sneppen and M. Newman, Coherent noise, scale invariance and intermittency in large systems, Physica D, 110, 209 - 222. [5] P. Varotsos, N. Sarlis, and E. Skordas, Spatiotemporal complexity aspects on the interrelation between Seismic Electric Signals and seismicity, Practica of Athens Academy, 76, 294-321, 2001. [6] P.A. Varotsos, N.V. Sarlis, and E.S. Skordas, Long-range correlations in the electric signals that precede rupture, Phys. Rev. E, 66, 011902, 2002. [7] Varotsos P. A., Sarlis N. V. and Skordas E. S., Natural Time Analysis: The new view of time. Precursory Seismic Electric Signals, Earthquakes and other Complex Time-Series (Springer-Verlag, Berlin Heidelberg) 2011. [8] N. V. Sarlis and S.-R. G. Christopoulos, "Visualization of the significance of Receiver Operating Characteristics based on confidence ellipses", Computer Physics Communications, http://dx.doi.org/10.1016/j.cpc.2013.12.009

  1. Rapid antibiotic-resistance predictions from genome sequence data for Staphylococcus aureus and Mycobacterium tuberculosis.

    PubMed

    Bradley, Phelim; Gordon, N Claire; Walker, Timothy M; Dunn, Laura; Heys, Simon; Huang, Bill; Earle, Sarah; Pankhurst, Louise J; Anson, Luke; de Cesare, Mariateresa; Piazza, Paolo; Votintseva, Antonina A; Golubchik, Tanya; Wilson, Daniel J; Wyllie, David H; Diel, Roland; Niemann, Stefan; Feuerriegel, Silke; Kohl, Thomas A; Ismail, Nazir; Omar, Shaheed V; Smith, E Grace; Buck, David; McVean, Gil; Walker, A Sarah; Peto, Tim E A; Crook, Derrick W; Iqbal, Zamin

    2015-01-01

    The rise of antibiotic-resistant bacteria has led to an urgent need for rapid detection of drug resistance in clinical samples, and improvements in global surveillance. Here we show how de Bruijn graph representation of bacterial diversity can be used to identify species and resistance profiles of clinical isolates. We implement this method for Staphylococcus aureus and Mycobacterium tuberculosis in a software package ('Mykrobe predictor') that takes raw sequence data as input, and generates a clinician-friendly report within 3 minutes on a laptop. For S. aureus, the error rates of our method are comparable to gold-standard phenotypic methods, with sensitivity/specificity of 99.1%/99.6% across 12 antibiotics (using an independent validation set, n=470). For M. tuberculosis, our method predicts resistance with sensitivity/specificity of 82.6%/98.5% (independent validation set, n=1,609); sensitivity is lower here, probably because of limited understanding of the underlying genetic mechanisms. We give evidence that minor alleles improve detection of extremely drug-resistant strains, and demonstrate feasibility of the use of emerging single-molecule nanopore sequencing techniques for these purposes. PMID:26686880
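
    The abstract refers to a de Bruijn graph representation without giving details. As a generic illustration only, and not the implementation used by Mykrobe predictor, the sketch below builds a de Bruijn graph from reads and tests whether a query sequence is spelled by a path in it. The function names build_de_bruijn and contains_path, the toy reads, and the choice of k are all hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads: list[str], k: int) -> dict[str, set[str]]:
    """Build a de Bruijn graph: nodes are (k-1)-mers, and every k-mer in
    a read adds an edge from its prefix to its suffix."""
    graph: dict[str, set[str]] = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def contains_path(graph: dict[str, set[str]], sequence: str, k: int) -> bool:
    """Return True if every k-mer of `sequence` is an edge in `graph`."""
    if len(sequence) < k:
        return False
    for i in range(len(sequence) - k + 1):
        prefix = sequence[i:i + k - 1]
        suffix = sequence[i + 1:i + k]
        if suffix not in graph.get(prefix, set()):
            return False
    return True


if __name__ == "__main__":
    toy_graph = build_de_bruijn(["ACGTAC"], k=3)
    print(contains_path(toy_graph, "CGTA", k=3))  # query spelled by the read
    print(contains_path(toy_graph, "ACGG", k=3))  # query with a variant base
```

    In this toy setting, checking for a known allele reduces to asking whether its k-mers all appear as edges in the graph built from the sample's reads.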

  2. Rapid antibiotic-resistance predictions from genome sequence data for Staphylococcus aureus and Mycobacterium tuberculosis

    PubMed Central

    Bradley, Phelim; Gordon, N. Claire; Walker, Timothy M.; Dunn, Laura; Heys, Simon; Huang, Bill; Earle, Sarah; Pankhurst, Louise J.; Anson, Luke; de Cesare, Mariateresa; Piazza, Paolo; Votintseva, Antonina A.; Golubchik, Tanya; Wilson, Daniel J.; Wyllie, David H.; Diel, Roland; Niemann, Stefan; Feuerriegel, Silke; Kohl, Thomas A.; Ismail, Nazir; Omar, Shaheed V.; Smith, E. Grace; Buck, David; McVean, Gil; Walker, A. Sarah; Peto, Tim E. A.; Crook, Derrick W.; Iqbal, Zamin

    2015-01-01

    The rise of antibiotic-resistant bacteria has led to an urgent need for rapid detection of drug resistance in clinical samples, and improvements in global surveillance. Here we show how de Bruijn graph representation of bacterial diversity can be used to identify species and resistance profiles of clinical isolates. We implement this method for Staphylococcus aureus and Mycobacterium tuberculosis in a software package ('Mykrobe predictor') that takes raw sequence data as input, and generates a clinician-friendly report within 3 minutes on a laptop. For S. aureus, the error rates of our method are comparable to gold-standard phenotypic methods, with sensitivity/specificity of 99.1%/99.6% across 12 antibiotics (using an independent validation set, n=470). For M. tuberculosis, our method predicts resistance with sensitivity/specificity of 82.6%/98.5% (independent validation set, n=1,609); sensitivity is lower here, probably because of limited understanding of the underlying genetic mechanisms. We give evidence that minor alleles improve detection of extremely drug-resistant strains, and demonstrate feasibility of the use of emerging single-molecule nanopore sequencing techniques for these purposes. PMID:26686880

  3. Deep-sequence profiling of miRNAs and their target prediction in Monotropa hypopitys.

    PubMed

    Shchennikova, Anna V; Beletsky, Alexey V; Shulga, Olga A; Mazur, Alexander M; Prokhortchouk, Egor B; Kochieva, Elena Z; Ravin, Nikolay V; Skryabin, Konstantin G

    2016-07-01

    The myco-heterotroph Monotropa hypopitys is a widespread perennial herb used to study symbiotic interactions and physiological mechanisms underlying the development of non-photosynthetic plants. Here, we performed, for the first time, transcriptome-wide characterization of the M. hypopitys miRNA profile using high throughput Illumina sequencing. As a result of small RNA library sequencing and bioinformatic analysis, we identified 55 members belonging to 40 families of known miRNAs and 17 putative novel miRNAs unique for M. hypopitys. Computational screening revealed 206 potential mRNA targets for known miRNAs and 31 potential mRNA targets for novel miRNAs. The predicted target genes were described in Gene Ontology terms and were found to be involved in a broad range of metabolic and regulatory pathways. The identification of novel M. hypopitys-specific miRNAs, some with few target genes and low abundances, suggests their recent evolutionary origin and participation in highly specialized regulatory mechanisms fundamental for the non-photosynthetic biology of M. hypopitys. This global analysis of miRNAs and their potential targets in M. hypopitys provides a framework for further investigation of the role of miRNAs in the evolution and establishment of non-photosynthetic myco-heterotrophs. PMID:27097902

  4. Temporal and Spatial Predictability of an Irrelevant Event Differently Affect Detection and Memory of Items in a Visual Sequence

    PubMed Central

    Ohyama, Junji; Watanabe, Katsumi

    2016-01-01

    We examined how the temporal and spatial predictability of a task-irrelevant visual event affects the detection and memory of a visual item embedded in a continuously changing sequence. Participants observed 11 sequentially presented letters, during which a task-irrelevant visual event was either present or absent. Predictabilities of spatial location and temporal position of the event were controlled in 2 × 2 conditions. In the spatially predictable conditions, the event occurred at the same location within the stimulus sequence or at another location, while, in the spatially unpredictable conditions, it occurred at random locations. In the temporally predictable conditions, the event timing was fixed relative to the order of the letters, while in the temporally unpredictable conditions, it could not be predicted from the letter order. Participants performed a working memory task and a target detection reaction time (RT) task. Memory accuracy was higher for a letter simultaneously presented at the same location as the event in the temporally unpredictable conditions, irrespective of the spatial predictability of the event. On the other hand, the detection RTs were only faster for a letter simultaneously presented at the same location as the event when the event was both temporally and spatially predictable. Thus, to facilitate ongoing detection processes, an event must be predictable both in space and time, while memory processes are enhanced by temporally unpredictable (i.e., surprising) events. Evidently, temporal predictability has differential effects on detection and memory of a visual item embedded in a sequence of images. PMID:26869966

  5. Temporal and Spatial Predictability of an Irrelevant Event Differently Affect Detection and Memory of Items in a Visual Sequence.

    PubMed

    Ohyama, Junji; Watanabe, Katsumi

    2016-01-01

    We examined how the temporal and spatial predictability of a task-irrelevant visual event affects the detection and memory of a visual item embedded in a continuously changing sequence. Participants observed 11 sequentially presented letters, during which a task-irrelevant visual event was either present or absent. Predictabilities of spatial location and temporal position of the event were controlled in 2 × 2 conditions. In the spatially predictable conditions, the event occurred at the same location within the stimulus sequence or at another location, while, in the spatially unpredictable conditions, it occurred at random locations. In the temporally predictable conditions, the event timing was fixed relative to the order of the letters, while in the temporally unpredictable conditions, it could not be predicted from the letter order. Participants performed a working memory task and a target detection reaction time (RT) task. Memory accuracy was higher for a letter simultaneously presented at the same location as the event in the temporally unpredictable conditions, irrespective of the spatial predictability of the event. On the other hand, the detection RTs were only faster for a letter simultaneously presented at the same location as the event when the event was both temporally and spatially predictable. Thus, to facilitate ongoing detection processes, an event must be predictable both in space and time, while memory processes are enhanced by temporally unpredictable (i.e., surprising) events. Evidently, temporal predictability has differential effects on detection and memory of a visual item embedded in a sequence of images. PMID:26869966

  6. Amino-terminal amino acid sequence of the major structural polypeptides of avian retroviruses: sequence homology between reticuloendotheliosis virus p30 and p30s of mammalian retroviruses.

    PubMed Central

    Hunter, E; Bhown, A S; Bennett, J C

    1978-01-01

    The major structural polypeptides, p30 of reticuloendotheliosis virus (REV) (strain T) and p27 of avian sarcoma virus B77, have been compared with regard to amino acid composition, NH2-terminal amino acid sequence, and immunological crossreactions. The amino acid composition of the two polypeptides is distinct, and a comparison of the first 30 NH2-terminal amino acids of REV p30 with that for the first 25 of B77 p27 yields only three homologous residues. In competition radioimmunoassays the polypeptides show no crossreactivity. A comparison of the amino acid composition and NH2-terminal amino acid sequence of REV p30 with those reported for several mammalian retrovirus p30s shows remarkable similarities. Both REV and mammalian p30s contain a large number of polar residues in their amino acid composition and show approximately 40% homology in the first 30 NH2-terminal amino acids. No crossreactivity could be observed, however, in competition radioimmunoassays between Rauscher murine leukemia virus p30 and that of REV. The observations reported here suggest a close evolutionary relationship between REV and the mammalian retroviruses. PMID:208072

  7. Purification and amino acid sequence of aminopeptidase P from pig kidney.

    PubMed

    Vergas Romero, C; Neudorfer, I; Mann, K; Schäfer, W

    1995-04-01

    Aminopeptidase P from kidney cortex was purified in high yield (recovery ≥20%) by a series of column chromatographic steps after solubilization of the membrane-bound glycoprotein with n-butanol. A coupled enzymic assay, using Gly-Pro-Pro-NH-Nap as substrate and dipeptidyl-peptidase IV as auxiliary enzyme, was used to monitor the purification. The purification procedure yielded two forms of aminopeptidase P differing in their carbohydrate composition (glycoforms). Both enzyme preparations were homogeneous as assessed by SDS/PAGE silver staining and isoelectric focusing. Both forms possessed the same substrate specificity, catalysed the same reaction, and consisted of identical protein chains. The amino acid sequence determined by Edman degradation and mass spectrometry consisted of 623 amino acids. Six N-glycosylation sites, all contained in the N-terminal half of the protein, were characterized. PMID:7744038

  8. Draft Genome Sequence of Cupriavidus sp. Strain SK-3, a 4-Chlorobiphenyl- and 4-Chlorobenzoic Acid-Degrading Bacterium

    PubMed Central

    Vilo, Claudia; Benedik, Michael J.; Ilori, Matthew

    2014-01-01

    We report the draft genome sequence of Cupriavidus sp. strain SK-3, which can use 4-chlorobiphenyl and 4-chlorobenzoic acid as the sole carbon source for growth. The draft genome sequence allowed the study of the polychlorinated biphenyl degradation mechanism and the recharacterization of the strain SK-3 as a Cupriavidus species. PMID:24994805

  9. Draft Genome Sequence of Bacillus subtilis subsp. natto Strain CGMCC 2108, a High Producer of Poly-γ-Glutamic Acid

    PubMed Central

    Tan, Siyuan; Su, Anping; Zhang, Chen; Ren, Yuanyuan

    2016-01-01

    Here, we report the 4.1-Mb draft genome sequence of Bacillus subtilis subsp. natto strain CGMCC 2108, a high producer of poly-γ-glutamic acid (γ-PGA). This sequence will provide further help for the biosynthesis of γ-PGA and will greatly facilitate research efforts in metabolic engineering of B. subtilis subsp. natto strain CGMCC 2108. PMID:27231363

  10. New monoclonal antibodies to the Ebola virus glycoprotein: Identification and analysis of the amino acid sequence of the variable domains.

    PubMed

    Panina, A A; Aliev, T K; Shemchukova, O B; Dement'yeva, I G; Varlamov, N E; Pozdnyakova, L P; Bokov, M N; Dolgikh, D A; Sveshnikov, P G; Kirpichnikov, M P

    2016-03-01

    We determined the nucleotide and amino acid sequences of variable domains of three new monoclonal antibodies to the glycoprotein of Ebola virus capsid. The framework and hypervariable regions of immunoglobulin heavy and light chains were identified. The primary structures were confirmed using mass spectrometry analysis. Immunoglobulin database search showed the uniqueness of the sequences obtained. PMID:27193713

  11. Genome Sequence of the Lactic Acid Bacterium Lactococcus lactis subsp. lactis TOMSC161, Isolated from a Nonscalded Curd Pressed Cheese

    PubMed Central

    Velly, H.; Abraham, A.-L.; Loux, V.; Delacroix-Buchet, A.; Fonseca, F.; Bouix, M.

    2014-01-01

    Lactococcus lactis is a lactic acid bacterium used in the production of many fermented foods, such as dairy products. Here, we report the genome sequence of L. lactis subsp. lactis TOMSC161, isolated from nonscalded curd pressed cheese. This genome sequence provides information in relation to dairy environment adaptation. PMID:25377704

  12. Draft Genome Sequence of Bacillus subtilis subsp. natto Strain CGMCC 2108, a High Producer of Poly-γ-Glutamic Acid.

    PubMed

    Tan, Siyuan; Meng, Yonghong; Su, Anping; Zhang, Chen; Ren, Yuanyuan

    2016-01-01

    Here, we report the 4.1-Mb draft genome sequence of Bacillus subtilis subsp. natto strain CGMCC 2108, a high producer of poly-γ-glutamic acid (γ-PGA). This sequence will provide further help for the biosynthesis of γ-PGA and will greatly facilitate research efforts in metabolic engineering of B. subtilis subsp. natto strain CGMCC 2108. PMID:27231363

  13. Incorporating substrate sequence motifs and spatial amino acid composition to identify kinase-specific phosphorylation sites on protein three-dimensional structures

    PubMed Central

    2013-01-01

    Background Protein phosphorylation catalyzed by kinases plays crucial regulatory roles in cellular processes. Given the availability of high-throughput mass spectrometry-based experiments, there is growing motivation to annotate the catalytic kinases for in vivo phosphorylation sites. Thus, a variety of computational methods have been developed for performing large-scale prediction of kinase-specific phosphorylation sites. However, most of the proposed methods rely solely on the local amino acid sequences surrounding the phosphorylation sites. An increasing number of three-dimensional structures makes it possible to physically investigate the structural environment of phosphorylation sites. Results In this work, all of the experimental phosphorylation sites are mapped to the protein entries of the Protein Data Bank by sequence identity. This resulted in a total of 4508 phosphorylation sites containing the protein three-dimensional (3D) structures. To identify phosphorylation sites on protein 3D structures, this work incorporates support vector machines (SVMs) with the information of linear motifs and spatial amino acid composition, which is determined for each kinase group by calculating the relative frequencies of the 20 amino acid types within a specific radial distance from the central phosphorylated amino acid residue. In cross-validation, most of the kinase-specific models trained with structural information outperform the models considering only the sequence information. Furthermore, an independent testing set, not included in the training set, demonstrated that the proposed method could provide performance comparable to other popular tools. Conclusion The proposed method is shown to be capable of predicting kinase-specific phosphorylation sites on 3D structures and has been implemented as a web server which is freely accessible at http://csb.cse.yzu.edu.tw/PhosK3D/. Due to the difficulty of identifying the kinase-specific phosphorylation
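
    The spatial amino acid composition described in this abstract (relative frequencies of the 20 amino acid types within a radial distance of the phosphorylated residue) can be illustrated with a minimal Python sketch. It is not the PhosK3D implementation: the function name, the use of one representative coordinate per residue, and the exclusion of the central residue itself are all assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def spatial_composition(residues, center_index, radius):
    """Relative frequency of each amino acid type near a central residue.

    residues: list of (one_letter_code, (x, y, z)) tuples, one representative
              coordinate per residue (assumption: e.g. the C-alpha atom).
    center_index: index of the central (phosphorylated) residue.
    radius: radial cutoff, in the same units as the coordinates.
    """
    center = residues[center_index][1]
    counts = Counter()
    for i, (aa, xyz) in enumerate(residues):
        # Assumption: the central residue is not counted as its own neighbour.
        if i == center_index or aa not in AMINO_ACIDS:
            continue
        if math.dist(center, xyz) <= radius:
            counts[aa] += 1
    total = sum(counts.values())
    # Return a fixed-length, 20-entry profile; all zeros if no neighbours.
    return {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}


# Toy structure (hypothetical coordinates): a serine with three nearby residues.
toy = [("S", (0, 0, 0)), ("A", (1, 0, 0)), ("A", (0, 2, 0)),
       ("K", (0, 0, 3)), ("G", (10, 0, 0))]
profile = spatial_composition(toy, center_index=0, radius=5.0)
```

    Averaging such profiles over all known sites of one kinase group would give a per-group composition; how the original work aggregates or weights them is not stated in the abstract.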

  14. ANTICALIgN: visualizing, editing and analyzing combined nucleotide and amino acid sequence alignments for combinatorial protein engineering.

    PubMed

    Jarasch, Alexander; Kopp, Melanie; Eggenstein, Evelyn; Richter, Antonia; Gebauer, Michaela; Skerra, Arne

    2016-07-01

    ANTICALIgN is an interactive software developed to simultaneously visualize, analyze and modify alignments of DNA and/or protein sequences that arise during combinatorial protein engineering, design and selection. ANTICALIgN combines powerful functions known from currently available sequence analysis tools with unique features for protein engineering, in particular the possibility to display and manipulate nucleotide sequences and their translated amino acid sequences at the same time. ANTICALIgN offers both template-based multiple sequence alignment (MSA), using the unmutated protein as reference, and conventional global alignment, to compare sequences that share an evolutionary relationship. The application of similarity-based clustering algorithms facilitates the identification of duplicates or of conserved sequence features among a set of selected clones. Imported nucleotide sequences from DNA sequence analysis are automatically translated into the corresponding amino acid sequences and displayed, offering numerous options for selecting reading frames, highlighting of sequence features and graphical layout of the MSA. The MSA complexity can be reduced by hiding the conserved nucleotide and/or amino acid residues, thus putting emphasis on the relevant mutated positions. ANTICALIgN is also able to handle suppressed stop codons or even to incorporate non-natural amino acids into a coding sequence. We demonstrate crucial functions of ANTICALIgN in an example of Anticalins selected from a lipocalin random library against the fibronectin extradomain B (ED-B), an established marker of tumor vasculature. Apart from engineered protein scaffolds, ANTICALIgN provides a powerful tool in the area of antibody engineering and for directed enzyme evolution. PMID:27261456
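
    The translation of imported nucleotide sequences in a selectable reading frame, as mentioned in this abstract, can be sketched in Python with the standard genetic code. This is only an illustration: the function and variable names are hypothetical, and suppressed stop codons and non-natural amino acids, which the tool is said to support, are not handled here.

```python
from itertools import product

# Standard genetic code in the conventional T, C, A, G ordering of all
# three codon positions; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, frame=0):
    """Translate a nucleotide string starting at offset `frame` (0, 1 or 2).

    Codons containing ambiguous bases are reported as 'X'; a trailing
    incomplete codon is ignored.
    """
    dna = dna.upper().replace("U", "T")  # accept RNA input as well
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


protein_frame0 = translate("ATGGCCTAA")         # ATG GCC TAA
protein_frame1 = translate("GATGGCC", frame=1)  # skip the leading G
```

    Showing the nucleotide sequence and its translation together then amounts to pairing each codon with the residue it encodes.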

  15. Formation Sequences of Iron Minerals in the Acidic Alteration Products and Variation of Hydrothermal Fluid Conditions

    NASA Astrophysics Data System (ADS)

    Isobe, H.; Yoshizawa, M.

    2008-12-01

    Iron minerals have an important role in environmental issues not only on the Earth but also on other terrestrial planets. Iron mineral species related to alteration products of primary minerals with surface or subsurface fluids are characterized by temperature, acidity and redox conditions of the fluids. We can see various iron-bearing alteration products around fumaroles in geothermal/volcanic areas. In this study, zonal structures of iron minerals in alteration products of the geothermal area are observed to elucidate temporal and spatial variation of hydrothermal fluids. Alteration of the pyroxene-amphibole andesite of Garan-dake volcano, Oita, Japan occurs by the acidic hydrothermal fluid to form cristobalite, leaching out elements other than Si. Hand specimens with unaltered or weakly altered core and cristobalite crust show various sequences of layers. XRD analysis revealed that the alteration degree is represented by abundance of cristobalite. Intermediately altered layers are characterized by the occurrence of alunite, pyrite, kaolinite, goethite and hematite. A specimen with reddish brown core surrounded by cristobalite-rich white crust has brown colored layers at the boundary of the core and the crust. The reddish core is characterized by the occurrence of crystalline hematite, as shown by XRD. Another hand specimen has a light gray core, which represents reduced conditions, and a white cristobalite crust with light brown and reddish brown layers of ferric iron minerals between the core and the crust. On the other hand, hornblende crystals, a typical ferrous iron-bearing mineral of the host rock, are well preserved in some samples with strongly decolorized cristobalite-rich groundmass. Hydrothermal alteration experiments of iron-rich basaltic material show that iron mineral species depend on acidity and temperature of the fluid. Oxidation states of the iron-bearing mineral species are strongly influenced by the acidity and redox conditions. Variations of alteration

  16. Integration of Expressed Sequence Tag Data Flanking Predicted RNA Secondary Structures Facilitates Novel Non-Coding RNA Discovery

    PubMed Central

    Krzyzanowski, Paul M.; Price, Feodor D.; Muro, Enrique M.; Rudnicki, Michael A.; Andrade-Navarro, Miguel A.

    2011-01-01

    Many computational methods have been used to predict novel non-coding RNAs (ncRNAs), but none, to our knowledge, have explicitly investigated the impact of integrating existing cDNA-based Expressed Sequence Tag (EST) data that flank structural RNA predictions. To determine whether flanking EST data can assist in microRNA (miRNA) prediction, we identified genomic sites encoding putative miRNAs by combining functional RNA predictions with flanking EST data in a model consistent with miRNAs undergoing cleavage during maturation. In both human and mouse genomes, we observed that the inclusion of flanking ESTs adjacent to and not overlapping predicted miRNAs significantly improved the performance of various methods of miRNA prediction, including direct high-throughput sequencing of small RNA libraries. We analyzed the expression of hundreds of miRNAs predicted to be expressed during myogenic differentiation using a customized microarray and identified several known and predicted myogenic miRNA hairpins. Our results indicate that integrating ESTs flanking structural RNA predictions improves the quality of cleaved miRNA predictions and suggest that this strategy can be used to predict other non-coding RNAs undergoing cleavage during maturation. PMID:21698286

  17. Integration of expressed sequence tag data flanking predicted RNA secondary structures facilitates novel non-coding RNA discovery.

    PubMed

    Krzyzanowski, Paul M; Price, Feodor D; Muro, Enrique M; Rudnicki, Michael A; Andrade-Navarro, Miguel A

    2011-01-01

    Many computational methods have been used to predict novel non-coding RNAs (ncRNAs), but none, to our knowledge, have explicitly investigated the impact of integrating existing cDNA-based Expressed Sequence Tag (EST) data that flank structural RNA predictions. To determine whether flanking EST data can assist in microRNA (miRNA) prediction, we identified genomic sites encoding putative miRNAs by combining functional RNA predictions with flanking EST data in a model consistent with miRNAs undergoing cleavage during maturation. In both human and mouse genomes, we observed that the inclusion of flanking ESTs adjacent to and not overlapping predicted miRNAs significantly improved the performance of various methods of miRNA prediction, including direct high-throughput sequencing of small RNA libraries. We analyzed the expression of hundreds of miRNAs predicted to be expressed during myogenic differentiation using a customized microarray and identified several known and predicted myogenic miRNA hairpins. Our results indicate that integrating ESTs flanking structural RNA predictions improves the quality of cleaved miRNA predictions and suggest that this strategy can be used to predict other non-coding RNAs undergoing cleavage during maturation. PMID:21698286

  18. Multipolar Electrostatic Energy Prediction for all 20 Natural Amino Acids Using Kriging Machine Learning.

    PubMed

    Fletcher, Timothy L; Popelier, Paul L A

    2016-06-14

    A machine learning method called kriging is applied to the set of all 20 naturally occurring amino acids. Kriging models are built that predict electrostatic multipole moments for all topological atoms in any amino acid based on molecular geometry only. These models then predict molecular electrostatic interaction energies. On the basis of 200 unseen test geometries for each amino acid, no amino acid shows a mean prediction error above 5.3 kJ mol(-1), while the lowest error observed is 2.8 kJ mol(-1). The mean error across the entire set is only 4.2 kJ mol(-1) (or 1 kcal mol(-1)). Charged systems are created by protonating or deprotonating selected amino acids, and these show no significant deviation in prediction error over their neutral counterparts. Similarly, the proposed methodology can also handle amino acids with aromatic side chains, without the need for modification. Thus, we present a generic method capable of accurately capturing multipolar polarizable electrostatics in amino acids. PMID:27224739

  19. Multiple Amino Acid Sequence Alignment Nitrogenase Component 1: Insights into Phylogenetics and Structure-Function Relationships

    PubMed Central

    Howard, James B.; Kechris, Katerina J.; Rees, Douglas C.; Glazer, Alexander N.

    2013-01-01

    Amino acid residues critical for a protein's structure-function are retained by natural selection and these residues are identified by the level of variance in co-aligned homologous protein sequences. The relevant residues in the nitrogen fixation Component 1 α- and β-subunits were identified by the alignment of 95 protein sequences. Proteins were included from species encompassing multiple microbial phyla and diverse ecological niches as well as the nitrogen fixation genotypes, anf, nif, and vnf, which encode proteins associated with cofactors differing at one metal site. After adjusting for differences in sequence length, insertions, and deletions, the remaining >85% of the sequence co-aligned the subunits from the three genotypes. Six Groups, designated Anf, Vnf, and Nif I-IV, were assigned based upon genetic origin, sequence adjustments, and conserved residues. Both subunits subdivided into the same groups. Invariant and single variant residues were identified and were defined as “core” for nitrogenase function. Three species in Group Nif-III, Candidatus Desulforudis audaxviator, Desulfotomaculum kuznetsovii, and Thermodesulfatator indicus, were found to have a seleno-cysteine that replaces one cysteinyl ligand of the 8Fe:7S, P-cluster. Subsets of invariant residues, limited to individual groups, were identified; these unique residues help identify the gene of origin (anf, nif, or vnf) yet should not be considered diagnostic of the metal content of associated cofactors. Fourteen of the 19 residues that compose the cofactor pocket are invariant or single variant; the other five residues are highly variable but do not correlate with the putative metal content of the cofactor. The variable residues are clustered on one side of the cofactor, away from other functional centers in the three dimensional structure. Many of the invariant and single variant residues were not previously recognized as potentially critical and their identification provides the bases
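
    Identifying invariant and single variant positions in a set of co-aligned sequences, as this abstract describes, can be sketched in a few lines of Python. The sketch rests on assumptions not confirmed by the abstract: "single variant" is taken to mean a column with exactly two distinct residues, columns containing a gap character are skipped, and the function name is hypothetical.

```python
def classify_columns(alignment, gap="-"):
    """Label alignment columns as 'invariant' or 'single variant'.

    alignment: list of equal-length aligned sequences (strings).
    Returns {column_index: label} for columns that qualify; columns with
    gaps or with three or more distinct residues are omitted.
    """
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    labels = {}
    for pos, column in enumerate(zip(*alignment)):
        if gap in column:
            continue  # assumption: gapped columns are not classified
        distinct = set(column)
        if len(distinct) == 1:
            labels[pos] = "invariant"
        elif len(distinct) == 2:
            # assumption: one consensus residue plus one alternative
            labels[pos] = "single variant"
    return labels


# Toy alignment (hypothetical sequences, not nitrogenase data).
core = classify_columns(["MKCA", "MKCG", "MRCT"])
```

    For the toy alignment, columns 0 and 2 come out invariant and column 1 single variant, while column 3 (three distinct residues) is left unclassified.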

  20. High Dietary Acid Load Predicts ESRD among Adults with CKD.

    PubMed

    Banerjee, Tanushree; Crews, Deidra C; Wesson, Donald E; Tilea, Anca M; Saran, Rajiv; Ríos-Burrows, Nilka; Williams, Desmond E; Powe, Neil R

    2015-07-01

    Small clinical trials have shown that a reduction in dietary acid load (DAL) improves kidney injury and slows kidney function decline; however, the relationship between DAL and risk of ESRD in a population-based cohort with CKD remains unexamined. We examined the association between DAL, quantified by net acid excretion (NAEes), and progression to ESRD in a nationally representative sample of adults in the United States. Among 1486 adults with CKD age≥20 years enrolled in the National Health and Nutrition Examination Survey III, DAL was determined by 24-h dietary recall questionnaire. The development of ESRD was ascertained over a median 14.2 years of follow-up through linkage with the Medicare ESRD Registry. We used the Fine-Gray competing risks method to estimate the association of high, medium, and low DAL with ESRD after adjusting for demographics, nutritional factors, clinical factors, and kidney function/damage markers and accounting for intervening mortality events. In total, 311 (20.9%) participants developed ESRD. Higher levels of DAL were associated with increased risk of ESRD; relative hazards (95% confidence interval) were 3.04 (1.58 to 5.86) for the highest tertile and 1.81 (0.89 to 3.68) for the middle tertile compared with the lowest tertile in the fully adjusted model. The risk of ESRD associated with DAL tertiles increased as eGFR decreased (P trend=0.001). Among participants with albuminuria, high DAL was strongly associated with ESRD risk (P trend=0.03). In conclusion, high DAL in persons with CKD is independently associated with increased risk of ESRD in a nationally representative population. PMID:25677388

  1. Predicting polarization and nonlinear dielectric response of arbitrary perovskite superlattice sequences

    NASA Astrophysics Data System (ADS)

    Wu, Xifan

    2008-03-01

    A complete theory of epitaxial perovskite superlattices requires an understanding both of epitaxial strain effects and of electrostatic boundary conditions. Here, focusing on the latter issue, we (in collaboration with Massimiliano Stengel, Karin M. Rabe and David Vanderbilt) have carried out first-principles calculations of the nonlinear dielectric properties of short-period "bicolor" and "tricolor" CaTiO3/SrTiO3/BaTiO3 superlattices having the in-plane lattice constant of SrTiO3. In particular, we have calculated the layer polarizations pj as defined using the Wannier-based method of Wu, Diéguez, Rabe and Vanderbilt [X. Wu, O. Diéguez, K. Rabe and D. Vanderbilt, Phys. Rev. Lett. 97, 107602 (2006)] for each neutral BaO, SrO, CaO, or TiO2 layer. We use a cluster expansion (CE) technique to model the layer polarizations pj of a selected set of bicolor superlattices as a function of the displacement field D (which is uniform throughout the insulating superlattice), the chemical identity of the layer itself, and the chemical identity of its neighboring layers. We find that pj is a strongly localized function of its chemical environments at fixed D field, i.e., the dependence on the identity of the neighboring layers decays rapidly with distance. This localized property enables us to arrive at a truncated and simplified CE model which can accurately predict pj(D) in arbitrary layer sequences, both bicolor and tricolor. A similar approach is used to model the dependence of the c lattice constant. With all this information in hand, we can predict the polarization, piezoelectric and nonlinear dielectric response of arbitrary superlattice sequences. The power of the approach is demonstrated by showing that a model fitted only to calculations on inversion-symmetric bicolor superlattices can successfully predict the inversion symmetry breaking in tricolor superlattices such as 2SrTiO3/1BaTiO3/1CaTiO3.

  2. Quantitative trait loci markers derived from whole genome sequence data increases the reliability of genomic prediction.

    PubMed

    Brøndum, R F; Su, G; Janss, L; Sahana, G; Guldbrandtsen, B; Boichard, D; Lund, M S

    2015-06-01

    This study investigated the effect on the reliability of genomic prediction when a small number of significant variants from single marker analysis based on whole genome sequence data were added to the regular 54k single nucleotide polymorphism (SNP) array data. The extra markers were selected with the aim of augmenting the custom low-density Illumina BovineLD SNP chip (San Diego, CA) used in the Nordic countries. The single-marker analysis was done breed-wise on all 16 index traits included in the breeding goals for Nordic Holstein, Danish Jersey, and Nordic Red cattle plus the total merit index itself. Depending on the trait's economic weight, 15, 10, or 5 quantitative trait loci (QTL) were selected per trait per breed and 3 to 5 markers were selected to tag each QTL. After removing duplicate markers (same marker selected for more than one trait or breed) and filtering for high pairwise linkage disequilibrium and assaying performance on the array, a total of 1,623 QTL markers were selected for inclusion on the custom chip. Genomic prediction analyses were performed for Nordic and French Holstein and Nordic Red animals using either a genomic BLUP or a Bayesian variable selection model. When using the genomic BLUP model including the QTL markers in the analysis, reliability was increased by up to 4 percentage points for production traits in Nordic Holstein animals, up to 3 percentage points for Nordic Reds, and up to 5 percentage points for French Holstein. Smaller gains of up to 1 percentage point were observed for mastitis, but only a 0.5 percentage point increase was seen for fertility. When using a Bayesian model, accuracies were generally higher with only 54k data compared with the genomic BLUP approach, but increases in reliability were relatively smaller when QTL markers were included. Results from this study indicate that the reliability of genomic prediction can be increased by including markers significant in genome-wide association studies on whole genome

  3. Predicting protein-protein interactions from primary protein sequences using a novel multi-scale local feature representation scheme and the random forest.

    PubMed

    You, Zhu-Hong; Chan, Keith C C; Hu, Pengwei

    2015-01-01

    The study of protein-protein interactions (PPIs) can be very important for the understanding of biological cellular functions. However, detecting PPIs in the laboratory is both time-consuming and expensive. For this reason, there has been much recent effort to develop techniques for computational prediction of PPIs, as this can complement laboratory procedures and provide an inexpensive way of predicting the most likely set of interactions at the entire proteome scale. Although much progress has already been achieved in this direction, the problem is still far from being solved. More effective approaches are still required to overcome the limitations of the current ones. In this study, a novel Multi-scale Local Descriptor (MLD) feature representation scheme is proposed to extract features from a protein sequence. This scheme can capture multi-scale local information by varying the length of protein-sequence segments. Based on the MLD, an ensemble learning method, the Random Forest (RF) method, is used as classifier. The MLD feature representation scheme facilitates the mining of interaction information from multi-scale continuous amino acid segments, making it easier to capture multiple overlapping continuous binding patterns within a protein sequence. When the proposed method is tested with the PPI data of Saccharomyces cerevisiae, it achieves a prediction accuracy of 94.72% with 94.34% sensitivity at the precision of 98.91%. Extensive experiments are performed to compare our method with existing sequence-based methods. Experimental results show that the performance of our predictor is better than that of several other state-of-the-art predictors, also with the H. pylori dataset. These good results can largely be credited to the learning capabilities of the RF model and the novel MLD feature representation scheme. The experimental results show that the proposed approach can be very promising for predicting PPIs and can be a useful tool for future

  4. Draft Genome Sequences of Gluconobacter cerinus CECT 9110 and Gluconobacter japonicus CECT 8443, Acetic Acid Bacteria Isolated from Grape Must

    PubMed Central

    Sainz, Florencia

    2016-01-01

    We report here the draft genome sequences of Gluconobacter cerinus strain CECT9110 and Gluconobacter japonicus CECT8443, acetic acid bacteria isolated from grape must. Gluconobacter species are well known for their ability to oxidize sugar alcohols into the corresponding acids. Our objective was to select strains that effectively oxidize d-glucose. PMID:27365351

  5. Prediction of Scylla olivacea (Crustacea; Brachyura) peptide hormones using publicly accessible transcriptome shotgun assembly (TSA) sequences.

    PubMed

    Christie, Andrew E

    2016-05-01

    The aquaculture of crabs from the genus Scylla is of increasing economic importance for many Southeast Asian countries. Expansion of Scylla farming has led to increased efforts to understand the physiology and behavior of these crabs, and as such, there are growing molecular resources for them. Here, publicly accessible Scylla olivacea transcriptomic data were mined for putative peptide-encoding transcripts; the proteins deduced from the identified sequences were then used to predict the structures of mature peptide hormones. Forty-nine pre/preprohormone-encoding transcripts were identified, allowing for the prediction of 187 distinct mature peptides. The identified peptides included isoforms of adipokinetic hormone-corazonin-like peptide, allatostatin A, allatostatin B, allatostatin C, bursicon β, CCHamide, corazonin, crustacean cardioactive peptide, crustacean hyperglycemic hormone/molt-inhibiting hormone, diuretic hormone 31, eclosion hormone, FMRFamide-like peptide, HIGSLYRamide, insulin-like peptide, intocin, leucokinin, myosuppressin, neuroparsin, neuropeptide F, orcokinin, pigment dispersing hormone, pyrokinin, red pigment concentrating hormone, RYamide, short neuropeptide F, SIFamide and tachykinin-related peptide, all well-known neuropeptide families. Surprisingly, the tissue used to generate the transcriptome mined here is reported to be testis. Whether or not the testis samples had neural contamination is unknown. However, if the peptides are truly produced by this reproductive organ, it could have far reaching consequences for the study of crustacean endocrinology, particularly in the area of reproductive control. Regardless, this peptidome is the largest thus far predicted for any brachyuran (true crab) species, and will serve as a foundation for future studies of peptidergic control in members of the commercially important genus Scylla. PMID:26965954

  6. From amino acid sequence to bioactivity: The biomedical potential of antitumor peptides.

    PubMed

    Blanco-Míguez, Aitor; Gutiérrez-Jácome, Alberto; Pérez-Pérez, Martín; Pérez-Rodríguez, Gael; Catalán-García, Sandra; Fdez-Riverola, Florentino; Lourenço, Anália; Sánchez, Borja

    2016-06-01

    Chemoprevention is the use of natural and/or synthetic substances to block, reverse, or retard the process of carcinogenesis. In this field, the use of antitumor peptides is of interest as (i) these molecules are small in size, (ii) they show good cell diffusion and permeability, (iii) they affect one or more specific molecular pathways involved in carcinogenesis, and (iv) they are not usually genotoxic. We checked the Web of Science Database (23/11/2015) in order to collect papers reporting on bioactive peptides (1691 records), which were further filtered using search terms such as "antiproliferative," "antitumoral," or "apoptosis" among others. Works reporting the amino acid sequence of an antiproliferative peptide were kept (60 records), and this was complemented with the peptides included in CancerPPD, an extensive resource for antiproliferative peptides and proteins. Peptides were grouped according to one of the following mechanisms of action: inhibition of cell migration, inhibition of tumor angiogenesis, antioxidative mechanisms, inhibition of gene transcription/cell proliferation, induction of apoptosis, disorganization of tubulin structure, cytotoxicity, or unknown mechanisms. The main mechanisms of action of those antiproliferative peptides with known amino acid sequences are presented and, finally, their potential clinical usefulness and future challenges in their application are discussed. PMID:27010507

  7. The amino acid sequences and activities of synergistic hemolysins from Staphylococcus cohnii.

    PubMed

    Mak, Pawel; Maszewska, Agnieszka; Rozalska, Malgorzata

    2008-10-01

    Staphylococcus cohnii ssp. cohnii and S. cohnii ssp. urealyticus are coagulase-negative staphylococci considered for a long time as unable to cause infections. This situation changed recently and pathogenic strains of these bacteria were isolated from hospital environments, patients and medical staff. Most of the isolated strains were resistant to many antibiotics. The present work describes isolation and characterization of several synergistic peptide hemolysins produced by these bacteria and acting as virulence factors responsible for hemolytic and cytotoxic activities. Amino acid sequences of respective hemolysins from S. cohnii ssp. cohnii (named as H1C, H2C and H3C) and S. cohnii ssp. urealyticus (H1U, H2U and H3U) were identical. Peptides H1 and H3 possessed significant amino acid homology to three synergistic hemolysins secreted by Staphylococcus lugdunensis and to a putative antibacterial peptide produced by Staphylococcus saprophyticus ssp. saprophyticus. On the other hand, hemolysin H2 had a unique sequence. All isolated peptides lysed red cells from different mammalian species and exerted a cytotoxic effect on human fibroblasts. PMID:18752624

  8. Clostridium sticklandii, a specialist in amino acid degradation: revisiting its metabolism through its genome sequence

    PubMed Central

    2010-01-01

    Background Clostridium sticklandii belongs to a cluster of non-pathogenic proteolytic clostridia which utilize amino acids as carbon and energy sources. Isolated by T.C. Stadtman in 1954, it has been generally regarded as a "gold mine" for novel biochemical reactions and is used as a model organism for studying metabolic aspects such as the Stickland reaction, coenzyme-B12- and selenium-dependent reactions of amino acids. With the goal of revisiting its carbon, nitrogen, and energy metabolism, and comparing studies with other clostridia, its genome has been sequenced and analyzed. Results C. sticklandii is one of the best biochemically studied proteolytic clostridial species. Useful additional information has been obtained from the sequencing and annotation of its genome, which is presented in this paper. Besides, experimental procedures reveal that C. sticklandii degrades amino acids in a preferential and sequential way. The organism prefers threonine, arginine, serine, cysteine, proline, and glycine, whereas glutamate, aspartate and alanine are excreted. Energy conservation is primarily obtained by substrate-level phosphorylation in fermentative pathways. The reactions catalyzed by different ferredoxin oxidoreductases and the exergonic NADH-dependent reduction of crotonyl-CoA point to a possible chemiosmotic energy conservation via the Rnf complex. C. sticklandii possesses both the F-type and V-type ATPases. The discovery of an as yet unrecognized selenoprotein in the D-proline reductase operon suggests a more detailed mechanism for NADH-dependent D-proline reduction. A rather unusual metabolic feature is the presence of genes for all the enzymes involved in two different CO2-fixation pathways: C. sticklandii harbours both the glycine synthase/glycine reductase and the Wood-Ljungdahl pathways. This unusual pathway combination has retrospectively been observed in only four other sequenced microorganisms. Conclusions Analysis of the C. sticklandii genome and

  9. Complete amino acid sequence of the myoglobin from the Pacific spotted dolphin, Stenella attenuata graffmani.

    PubMed

    Jones, B N; Wang, C C; Dwulet, F E; Lehman, L D; Meuth, J L; Bogardt, R A; Gurd, F R

    1979-04-25

    The complete amino acid sequence of the major component myoglobin from the Pacific spotted dolphin, Stenella attenuata graffmani, was determined by the automated Edman degradation of several large peptides obtained by specific cleavage of the protein. The acetimidated apomyoglobin was selectively cleaved at its two methionyl residues with cyanogen bromide and at its three arginyl residues by trypsin. By subjecting four of these peptides and the apomyoglobin to automated Edman degradation, over 80% of the primary structure of the protein was obtained. The remainder of the covalent structure was determined by the sequence analysis of peptides that resulted from further digestion of the central cyanogen bromide fragment. This fragment was cleaved at its glutamyl residues with staphylococcal protease and its lysyl residues with trypsin. The action of trypsin was restricted to the lysyl residues by chemical modification of the single arginyl residue of the fragment with 1,2-cyclohexanedione. The primary structure of this myoglobin proved to be identical with that from the Atlantic bottlenosed dolphin and Pacific common dolphin but differs from the myoglobins of the killer whale and pilot whale at two positions. The above sequence identities and differences reflect the close taxonomic relationship of these five species of Cetacea. PMID:454657

  10. Isolation and amino acid sequences of squirrel monkey (Saimiri sciurea) insulin and glucagon

    SciTech Connect

    Yu, Jinghua; Eng, J.; Yalow, R.S. (City Univ. of New York, NY)

    1990-12-01

    It was reported two decades ago that insulin was not detectable in the glucose-stimulated state in Saimiri sciurea, the New World squirrel monkey, by a radioimmunoassay system developed with guinea pig anti-pork insulin antibody and labeled pork insulin. With the same system, reasonable levels were observed in rhesus monkeys and chimpanzees. This suggested that New World monkeys, like the New World hystricomorph rodents such as the guinea pig and the coypu, might have insulins whose sequences differ markedly from those of Old World mammals. In this report the authors describe the purification and amino acid sequences of squirrel monkey insulin and glucagon. They demonstrate that the substitutions at B29, B27, A2, A4, and A17 of squirrel monkey insulin are identical with those previously found in another New World primate, the owl monkey (Aotus trivirgatus). The immunologic cross-reactivity of this insulin in their immunoassay system is only a few percent of that of human insulin. It appears that the peptides of the New World monkeys have diverged less from those of the Old World mammals than have those of the New World hystricomorph rodents. The striking improvements in peptide purification and sequencing have the potential for adding new information concerning the evolutionary divergence of species.

  11. Binding site discovery from nucleic acid sequences by discriminative learning of hidden Markov models

    PubMed Central

    Maaskola, Jonas; Rajewsky, Nikolaus

    2014-01-01

    We present a discriminative learning method for pattern discovery of binding sites in nucleic acid sequences based on hidden Markov models. Sets of positive and negative example sequences are mined for sequence motifs whose occurrence frequency varies between the sets. The method offers several objective functions, but we concentrate on mutual information of condition and motif occurrence. We perform a systematic comparison of our method and numerous published motif-finding tools. Our method achieves the highest motif discovery performance, while being faster than most published methods. We present case studies of data from various technologies, including ChIP-Seq, RIP-Chip and PAR-CLIP, of embryonic stem cell transcription factors and of RNA-binding proteins, demonstrating practicality and utility of the method. For the alternative splicing factor RBM10, our analysis finds motifs known to be splicing-relevant. The motif discovery method is implemented in the free software package Discrover. It is applicable to genome- and transcriptome-scale data, makes use of available repeat experiments and, aside from binary contrasts, can also utilize more complex data configurations. PMID:25389269

  12. Nucleotide and derived amino acid sequences of the major porin of Comamonas acidovorans and comparison of porin primary structures.

    PubMed Central

    Gerbl-Rieger, S; Peters, J; Kellermann, J; Lottspeich, F; Baumeister, W

    1991-01-01

    The DNA sequence of the gene which codes for the major outer membrane porin (Omp32) of Comamonas acidovorans has been determined. The structural gene encodes a precursor consisting of 351 amino acid residues with a signal peptide of 19 amino acid residues. Comparisons with amino acid sequences of outer membrane proteins and porins from several other members of the class Proteobacteria and of the Chlamydia trachomatis porin and the Neurospora crassa mitochondrial porin revealed a motif of eight regions of local homology. The results of this analysis are discussed with regard to common structural features of porins. PMID:1848840

  13. Amino acid residue doublet propensity in the protein-RNA interface and its application to RNA interface prediction.

    PubMed

    Kim, Oanh T P; Yura, Kei; Go, Nobuhiro

    2006-01-01

    Protein-RNA interactions play essential roles in a number of regulatory mechanisms for gene expression such as RNA splicing, transport, translation and post-transcriptional control. As the number of available protein-RNA complex 3D structures has increased, it is now possible to statistically examine protein-RNA interactions based on 3D structures. We performed computational analyses of 86 representative protein-RNA complexes retrieved from the Protein Data Bank. Interface residue propensity, a measure of the relative importance of different amino acid residues in the RNA interface, was calculated for each amino acid residue type (residue singlet interface propensity). In addition to the residue singlet propensity, we introduce a new residue-based propensity, which gives a measure of residue pairing preferences in the RNA interface of a protein (residue doublet interface propensity). The residue doublet interface propensity contains much more information than the sum of two singlet propensities alone. The prediction of the RNA interface using the two types of propensities plus a position-specific multiple sequence profile can achieve a specificity of about 80%. The prediction method was then applied to the 3D structure of two mRNA export factors, TAP (Mex67) and UAP56 (Sub2). The prediction enables us to point out candidate RNA interfaces, part of which are consistent with previous experimental studies and may contribute to elucidation of atomic mechanisms of mRNA export. PMID:17130160
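
    As a rough illustration of a residue singlet interface propensity, the sketch below divides the frequency of each residue type in the interface by its frequency in the whole data set. The abstract does not give the exact formula, so this ratio definition, the function name singlet_propensity, and the toy sequences are assumptions made for illustration only. A doublet propensity could be sketched the same way by counting residue pairs instead of single residues.

```python
from collections import Counter

def singlet_propensity(interface_residues, all_residues):
    """Ratio of interface frequency to overall frequency per residue type.

    Assumed definition (not taken from the abstract): a value above 1 means
    the residue type is over-represented in the interface.
    """
    iface = Counter(interface_residues)
    total = Counter(all_residues)
    n_iface = sum(iface.values())
    n_total = sum(total.values())
    return {
        aa: (iface.get(aa, 0) / n_iface) / (total[aa] / n_total)
        for aa in total
    }

# Toy data: one-letter residue codes, invented for this example.
propensity = singlet_propensity("RRK", "RRKAAL")
print(propensity)
```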

  14. Integrating bioinformatic resources to predict transcription factors interacting with cis-sequences conserved in co-regulated genes

    PubMed Central

    2014-01-01

    Background Using motif detection programs it is fairly straightforward to identify conserved cis-sequences in promoters of co-regulated genes. In contrast, the identification of the transcription factors (TFs) interacting with these cis-sequences is much more elaborate. To facilitate this, we explore the possibility of using several bioinformatic and experimental approaches for TF identification. This starts with the selection of co-regulated gene sets and leads first to the prediction and then to the experimental validation of TFs interacting with cis-sequences conserved in the promoters of these co-regulated genes. Results Using the PathoPlant database, 32 up-regulated gene groups were identified with microarray data for drought-responsive gene expression from Arabidopsis thaliana. Application of the binding site estimation suite of tools (BEST) discovered 179 conserved sequence motifs within the corresponding promoters. Using the STAMP web-server, 49 sequence motifs were classified into 7 motif families for which similarities with known cis-regulatory sequences were identified. All motifs were subjected to a footprintDB analysis to predict interacting DNA binding domains from plant TF families. Predictions were confirmed by using a yeast-one-hybrid approach to select interacting TFs belonging to the predicted TF families. TF-DNA interactions were further experimentally validated in yeast and with a Physcomitrella patens transient expression system, leading to the discovery of several novel TF-DNA interactions. Conclusions The present work demonstrates the successful integration of several bioinformatic resources with experimental approaches to predict and validate TFs interacting with conserved sequence motifs in co-regulated genes. PMID:24773781

  15. Accurate Prediction of the Statistics of Repetitions in Random Sequences: A Case Study in Archaea Genomes

    PubMed Central

    Régnier, Mireille; Chassignet, Philippe

    2016-01-01

    Repetitive patterns in genomic sequences have great biological significance as well as algorithmic implications. Analytic combinatorics allows formulas to be derived for the expected length of repetitions in a random sequence. Asymptotic results, which generalize previous work on a binary alphabet, are easily computable. Simulations on random sequences show their accuracy. As an application, the sample case of Archaea genomes illustrates how biological sequences may differ from random sequences. PMID:27376057
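
    The kind of simulation mentioned above can be sketched by measuring the longest repeated substring in a uniformly random sequence. This is only an empirical measurement and does not reproduce the authors' analytic formulas. The function names, the uniform four-letter alphabet, and the seed are assumptions made for illustration.

```python
import random

def has_repeat(seq, k):
    """Return True if some substring of length k occurs at least twice."""
    seen = set()
    for i in range(len(seq) - k + 1):
        piece = seq[i:i + k]
        if piece in seen:
            return True
        seen.add(piece)
    return False

def longest_repeat_length(seq):
    """Length of the longest substring occurring twice (overlaps allowed).

    Binary search is valid because a repeat of length k implies a repeat
    of every shorter length.
    """
    lo, hi = 0, len(seq) - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if has_repeat(seq, mid):
            lo = mid
        else:
            hi = mid - 1
    return max(lo, 0)

def random_sequence(n, alphabet="ACGT", seed=0):
    """Uniform i.i.d. sequence; real genomes are not expected to follow this model."""
    rng = random.Random(seed)
    return "".join(rng.choice(alphabet) for _ in range(n))

print(longest_repeat_length("banana"))  # "ana" occurs twice, so this prints 3
print(longest_repeat_length(random_sequence(10_000)))
```

    Comparing the value obtained for a real genome with the values obtained for random sequences of the same length and composition is one simple way to see how the two differ.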

  16. Accurate Prediction of the Statistics of Repetitions in Random Sequences: A Case Study in Archaea Genomes.

    PubMed

    Régnier, Mireille; Chassignet, Philippe

    2016-01-01

    Repetitive patterns in genomic sequences have great biological significance as well as algorithmic implications. Analytic combinatorics allows formulas to be derived for the expected length of repetitions in a random sequence. Asymptotic results, which generalize previous work on a binary alphabet, are easily computable. Simulations on random sequences show their accuracy. As an application, the sample case of Archaea genomes illustrates how biological sequences may differ from random sequences. PMID:27376057

  17. Prediction of Antimicrobial Peptides Based on Sequence Alignment and Support Vector Machine-Pairwise Algorithm Utilizing LZ-Complexity

    PubMed Central

    Shahrudin, Shahriza

    2015-01-01

    This study concerns an attempt to establish a new method for predicting antimicrobial peptides (AMPs), which are important to the immune system. Researchers have recently become interested in designing alternative drugs based on AMPs because a large number of bacterial strains have become resistant to available antibiotics. However, the AMP design process faces obstacles, as experiments to extract AMPs from protein sequences are costly and require a long set-up time. A computational tool for AMP prediction is therefore needed. In this study, an integrated algorithm is introduced to predict AMPs by combining sequence alignment with a support vector machine (SVM) LZ-complexity pairwise algorithm. When all sequences in the training set are used, the sensitivity of the proposed algorithm is 95.28% in the jackknife test and 87.59% in the independent test; when only the sequences that have less than 70% similarity are used, the sensitivity is 88.74% in the jackknife test and 78.70% in the independent test. Applying the proposed algorithm may allow researchers to effectively predict AMPs from unknown protein peptide sequences with higher sensitivity. PMID:25802839
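
    For readers unfamiliar with LZ complexity, the sketch below counts the phrases in an exhaustive-history parsing of a string, in the style of Lempel and Ziv (1976). The abstract does not say which LZ variant the authors use or how it enters the SVM pairwise kernel, so the variant, the function name, and the example peptide are assumptions made for illustration.

```python
def lz76_complexity(s):
    """Number of phrases in an exhaustive-history parsing of s.

    Each phrase is the shortest substring starting at position i that does
    not already occur in the text preceding its own last character.
    """
    i, count, n = 0, 0, len(s)
    while i < n:
        k = 1
        # Extend the phrase while it can still be copied from earlier text.
        while i + k <= n and s[i:i + k] in s[:i + k - 1]:
            k += 1
        count += 1
        i += k
    return count

# Parsed as 0 | 001 | 10 | 100 | 1000 | 101, giving 6 phrases.
print(lz76_complexity("0001101001000101"))
# Works on any alphabet, e.g. a hypothetical peptide string.
print(lz76_complexity("GIGKFLHSAKKFGKAFVGEIMNS"))
```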

  18. Sequence comparison, molecular modeling, and network analysis predict structural diversity in cysteine proteases from the Cape sundew, Drosera capensis.

    PubMed

    Butts, Carter T; Zhang, Xuhong; Kelly, John E; Roskamp, Kyle W; Unhelkar, Megha H; Freites, J Alfredo; Tahir, Seemal; Martin, Rachel W

    2016-01-01

    Carnivorous plants represent a so far underexploited reservoir of novel proteases with potentially useful activities. Here we investigate 44 cysteine proteases from the Cape sundew, Drosera capensis, predicted from genomic DNA sequences. D. capensis has a large number of cysteine protease genes; analysis of their sequences reveals homologs of known plant proteases, some of which are predicted to have novel properties. Many functionally significant sequence and structural features are observed, including targeting signals and occluding loops. Several of the proteases contain a new type of granulin domain. Although active site residues are conserved, the sequence identity of these proteases to known proteins is moderate to low; therefore, comparative modeling with all-atom refinement and subsequent atomistic MD-simulation is used to predict their 3D structures. The structure prediction data, as well as analysis of protein structure networks, suggest multifarious variations on the papain-like cysteine protease structural theme. This in silico methodology provides a general framework for investigating a large pool of sequences that are potentially useful for biotechnology applications, enabling informed choices about which proteins to investigate in the laboratory. PMID:27471585

  19. Uric Acid Levels Can Predict Metabolic Syndrome and Hypertension in Adolescents: A 10-Year Longitudinal Study

    PubMed Central

    Sun, Hai-Lun; Pei, Dee; Lue, Ko-Huang; Chen, Yen-Lin

    2015-01-01

    The relationships between uric acid and chronic disease risk factors such as metabolic syndrome, type 2 diabetes mellitus, and hypertension have been studied in adults. However, whether these relationships exist in adolescents is unknown. We randomly selected 8,005 subjects who were between 10 and 15 years old at baseline. Measurements of uric acid were used to predict the future occurrence of metabolic syndrome, hypertension, and type 2 diabetes. In total, 5,748 adolescents were enrolled and followed for a median of 7.2 years. Using cutoff points of uric acid for males and females (7.3 and 6.2 mg/dl, respectively), a high level of uric acid was either the second or third best predictor for hypertension in both genders (hazard ratio: 2.920 for males, 5.222 for females; p<0.05). However, uric acid levels failed to predict type 2 diabetes mellitus, and only predicted metabolic syndrome in males (hazard ratio: 1.658; p<0.05). The same results were found in multivariate adjusted analysis. In conclusion, a high level of uric acid indicated a higher likelihood of developing hypertension in both genders and metabolic syndrome in males after 10 years of follow-up. However, uric acid levels did not affect the occurrence of type 2 diabetes in either gender. PMID:26618358

  20. Uric Acid Levels Can Predict Metabolic Syndrome and Hypertension in Adolescents: A 10-Year Longitudinal Study.

    PubMed

    Sun, Hai-Lun; Pei, Dee; Lue, Ko-Huang; Chen, Yen-Lin

    2015-01-01

    The relationships between uric acid and chronic disease risk factors such as metabolic syndrome, type 2 diabetes mellitus, and hypertension have been studied in adults. However, whether these relationships exist in adolescents is unknown. We randomly selected 8,005 subjects who were between 10 and 15 years old at baseline. Measurements of uric acid were used to predict the future occurrence of metabolic syndrome, hypertension, and type 2 diabetes. In total, 5,748 adolescents were enrolled and followed for a median of 7.2 years. Using cutoff points of uric acid for males and females (7.3 and 6.2 mg/dl, respectively), a high level of uric acid was either the second or third best predictor for hypertension in both genders (hazard ratio: 2.920 for males, 5.222 for females; p<0.05). However, uric acid levels failed to predict type 2 diabetes mellitus, and only predicted metabolic syndrome in males (hazard ratio: 1.658; p<0.05). The same results were found in multivariate adjusted analysis. In conclusion, a high level of uric acid indicated a higher likelihood of developing hypertension in both genders and metabolic syndrome in males after 10 years of follow-up. However, uric acid levels did not affect the occurrence of type 2 diabetes in either gender. PMID:26618358

  1. Inter-Protein Sequence Co-Evolution Predicts Known Physical Interactions in Bacterial Ribosomes and the Trp Operon.

    PubMed

    Feinauer, Christoph; Szurmant, Hendrik; Weigt, Martin; Pagnani, Andrea

    2016-01-01

    Interaction between proteins is a fundamental mechanism that underlies virtually all biological processes. Many important interactions are conserved across a large variety of species. The need to maintain interaction leads to a high degree of co-evolution between residues in the interface between partner proteins. The inference of protein-protein interaction networks from the rapidly growing sequence databases is one of the most formidable tasks in systems biology today. We propose here a novel approach based on the Direct-Coupling Analysis of the co-evolution between inter-protein residue pairs. We use ribosomal and trp operon proteins as test cases: for the small and large ribosomal subunits, our approach predicts protein-interaction partners at true-positive rates of 70% and 90%, respectively, within the first 10 predictions, with areas of 0.69 and 0.81, respectively, under the ROC curves for all predictions. In the trp operon, it assigns the two largest interaction scores to the only two interactions experimentally known. On the level of residue interactions we show that for the small and the large ribosomal subunit our approach predicts interacting residues in the system with true-positive rates of 60% and 85% in the first 20 predictions. We use artificial data to show that the performance of our approach depends crucially on the size of the joint multiple sequence alignments and analyze how many sequences would be necessary for a perfect prediction if the sequences were sampled from the same model that we use for prediction. Given the performance of our approach on the test data we speculate that it can be used to detect new interactions, especially in the light of the rapid growth of available sequence data. PMID:26882169
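
    The "true-positive rate within the first N predictions" reported above can be read as the fraction of the N highest-scoring candidate pairs that are known interactions. The sketch below computes that quantity. The function name, the protein labels, and the scores are hypothetical, and the Direct-Coupling Analysis scoring itself is not reproduced.

```python
def top_k_true_positive_rate(scored_pairs, known_pairs, k):
    """Fraction of the k best-scoring pairs that are known interactions.

    scored_pairs: iterable of ((a, b), score); known_pairs: set of frozensets,
    so that (a, b) and (b, a) are treated as the same unordered pair.
    """
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)[:k]
    if not ranked:
        return 0.0
    hits = sum(1 for pair, _ in ranked if frozenset(pair) in known_pairs)
    return hits / len(ranked)

# Invented example data.
scores = [(("A", "B"), 0.9), (("A", "C"), 0.7), (("B", "C"), 0.2)]
known = {frozenset(("A", "B")), frozenset(("C", "B"))}
print(top_k_true_positive_rate(scores, known, k=2))  # 1 of the top 2 is known: 0.5
```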

  2. Inter-Protein Sequence Co-Evolution Predicts Known Physical Interactions in Bacterial Ribosomes and the Trp Operon

    PubMed Central

    Feinauer, Christoph; Szurmant, Hendrik; Weigt, Martin; Pagnani, Andrea

    2016-01-01

    Interaction between proteins is a fundamental mechanism that underlies virtually all biological processes. Many important interactions are conserved across a large variety of species. The need to maintain interaction leads to a high degree of co-evolution between residues in the interface between partner proteins. The inference of protein-protein interaction networks from the rapidly growing sequence databases is one of the most formidable tasks in systems biology today. We propose here a novel approach based on the Direct-Coupling Analysis of the co-evolution between inter-protein residue pairs. We use ribosomal and trp operon proteins as test cases: for the small and large ribosomal subunits, our approach predicts protein-interaction partners at true-positive rates of 70% and 90%, respectively, within the first 10 predictions, with areas of 0.69 and 0.81, respectively, under the ROC curves for all predictions. In the trp operon, it assigns the two largest interaction scores to the only two interactions experimentally known. On the level of residue interactions we show that for the small and the large ribosomal subunit our approach predicts interacting residues in the system with true-positive rates of 60% and 85% in the first 20 predictions. We use artificial data to show that the performance of our approach depends crucially on the size of the joint multiple sequence alignments and analyze how many sequences would be necessary for a perfect prediction if the sequences were sampled from the same model that we use for prediction. Given the performance of our approach on the test data we speculate that it can be used to detect new interactions, especially in the light of the rapid growth of available sequence data. PMID:26882169

  3. gDNA-Prot: Predict DNA-binding proteins by employing support vector machine and a novel numerical characterization of protein sequence.

    PubMed

    Zhang, Yan-Ping; Wuyunqiqige; Zheng, Wei; Liu, Shuyi; Zhao, Chunguang

    2016-10-01

    DNA-binding proteins are functional proteins in cells that play an important role in various essential biological activities. This paper proposes gDNA-Prot, an effective and fast computational method for predicting DNA-binding proteins that combines a support vector machine classifier with a novel kind of feature called graphical representation. The DNA-binding protein sequence information was described with the 20 probabilities of amino acids and the 23 new numerical graphical representation features of a protein sequence, based on 23 physicochemical properties of 20 amino acids. Principal Components Analysis (PCA) was employed as the feature selection method for removing irrelevant features and reducing redundant features. The sigmoid function and min-max normalization methods for PCA were applied to accelerate the training speed and obtain higher accuracy. Experiments demonstrated that Principal Components Analysis with the sigmoid function generated the best performance. The gDNA-Prot method was also compared with DNAbinder, iDNA-Prot and DNA-Prot. The results suggested that gDNA-Prot outperformed DNAbinder and iDNA-Prot. Although DNA-Prot outperformed gDNA-Prot, gDNA-Prot was faster and more convenient for predicting DNA-binding proteins. Additionally, the proposed gDNA-Prot method is available at http://sourceforge.net/projects/gdnaprot. PMID:27378005
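
    The two scaling methods named above can be sketched as follows. The abstract does not say exactly how they were applied to the features or combined with PCA, so the function names, the per-feature application, and the sample values are assumptions made for illustration.

```python
import math

def min_max_normalize(values):
    """Linearly rescale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def sigmoid(x):
    """Logistic function, squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

feature = [2.0, 4.0, 6.0]  # invented feature column
print(min_max_normalize(feature))      # [0.0, 0.5, 1.0]
print([sigmoid(v) for v in feature])
```

    Min-max scaling depends on the extreme values of the column, while the sigmoid compresses outliers smoothly, which is one plausible reason the two could behave differently in training.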

  4. Integrated protein function prediction by mining function associations, sequences, and protein–protein and gene–gene interaction networks

    PubMed Central

    Cao, Renzhi; Cheng, Jianlin

    2016-01-01

    Motivations Protein function prediction is an important and challenging problem in bioinformatics and computational biology. Functionally relevant biological information such as protein sequences, gene expression, and protein–protein interactions has been used mostly separately for protein function prediction. One of the major challenges is how to effectively integrate multiple sources of both traditional and new information such as spatial gene–gene interaction networks generated from chromosomal conformation data together to improve protein function prediction. Results In this work, we developed three different probabilistic scores (MIS, SEQ, and NET score) to combine protein sequence, function associations, and protein–protein interaction and spatial gene–gene interaction networks for protein function prediction. The MIS score is mainly generated from homologous proteins found by PSI-BLAST search, and also association rules between Gene Ontology terms, which are learned by mining the Swiss-Prot database. The SEQ score is generated from protein sequences. The NET score is generated from protein–protein interaction and spatial gene–gene interaction networks. These three scores were combined in a new Statistical Multiple Integrative Scoring System (SMISS) to predict protein function. We tested SMISS on the data set of 2011 Critical Assessment of Function Annotation (CAFA). The method performed substantially better than three base-line methods and an advanced method based on protein profile–sequence comparison, profile–profile comparison, and domain co-occurrence networks according to the maximum F-measure. PMID:26370280
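
    The maximum F-measure used for evaluation above is the best F1 score obtained over a range of score thresholds. The sketch below is a simplified, flat version over one list of scored predictions. CAFA itself averages precision and recall per protein before combining them, which is not reproduced here, and the function name and example data are assumptions made for illustration.

```python
def max_f_measure(scores, labels, thresholds):
    """Best F1 over the given thresholds; a prediction is positive if score >= t."""
    best = 0.0
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
        if tp == 0:
            continue  # precision or recall is zero, so F1 is zero
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best

# Invented scores and true labels for four candidate annotations.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
print(max_f_measure(scores, labels, thresholds=[0.2, 0.5, 0.85]))
```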

  5. Calibration and prediction of amino acids in stevia leaf powder using near infrared reflectance spectroscopy.

    PubMed

    Li, Guan; Wang, Ruiguo; Quampah, Alfred Julius; Rong, Zhengqin; Shi, Chunhai; Wu, Jianguo

    2011-12-28

    The use of stevia as an animal feed additive has been researched over the years, but the rapid prediction of its amino acid contents by near-infrared reflectance spectroscopy has not yet been studied. In the present study, 301 samples of stevia leaf powder were defined as the calibration set from which calibration models were optimized, and the performance of prediction was evaluated. Compared with other mathematical treatments, the models developed with the "1, 12, 12, 1" treatment, combined with modified partial least-squares regression and standard normal variate with de-trending, had a significant potential in predicting amino acid contents, such as threonine, serine, etc. Six spectral regions were found to possess large spectrum variation and show high contribution to calibration models. From the present study, the calibration models of amino acids in stevia were successfully developed and could be applied to quality control in feed processing, breeding selection and mutant screening. PMID:22066716

  6. Prediction of nucleic acid binding probability in proteins: a neighboring residue network based score.

    PubMed

    Miao, Zhichao; Westhof, Eric

    2015-06-23

    We describe a general binding score for predicting the nucleic acid binding probability in proteins. The score is directly derived from physicochemical and evolutionary features and integrates a residue neighboring network approach. Our process achieves stable and high accuracies on both DNA- and RNA-binding proteins and illustrates how the main driving forces for nucleic acid binding are common. Because of the effective integration of the synergetic effects of the network of neighboring residues and the fact that the prediction yields a hierarchical scoring on the protein surface, energy funnels for nucleic acid binding appear on protein surfaces, pointing to the dynamic process occurring in the binding of nucleic acids to proteins. PMID:25940624

  7. Prediction of nucleic acid binding probability in proteins: a neighboring residue network based score

    PubMed Central

    Miao, Zhichao; Westhof, Eric

    2015-01-01

    We describe a general binding score for predicting the nucleic acid binding probability in proteins. The score is directly derived from physicochemical and evolutionary features and integrates a residue neighboring network approach. Our process achieves stable and high accuracies on both DNA- and RNA-binding proteins and illustrates how the main driving forces for nucleic acid binding are common. Because of the effective integration of the synergetic effects of the network of neighboring residues and the fact that the prediction yields a hierarchical scoring on the protein surface, energy funnels for nucleic acid binding appear on protein surfaces, pointing to the dynamic process occurring in the binding of nucleic acids to proteins. PMID:25940624

  8. Prediction of liquid-liquid equilibrium for systems of vegetable oils, fatty acids, and ethanol

    SciTech Connect

    Batista, E.; Monnerat, S.; Stragevitch, L.; Pina, C.G.; Goncalves, C.B.; Meirelles, A.J.A.

    1999-12-01

    Group interaction parameters for the UNIFAC and ASOG models were specially adjusted for predicting liquid-liquid equilibrium (LLE) for systems of vegetable oils, fatty acids, and ethanol at temperatures ranging from 20 to 45 °C. Experimental liquid-liquid equilibrium data for systems of triolein, oleic acid, and ethanol and of triolein, stearic acid, and ethanol were measured and utilized in the adjustment. The average percent deviation between experimental and calculated compositions was 0.79% and 0.52% for the UNIFAC and ASOG models, respectively. The prediction of liquid-liquid equilibrium for systems of vegetable oils, fatty acids, and ethanol was quite successful, with an average deviation of 1.31% and 1.32% for the UNIFAC and ASOG models, respectively.

  9. The Genome Sequence of the Highly Acetic Acid-Tolerant Zygosaccharomyces bailii-Derived Interspecies Hybrid Strain ISA1307, Isolated From a Sparkling Wine Plant

    PubMed Central

    Mira, Nuno P.; Münsterkötter, Martin; Dias-Valada, Filipa; Santos, Júlia; Palma, Margarida; Roque, Filipa C.; Guerreiro, Joana F.; Rodrigues, Fernando; Sousa, Maria João; Leão, Cecília; Güldener, Ulrich; Sá-Correia, Isabel

    2014-01-01

    This work describes the sequencing and annotation of the genome of the yeast strain ISA1307, isolated from a sparkling wine continuous production plant. This strain, formerly considered to belong to the Zygosaccharomyces bailii species, has been used to study Z. bailii physiology, in particular its extreme tolerance to acetic acid stress at low pH. The analysis of the genome sequence described in this work indicates that strain ISA1307 is an interspecies hybrid between Z. bailii and a closely related species. The genome sequence of ISA1307 is distributed through 154 scaffolds and has a size of around 21.2 Mb, corresponding to 96% of the genome size estimated by flow cytometry. Annotation of the ISA1307 genome includes 4385 duplicated genes (∼90% of the total number of predicted genes) and 1155 predicted single-copy genes. The functional categories with the highest numbers of genes are ‘Metabolism and generation of energy’, ‘Protein folding, modification and targeting’ and ‘Biogenesis of cellular components’. Knowledge of the genome sequence of the ISA1307 strain is expected to help accelerate systems-level understanding of stress resistance mechanisms in Z. bailii and to inspire and guide novel biotechnological applications of this yeast species/strain in fermentation processes, given its high resilience to acidic stress. The availability of the ISA1307 genome sequence also paves the way to a better understanding of the genetic mechanisms underlying the generation and selection of more robust hybrid yeast strains in the stressful environment of wine fermentations. PMID:24453040

  10. Predicting whole genome protein interaction networks from primary sequence data in model and non-model organisms using ENTS

    PubMed Central

    2013-01-01

    Background The large-scale identification of physical protein-protein interactions (PPIs) is an important step toward understanding how biological networks evolve and generate emergent phenotypes. However, experimental identification of PPIs is a laborious and error-prone process, and current methods of PPI prediction tend to be highly conservative or require large amounts of functional data that may not be available for newly-sequenced organisms. Results In this study we demonstrate a random-forest based technique, ENTS, for the computational prediction of protein-protein interactions based only on primary sequence data. Our approach is able to efficiently predict interactions on a whole-genome scale for any eukaryotic organism, using pairwise combinations of conserved domains and predicted subcellular localization of proteins as input features. We present the first predicted interactome for the forest tree Populus trichocarpa in addition to the predicted interactomes for Saccharomyces cerevisiae, Homo sapiens, Mus musculus, and Arabidopsis thaliana. Comparing our approach to other PPI predictors, we find that ENTS performs comparably to or better than a number of existing approaches, including several that utilize a variety of functional information for their predictions. We also find that the predicted interactions are biologically meaningful, as indicated by similarity in functional annotations and enrichment of co-expressed genes in public microarray datasets. Furthermore, we demonstrate some of the biological insights that can be gained from these predicted interaction networks. We show that the predicted interactions yield informative groupings of P. trichocarpa metabolic pathways, literature-supported associations among human disease states, and theory-supported insight into the evolutionary dynamics of duplicated genes in paleopolyploid plants. Conclusion We conclude that the ENTS classifier will be a valuable tool for the de novo annotation of genome

  11. Machine learning and hurdle models for improving regional predictions of stream water acid neutralizing capacity

    NASA Astrophysics Data System (ADS)

    Povak, Nicholas A.; Hessburg, Paul F.; Reynolds, Keith M.; Sullivan, Timothy J.; McDonnell, Todd C.; Salter, R. Brion

    2013-06-01

    In many industrialized regions of the world, atmospherically deposited sulfur derived from industrial, nonpoint air pollution sources reduces stream water quality and results in acidic conditions that threaten aquatic resources. Accurate maps of predicted stream water acidity are an essential aid to managers who must identify acid-sensitive streams, potentially affected biota, and create resource protection strategies. In this study, we developed correlative models to predict the acid neutralizing capacity (ANC) of streams across the southern Appalachian Mountain region, USA. Models were developed using stream water chemistry data from 933 sampled locations and continuous maps of pertinent environmental and climatic predictors. Environmental predictors were averaged across the upslope contributing area for each sampled stream location and submitted to both statistical and machine-learning regression models. Predictor variables represented key aspects of the contributing geology, soils, climate, topography, and acidic deposition. To reduce model error rates, we employed hurdle modeling to screen out well-buffered sites and predict continuous ANC for the remainder of the stream network. Models predicted acid-sensitive streams in forested watersheds with small contributing areas, siliceous lithologies, cool and moist environments, low clay content soils, and moderate or higher dry sulfur deposition. Our results confirmed findings from other studies and further identified several influential climatic variables and variable interactions. Model predictions indicated that one quarter of the total stream network was sensitive to additional sulfur inputs (i.e., ANC < 100 µeq L-1), while <10% displayed much lower ANC (<50 µeq L-1). These methods may be readily adapted in other regions to assess stream water quality and potential biotic sensitivity to acidic inputs.

  12. Sequence-based prediction of protein-protein interaction sites with L1-logreg classifier.

    PubMed

    Dhole, Kaustubh; Singh, Gurdeep; Pai, Priyadarshini P; Mondal, Sukanta

    2014-05-01

    Protein-protein interactions are of central importance for virtually every process in a living cell. Information about the interaction sites in proteins improves our understanding of disease mechanisms and can provide the basis for new therapeutic approaches. Since a multitude of unique residue-residue contacts facilitate the interactions, prediction of protein-protein interaction sites has become one of the most important and challenging problems of computational biology. Although much progress in this field has been reported, this problem is yet to be satisfactorily solved. Here, a novel method (LORIS: L1-regularized LOgistic Regression based protein-protein Interaction Sites predictor) is proposed that identifies interaction residues using sequence features and is implemented via the L1-logreg classifier. Results show that LORIS is not only quite effective but also performs better than existing state-of-the-art methods. LORIS, available as a standalone package, can be useful for facilitating drug-design and targeted-mutation studies, which require a deeper knowledge of protein interaction sites. PMID:24486250
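    To illustrate what L1-regularized logistic regression does, the sketch below fits one by proximal gradient descent (ISTA) in plain Python. It is a generic textbook formulation, not the solver or feature set used by LORIS; the function names, the absence of an intercept, the hyperparameters, and the two-feature toy data are assumptions for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(w, t):
    """Proximal operator of t * |w|: shrink toward zero, clipping small values to zero."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def fit_l1_logreg(X, y, lam=0.1, lr=0.5, n_iter=200):
    """Minimize mean logistic loss + lam * ||w||_1 (no intercept) with ISTA."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iter):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(d):
                grad[j] += err * xi[j] / n
        # Gradient step on the smooth loss, then shrinkage for the L1 penalty.
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad)]
    return w


# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
X = [[1.0, 0.1], [2.0, -0.1], [-1.0, 0.1], [-2.0, -0.1]]
y = [1, 1, 0, 0]
weights = fit_l1_logreg(X, y)
```

    The shrinkage step is what drives uninformative weights to exactly zero, so the fitted model doubles as a feature selector.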

  13. In Silico Prediction of Mutant HIV-1 Proteases Cleaving a Target Sequence

    PubMed Central

    Jensen, Jan H.; Willemoës, Martin; Winther, Jakob R.; De Vico, Luca

    2014-01-01

    HIV-1 protease represents an appealing system for directed enzyme re-design, since it has various endogenous targets, a relatively simple structure, and is well studied. Recently Chaudhury and Gray (Structure (2009) 17: 1636–1648) published a computational algorithm to discern the specificity determining residues of HIV-1 protease. In this paper we present two computational tools aimed at re-designing HIV-1 protease, derived from the algorithm of Chaudhury and Gray. First, we present an energy-only based methodology to discriminate cleavable and non-cleavable peptides for HIV-1 proteases, both wild type and mutant. Second, we show an algorithm we developed to predict mutant HIV-1 proteases capable of cleaving a new target substrate peptide, different from the natural targets of HIV-1 protease. The obtained in silico mutant enzymes were analyzed in terms of cleavability and specificity towards the target peptide using the energy-only methodology. We found two mutant proteases as best candidates for specificity and cleavability towards the target sequence. PMID:24796579

  14. Full Genome Virus Detection in Fecal Samples Using Sensitive Nucleic Acid Preparation, Deep Sequencing, and a Novel Iterative Sequence Classification Algorithm

    PubMed Central

    Cotten, Matthew; Oude Munnink, Bas; Canuti, Marta; Deijs, Martin; Watson, Simon J.; Kellam, Paul; van der Hoek, Lia

    2014-01-01

    We have developed a full genome virus detection process that combines sensitive nucleic acid preparation optimised for virus identification in fecal material with Illumina MiSeq sequencing and a novel post-sequencing virus identification algorithm. Enriched viral nucleic acid was converted to double-stranded DNA and subjected to Illumina MiSeq sequencing. The resulting short reads were processed with a novel iterative Python algorithm, SLIM, for the identification of sequences with homology to known viruses. De novo assembly was then used to generate full viral genomes. The sensitivity of this process was demonstrated with a set of fecal samples from HIV-1 infected patients. A quantitative assessment of the mammalian, plant, and bacterial virus content of this compartment was generated, and the deep sequencing data were sufficient to assemble 12 complete viral genomes from 6 virus families. The method detected high levels of enteropathic viruses that are normally controlled in healthy adults but may be involved in the pathogenesis of HIV-1 infection. It will provide a powerful tool for virus detection and for analyzing changes in the fecal virome associated with HIV-1 progression and pathogenesis. PMID:24695106

  15. The Effects of Demography and Long-Term Selection on the Accuracy of Genomic Prediction with Sequence Data

    PubMed Central

    MacLeod, Iona M.; Hayes, Ben J.; Goddard, Michael E.

    2014-01-01

    The use of dense SNPs to predict the genetic value of an individual for a complex trait is often referred to as “genomic selection” in livestock and crops, but is also relevant to human genetics to predict, for example, complex genetic disease risk. The accuracy of prediction depends on the strength of linkage disequilibrium (LD) between SNPs and causal mutations. If sequence data were used instead of dense SNPs, accuracy should increase because causal mutations are present, but demographic history and long-term negative selection also influence accuracy. We therefore evaluated genomic prediction, using simulated sequence in two contrasting populations: one reducing from an ancestrally large effective population size (Ne) to a small one, with high LD common in domestic livestock, while the second had a large constant-sized Ne with low LD similar to that in some human or outbred plant populations. There were two scenarios in each population; causal variants were either neutral or under long-term negative selection. For large Ne, sequence data led to a 22% increase in accuracy relative to ∼600K SNP chip data with a Bayesian analysis and a more modest advantage with a BLUP analysis. This advantage increased when causal variants were influenced by negative selection, and accuracy persisted when 10 generations separated reference and validation populations. However, in the reducing Ne population, there was little advantage for sequence even with negative selection. This study demonstrates the joint influence of demography and selection on accuracy of prediction and improves our understanding of how best to exploit sequence for genomic prediction. PMID:25233989

  16. Deformation history and load sequence effects on cumulative fatigue damage and life predictions

    NASA Astrophysics Data System (ADS)

    Colin, Julie

    Fatigue loading seldom involves constant amplitude loading. This is especially true in the cooling systems of nuclear power plants, typically made of stainless steel, where thermal fluctuations and water turbulent flow create variable amplitude loads, with presence of mean stresses and overloads. These complex loading sequences lead to the formation of networks of microcracks (crazing) that can propagate. As stainless steel is a material with strong deformation history effects and phase transformation resulting from plastic straining, such load sequence and variable amplitude loading effects are significant to its fatigue behavior and life predictions. The goal of this study was to investigate the effects of cyclic deformation on fatigue behavior of stainless steel 304L as a deformation history sensitive material and determine how to quantify and accumulate fatigue damage to enable life predictions under variable amplitude loading conditions for such materials. A comprehensive experimental program including testing under fully-reversed, as well as mean stress and/or mean strain conditions, with initial or periodic overloads, along with step testing and random loading histories was conducted on two grades of stainless steel 304L, under both strain-controlled and load-controlled conditions. To facilitate comparisons with a material without deformation history effects, similar tests were also carried out on aluminum 7075-T6. Experimental results are discussed, including peculiarities observed with stainless steel behavior, such as a phenomenon, referred to as secondary hardening characterized by a continuous increase in the stress response in a strain-controlled test and often leading to runout fatigue life. Possible mechanisms for secondary hardening observed in some tests are also discussed. The behavior of aluminum is shown not to be affected by preloading, whereas the behavior of stainless steel is greatly influenced by prior loading. Mean stress relaxation in

  17. Modeling and prediction of retardance in citric acid coated ferrofluid using artificial neural network

    NASA Astrophysics Data System (ADS)

    Lin, Jing-Fung; Sheu, Jer-Jia

    2016-06-01

    Citric acid coated (citrate-stabilized) magnetite (Fe3O4) magnetic nanoparticles have been developed and applied in biomedical fields. Using Taguchi-based measured retardances as the training data, an artificial neural network (ANN) model was developed for the prediction of retardance in citric acid (CA) coated ferrofluid (FF). According to the ANN simulation results in the training stage, the correlation coefficient between predicted and measured retardances was as high as 0.9999998. With the well-trained ANN model, the retardance predicted for the excellent program obtained from the Taguchi method showed a smaller error, 2.17%, than a statistically significant multiple regression (MR) analysis. Meanwhile, the parameter analysis at the excellent program by the ANN model provided guidance for finding a possible program for the maximum retardance. It was concluded that the proposed ANN model had a high ability to predict retardance in CA coated FF.

  18. Evolutionary connections of biological kingdoms based on protein and nucleic acid sequence evidence

    NASA Technical Reports Server (NTRS)

    Dayhoff, M. O.

    1983-01-01

    Prokaryotic and eukaryotic evolutionary trees are developed from protein and nucleic-acid sequences by the methods of numerical taxonomy. Trees are presented for bacterial ferredoxins, 5S ribosomal RNA, c-type cytochromes, cytochromes c2 and c', and 5.8S ribosomal RNA; the implications for early evolution are discussed; and a composite tree showing the branching of the anaerobes, aerobes, archaebacteria, and eukaryotes is shown. Single lines are found for all oxygen-evolving photosynthetic forms and for the salt-loving and high-temperature forms of archaebacteria. It is argued that the eukaryote mitochondria, chloroplasts, and cytoplasmic host material are descended from free-living prokaryotes that formed symbiotic associations, with more than one symbiotic event involved in the evolution of each organelle.

  19. Prediction and Analysis of Quorum Sensing Peptides Based on Sequence Features

    PubMed Central

    Rajput, Akanksha; Gupta, Amit Kumar; Kumar, Manoj

    2015-01-01

    Quorum sensing peptides (QSPs) are the signaling molecules used by Gram-positive bacteria in orchestrating cell-to-cell communication. In spite of their enormous importance in the signaling process, a detailed bioinformatics analysis of them is lacking. In this study, QSPs and non-QSPs were examined according to their amino acid composition, residue position, motifs and physicochemical properties. Compositional analysis shows that QSPs are enriched in aromatic residues like Trp, Tyr and Phe. At the N-terminal, Ser was the dominant residue at most positions, namely, first, second, third and fifth, while Phe was a preferred residue at the first, third and fifth positions from the C-terminal. A few motifs from QSPs were also extracted. Physicochemical properties like aromaticity, molecular weight and secondary structure were found to be distinguishing features of QSPs. Exploiting the above properties, we have developed a Support Vector Machine (SVM) based predictive model. During 10-fold cross-validation, the SVM achieves a maximum accuracy of 93.00%, a Matthews correlation coefficient (MCC) of 0.86 and a Receiver operating characteristic (ROC) of 0.98 on the training/testing dataset (T200p+200n). The developed models performed equally well on the validation dataset (V20p+20n). The server also integrates several useful analysis tools like “QSMotifScan”, “ProtFrag”, “MutGen” and “PhysicoProp”. Our analysis reveals important characteristics of QSPs and on the basis of these unique features, we have developed a prediction algorithm “QSPpred” (freely available at: http://crdd.osdd.net/servers/qsppred). PMID:25781990
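    For readers unfamiliar with the MCC reported above, the sketch below computes accuracy and the Matthews correlation coefficient from a binary confusion matrix. The counts in the usage lines are hypothetical: they were chosen only to be arithmetically consistent with a balanced 200 + 200 dataset at 93% accuracy and are not taken from the paper.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical confusion matrix for 200 positives and 200 negatives.
acc = accuracy(tp=186, tn=186, fp=14, fn=14)
score = mcc(tp=186, tn=186, fp=14, fn=14)
```

    With these illustrative counts the two metrics come out at 0.93 and 0.86. Unlike accuracy, the MCC stays informative when the classes are unbalanced, because all four cells of the confusion matrix enter the formula.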

  20. Implicit learning of predictable sound sequences modulates human brain responses at different levels of the auditory hierarchy

    PubMed Central

    Lecaignard, Françoise; Bertrand, Olivier; Gimenez, Gérard; Mattout, Jérémie; Caclin, Anne

    2015-01-01

    Deviant stimuli, violating regularities in a sensory environment, elicit the mismatch negativity (MMN), largely described in the Event-Related Potential literature. While it is widely accepted that the MMN reflects more than basic change detection, a comprehensive description of mental processes modulating this response is still lacking. Within the framework of predictive coding, deviance processing is part of an inference process where prediction errors (the mismatch between incoming sensations and predictions established through experience) are minimized. In this view, the MMN is a measure of prediction error, which yields specific expectations regarding its modulations by various experimental factors. In particular, it predicts that the MMN should decrease as the occurrence of a deviance becomes more predictable. We conducted a passive oddball EEG study and manipulated the predictability of sound sequences by means of different temporal structures. Importantly, our design allows comparing mismatch responses elicited by predictable and unpredictable violations of a simple repetition rule and therefore departs from previous studies that investigate violations of different time-scale regularities. We observed a decrease of the MMN with predictability and, interestingly, a similar effect at earlier latencies, within 70 ms after deviance onset. Following these pre-attentive responses, a reduced P3a was measured in the case of predictable deviants. We conclude that early and late deviance responses reflect prediction errors, triggering belief updating within the auditory hierarchy. Besides, in this passive study, such perceptual inference appears to be modulated by higher-level implicit learning of sequence statistical structures. Our findings argue for a hierarchical model of auditory processing where predictive coding enables implicit extraction of environmental regularities. PMID:26441602

  1. Implicit learning of predictable sound sequences modulates human brain responses at different levels of the auditory hierarchy.

    PubMed

    Lecaignard, Françoise; Bertrand, Olivier; Gimenez, Gérard; Mattout, Jérémie; Caclin, Anne

    2015-01-01

    Deviant stimuli, violating regularities in a sensory environment, elicit the mismatch negativity (MMN), largely described in the Event-Related Potential literature. While it is widely accepted that the MMN reflects more than basic change detection, a comprehensive description of mental processes modulating this response is still lacking. Within the framework of predictive coding, deviance processing is part of an inference process where prediction errors (the mismatch between incoming sensations and predictions established through experience) are minimized. In this view, the MMN is a measure of prediction error, which yields specific expectations regarding its modulations by various experimental factors. In particular, it predicts that the MMN should decrease as the occurrence of a deviance becomes more predictable. We conducted a passive oddball EEG study and manipulated the predictability of sound sequences by means of different temporal structures. Importantly, our design allows comparing mismatch responses elicited by predictable and unpredictable violations of a simple repetition rule and therefore departs from previous studies that investigate violations of different time-scale regularities. We observed a decrease of the MMN with predictability and, interestingly, a similar effect at earlier latencies, within 70 ms after deviance onset. Following these pre-attentive responses, a reduced P3a was measured in the case of predictable deviants. We conclude that early and late deviance responses reflect prediction errors, triggering belief updating within the auditory hierarchy. Besides, in this passive study, such perceptual inference appears to be modulated by higher-level implicit learning of sequence statistical structures. Our findings argue for a hierarchical model of auditory processing where predictive coding enables implicit extraction of environmental regularities. PMID:26441602

  2. The amino acid alphabet and the architecture of the protein sequence-structure map. I. Binary alphabets.

    PubMed

    Ferrada, Evandro

    2014-12-01

    The correspondence between protein sequences and structures, or sequence-structure map, relates to fundamental aspects of structural, evolutionary and synthetic biology. The specifics of the mapping, such as the fraction of accessible sequences and structures, or the sequences' ability to fold fast, are dictated by the type of interactions between the monomers that compose the sequences. The set of possible interactions between monomers is encapsulated by the potential energy function. In this study, I explore the impact of the relative forces of the potential on the architecture of the sequence-structure map. My observations rely on simple exact models of proteins and random samples of the space of potential energy functions of binary alphabets. I adopt a graph perspective and study the distribution of viable sequences and the structures they produce, as networks of sequences connected by point mutations. I observe that the relative proportion of attractive, neutral and repulsive forces defines types of potentials, that induce sequence-structure maps of vastly different architectures. I characterize the properties underlying these differences and relate them to the structure of the potential. Among these properties are the expected number and relative distribution of sequences associated to specific structures and the diversity of structures as a function of sequence divergence. I study the types of binary potentials observed in natural amino acids and show that there is a strong bias towards only some types of potentials, a bias that seems to characterize the folding code of natural proteins. I discuss implications of these observations for the architecture of the sequence-structure map of natural proteins, the construction of random libraries of peptides, and the early evolution of the natural amino acid alphabet. PMID:25473967

  3. The Amino Acid Alphabet and the Architecture of the Protein Sequence-Structure Map. I. Binary Alphabets

    PubMed Central

    Ferrada, Evandro

    2014-01-01

    The correspondence between protein sequences and structures, or sequence-structure map, relates to fundamental aspects of structural, evolutionary and synthetic biology. The specifics of the mapping, such as the fraction of accessible sequences and structures, or the sequences' ability to fold fast, are dictated by the type of interactions between the monomers that compose the sequences. The set of possible interactions between monomers is encapsulated by the potential energy function. In this study, I explore the impact of the relative forces of the potential on the architecture of the sequence-structure map. My observations rely on simple exact models of proteins and random samples of the space of potential energy functions of binary alphabets. I adopt a graph perspective and study the distribution of viable sequences and the structures they produce, as networks of sequences connected by point mutations. I observe that the relative proportion of attractive, neutral and repulsive forces defines types of potentials, that induce sequence-structure maps of vastly different architectures. I characterize the properties underlying these differences and relate them to the structure of the potential. Among these properties are the expected number and relative distribution of sequences associated to specific structures and the diversity of structures as a function of sequence divergence. I study the types of binary potentials observed in natural amino acids and show that there is a strong bias towards only some types of potentials, a bias that seems to characterize the folding code of natural proteins. I discuss implications of these observations for the architecture of the sequence-structure map of natural proteins, the construction of random libraries of peptides, and the early evolution of the natural amino acid alphabet. PMID:25473967

  4. Trypsin inhibitors from ridged gourd (Luffa acutangula Linn.) seeds: purification, properties, and amino acid sequences.

    PubMed

    Haldar, U C; Saha, S K; Beavis, R C; Sinha, N K

    1996-02-01

    Two trypsin inhibitors, LA-1 and LA-2, have been isolated from ridged gourd (Luffa acutangula Linn.) seeds and purified to homogeneity by gel filtration followed by ion-exchange chromatography. The isoelectric point is at pH 4.55 for LA-1 and at pH 5.85 for LA-2. The Stokes radius of each inhibitor is 11.4 Å. The fluorescence emission spectrum of each inhibitor is similar to that of free tyrosine. The bimolecular rate constant of acrylamide quenching is 1.0 × 10^9 M^-1 sec^-1 for LA-1 and 0.8 × 10^9 M^-1 sec^-1 for LA-2, and that of K2HPO4 quenching is 1.6 × 10^11 M^-1 sec^-1 for LA-1 and 1.2 × 10^11 M^-1 sec^-1 for LA-2. Analysis of the circular dichroic spectra yields 40% alpha-helix and 60% beta-turn for LA-1 and 45% alpha-helix and 55% beta-turn for LA-2. Inhibitors LA-1 and LA-2 consist of 28 and 29 amino acid residues, respectively. They lack threonine, alanine, valine, and tryptophan. Both inhibitors strongly inhibit trypsin by forming enzyme-inhibitor complexes at a molar ratio of unity. A chemical modification study suggests the involvement of arginine of LA-1 and lysine of LA-2 in their reactive sites. The inhibitors are very similar in their amino acid sequences, and show sequence homology with other squash family inhibitors. PMID:8924202

  5. Microfluidic platform for isolating nucleic acid targets using sequence specific hybridization

    PubMed Central

    Wang, Jingjing; Morabito, Kenneth; Tang, Jay X.; Tripathi, Anubhav

    2013-01-01

    The separation of target nucleic acid sequences from biological samples has emerged as a significant process in today's diagnostics and detection strategies. In addition to the possible clinical applications, the fundamental understanding of target and sequence specific hybridization on surface modified magnetic beads is of high value. In this paper, we describe a novel microfluidic platform that utilizes a mobile magnetic field in static microfluidic channels, where single stranded DNA (ssDNA) molecules are isolated via nucleic acid hybridization. We first established efficient isolation of biotinylated capture probe (BP) using streptavidin-coated magnetic beads. Subsequently, we investigated the hybridization of target ssDNA with BP bound to beads and explained these hybridization kinetics using a dual-species kinetic model. The number of hybridized target ssDNA molecules was determined to be about 6.5 times less than that of BP on the bead surface, due to steric hindrance effects. The hybridization of target ssDNA with non-complementary BP bound to beads was also examined, and non-specific hybridization was found to be insignificant. Finally, we demonstrated highly efficient capture and isolation of target ssDNA in the presence of non-target ssDNA, where as little as 1% target ssDNA can be detected in a mixture. The microfluidic method described in this paper is broadly applicable, especially to point-of-care biological diagnostic platforms that require binding and separation of known target biomolecules, such as RNA, ssDNA, or protein. PMID:24404041

  6. Using reads to annotate the genome: influence of length, background distribution, and sequence errors on prediction capacity.

    PubMed

    Philippe, Nicolas; Boureux, Anthony; Bréhélin, Laurent; Tarhio, Jorma; Commes, Thérèse; Rivals, Eric

    2009-08-01

    Ultra high-throughput sequencing is used to analyse the transcriptome or interactome at unprecedented depth on a genome-wide scale. These techniques yield short sequence reads that are then mapped on a genome sequence to predict putatively transcribed or protein-interacting regions. We argue that factors such as background distribution, sequence errors, and read length impact on the prediction capacity of sequence census experiments. Here we suggest a computational approach to measure these factors and analyse their influence on both transcriptomic and epigenomic assays. This investigation provides new clues on both methodological and biological issues. For instance, by analysing chromatin immunoprecipitation read sets, we estimate that 4.6% of reads are affected by SNPs. We show that, although the nucleotide error probability is low, it significantly increases with the position in the sequence. Choosing a read length above 19 bp practically eliminates the risk of finding irrelevant positions, while above 20 bp the number of uniquely mapped reads decreases. With our procedure, we obtain 0.6% false positives among genomic locations. Hence, even rare signatures should identify biologically relevant regions, if they are mapped on the genome. This indicates that digital transcriptomics may help to characterize the wealth of yet undiscovered, low-abundance transcripts. PMID:19531739
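    The dependence of mapping specificity on read length can be illustrated by counting how many length-k windows of a reference occur exactly once in it. This is a simplified sketch, not the authors' procedure: it ignores the reverse strand, sequencing errors and SNPs, and the function name and toy sequence are assumptions for the example.

```python
from collections import Counter


def unique_fraction(genome, k):
    """Fraction of length-k windows of `genome` that occur exactly once in it."""
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    return sum(1 for km in kmers if counts[km] == 1) / len(kmers)


# Toy reference (hypothetical); a real genome would need a streaming k-mer counter.
genome = "ACGTACGTTT"
short_reads = unique_fraction(genome, 2)
long_reads = unique_fraction(genome, 5)
```

    On this toy sequence only one of the nine 2-mers is unique, whereas every 5-mer is, which mirrors the abstract's point that sufficiently long reads rarely map to irrelevant positions.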

  7. Using reads to annotate the genome: influence of length, background distribution, and sequence errors on prediction capacity

    PubMed Central

    Philippe, Nicolas; Boureux, Anthony; Bréhélin, Laurent; Tarhio, Jorma; Commes, Thérèse; Rivals, Éric

    2009-01-01

    Ultra high-throughput sequencing is used to analyse the transcriptome or interactome at unprecedented depth on a genome-wide scale. These techniques yield short sequence reads that are then mapped on a genome sequence to predict putatively transcribed or protein-interacting regions. We argue that factors such as background distribution, sequence errors, and read length impact on the prediction capacity of sequence census experiments. Here we suggest a computational approach to measure these factors and analyse their influence on both transcriptomic and epigenomic assays. This investigation provides new clues on both methodological and biological issues. For instance, by analysing chromatin immunoprecipitation read sets, we estimate that 4.6% of reads are affected by SNPs. We show that, although the nucleotide error probability is low, it significantly increases with the position in the sequence. Choosing a read length above 19 bp practically eliminates the risk of finding irrelevant positions, while above 20 bp the number of uniquely mapped reads decreases. With our procedure, we obtain 0.6% false positives among genomic locations. Hence, even rare signatures should identify biologically relevant regions, if they are mapped on the genome. This indicates that digital transcriptomics may help to characterize the wealth of yet undiscovered, low-abundance transcripts. PMID:19531739

  8. Characterization of N-glycosylation and amino acid sequence features of immunoglobulins from swine.

    PubMed

    Lopez, Paul G; Girard, Lauren; Buist, Marjorie; de Oliveira, Andrey Giovanni Gomes; Bodnar, Edward; Salama, Apolline; Soulillou, Jean-Paul; Perreault, Hélène

    2016-02-01

    The primary goal of this study was to develop a method to study the N-glycosylation of IgG from swine in order to detect epitopes containing N-glycolylneuraminic acid (Neu5Gc) and/or terminal galactose residues linked in α1-3 susceptible to cause xenograft-related problems. Samples of immunoglobulin were isolated from porcine serum using protein-A affinity chromatography. The eluate was then separated on electrophoretic gel, and bands corresponding to the N-glycosylated heavy chains were cut off the gel and subjected to tryptic digestion. Peptides and glycopeptides were separated by reversed phase liquid chromatography and fractions were collected for matrix-assisted laser desorption/ionization time-of-flight mass spectrometric (MALDI-TOF-MS) analysis. Overall no α1-3 galactose was detected, as demonstrated by complete susceptibility of terminal galactose residues to β-galactosidase digestion. Neu5Gc was detected on singly sialylated structures. Two major N-glycopeptides were found, EEQFNSTYR and EAQFNSTYR as determined by tandem MS (MS/MS), as previously reported by Butler et al. (Immunogenetics, 61, 2009, 209-230), who found 11 subclasses for porcine IgG. Out of the 11, ten include the sequence corresponding to EEQFNSTYR, and only one codes for EAQFNSTYR. In this study, glycosylation patterns associated with both chains were slightly different, in that EEQFNSTYR had a higher content of galactose. The last step of this study consisted of peptide-mapping the 11 reported porcine IgG sequences. Although there was considerable overlap, at least one unique tryptic peptide was found per IgG sequence. The workflow presented in this manuscript constitutes the first study to use MALDI-TOF-MS in the investigation of porcine IgG structural features. PMID:26586247
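    The peptide-mapping step (finding at least one tryptic peptide unique to each IgG sequence) can be sketched as an in silico digest followed by a set difference. The cleavage rule used here (cut after K or R unless the next residue is P) is a common simplification, and the two toy sequences are hypothetical: they merely embed the two glycopeptides named in the abstract and are not real porcine IgG sequences.

```python
def tryptic_digest(seq):
    """Cleave after K or R, except when the next residue is P (simplified rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        nxt = seq[i + 1] if i + 1 < len(seq) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])
    return peptides


def unique_peptides(seqs):
    """Map each sequence name to the tryptic peptides found in no other sequence."""
    digests = {name: set(tryptic_digest(s)) for name, s in seqs.items()}
    result = {}
    for name, peps in digests.items():
        others = set().union(*(p for n, p in digests.items() if n != name))
        result[name] = peps - others
    return result


# Hypothetical toy sequences differing by a single residue.
toy = {"chain_a": "TKPREEQFNSTYRVVSK", "chain_b": "TKPREAQFNSTYRVVSK"}
signature = unique_peptides(toy)
```

    Missed cleavages and modifications such as glycosylation are not modeled, so measured masses would still need to be matched against the predicted peptides separately.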

  9. Human Retroviruses and AIDS. A compilation and analysis of nucleic acid and amino acid sequences: I--II; III--V

    SciTech Connect

    Myers, G.; Korber, B.; Wain-Hobson, S.; Smith, R.F.; Pavlakis, G.N.

    1993-12-31

    This compendium and the accompanying floppy diskettes are the result of an effort to compile and rapidly publish all relevant molecular data concerning the human immunodeficiency viruses (HIV) and related retroviruses. The scope of the compendium and database is best summarized by the five parts that it comprises: (I) HIV and SIV Nucleotide Sequences; (II) Amino Acid Sequences; (III) Analyses; (IV) Related Sequences; and (V) Database Communications. Information within all the parts is updated at least twice in each year, which accounts for the modes of binding and pagination in the compendium.

  10. Prediction of Coal ash leaching behavior in acid mine water, comparison of laboratory and field studies

    SciTech Connect

    ANNA, KNOX

    2005-01-10

    Strongly alkaline fluidized bed combustion (FBC) ash is commonly used to control acid mine drainage (AMD) in West Virginia coal mines. Objectives include acid neutralization and immobilization of the primary AMD pollutants: iron, aluminum and manganese. The process has been successful in controlling AMD, though doubts remain regarding mobilization of other toxic elements present in the ash. In addition, AMD contains many toxic elements in low concentrations, and each mine produces AMD of widely varying quality. Predicting the effect of a particular ash on a given coal mine's drainage quality is therefore of particular interest. In this chapter we compare the results of a site-specific ash leaching procedure with two large-scale field applications of FBC ash. The results suggested a high degree of predictability for roughly half of the 25 chemical parameters and poor predictability for the remainder. Of these, seven parameters were successfully predicted on both sites: acidity, Al, B, Ba, Fe, Ni and Zn, while electrical conductivity, Ca, Cd, SO4, Pb and Sb were not successfully predicted on either site. Trends for the remaining elements (As, Ag, Be, Cu, Cr, Hg, Mg, Mn, pH, Se, Tl and V) were successfully predicted on one but not both mine sites.

  11. Predictive Value of 8 Genetic Loci for Serum Uric Acid Concentration

    PubMed Central

    Gunjača, Grgo; Boban, Mladen; Pehlić, Marina; Zemunik, Tatijana; Budimir, Danijela; Kolčić, Ivana; Lauc, Gordan; Rudan, Igor; Polašek, Ozren

    2010-01-01

    Aim To investigate the value of genomic information in prediction of individual serum uric acid concentrations. Methods Three population samples were investigated: from isolated Adriatic island communities of Vis (n = 980) and Korčula (n = 944), and from general population of the city of Split (n = 507). Serum uric acid concentration was correlated with the genetic risk score based on 8 previously described genes: PDZK1, GCKR, SLC2A9, ABCG2, LRRC16A, SLC17A3, SLC16A9, and SLC22A12, represented by a total of 16 single-nucleotide polymorphisms (SNP). The data were analyzed using classification and regression tree (CART) and general linear modeling. Results The most important variables for uric acid prediction with CART were genetic risk score in men and age in women. The percent of variance for any single SNP in predicting serum uric acid concentration varied from 0.0%-2.0%. The use of genetic risk score explained 0.1%-2.5% of uric acid variance in men and 3.9%-4.9% in women. The highest percent of variance was obtained when age, sex, and genetic risk score were used as predictors, with a total of 30.9% of variance in pooled analysis. Conclusion Despite overall low percent of explained variance, uric acid seems to be among the most predictive human quantitative traits based on the currently available SNP information. The use of genetic risk scores is a valuable approach in genetic epidemiology and increases the predictability of human quantitative traits based on genomic information compared with single SNP approach. PMID:20162742
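    A genetic risk score of the kind used in this study can be sketched as a sum of risk-allele counts over SNPs, with the share of trait variance it explains given by the squared correlation. The abstract does not state whether the score was weighted, so the optional `weights` argument, the function names, and the toy values below are assumptions for illustration.

```python
def genetic_risk_score(genotypes, weights=None):
    """Sum risk-allele counts (0, 1 or 2 per SNP), optionally weighted per SNP."""
    if any(g not in (0, 1, 2) for g in genotypes):
        raise ValueError("each genotype must be 0, 1 or 2 risk alleles")
    if weights is None:
        weights = [1.0] * len(genotypes)
    return sum(g * w for g, w in zip(genotypes, weights))


def r_squared(xs, ys):
    """Squared Pearson correlation: share of variance in ys explained linearly by xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


# Hypothetical individual genotyped at four SNPs.
score = genetic_risk_score([0, 1, 2, 1])
```

    Aggregating many small per-SNP effects into one score is what lets the combined predictor explain more variance than any single SNP, as the reported percentages indicate.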

  12. A framework for establishing predictive relationships between specific bacterial 16S rRNA sequence abundances and biotransformation rates.

    PubMed

    Helbling, Damian E; Johnson, David R; Lee, Tae Kwon; Scheidegger, Andreas; Fenner, Kathrin

    2015-03-01

    The rates at which wastewater treatment plant (WWTP) microbial communities biotransform specific substrates can differ by orders of magnitude among WWTP communities. Differences in taxonomic compositions among WWTP communities may predict differences in the rates of some types of biotransformations. In this work, we present a novel framework for establishing predictive relationships between specific bacterial 16S rRNA sequence abundances and biotransformation rates. We selected ten WWTPs with substantial variation in their environmental and operational metrics and measured the in situ ammonia biotransformation rate constants in nine of them. We isolated total RNA from samples from each WWTP and analyzed 16S rRNA sequence reads. We then developed multivariate models between the measured abundances of specific bacterial 16S rRNA sequence reads and the ammonia biotransformation rate constants. We constructed model scenarios that systematically explored the effects of model regularization, model linearity and non-linearity, and aggregation of 16S rRNA sequences into operational taxonomic units (OTUs) as a function of sequence dissimilarity threshold (SDT). A large percentage (greater than 80%) of model scenarios resulted in well-performing and significant models at intermediate SDTs of 0.13-0.14 and 0.26. The 16S rRNA sequences consistently selected into the well-performing and significant models at those SDTs were classified as Nitrosomonas and Nitrospira groups. We then extend the framework by applying it to the biotransformation rate constants of ten micropollutants measured in batch reactors seeded with the ten WWTP communities. We identified phylogenetic groups that were robustly selected into all well-performing and significant models constructed with biotransformation rates of isoproturon, propachlor, ranitidine, and venlafaxine. These phylogenetic groups can be used as predictive biomarkers of WWTP microbial community activity towards these specific

  13. Lactic acid production from potato peel waste by anaerobic sequencing batch fermentation using undefined mixed culture.

    PubMed

    Liang, Shaobo; McDonald, Armando G; Coats, Erik R

    2015-11-01

    Lactic acid (LA) is a necessary industrial feedstock for producing the bioplastic, polylactic acid (PLA), which is currently produced by pure culture fermentation of food carbohydrates. This work presents an alternative to produce LA from potato peel waste (PPW) by anaerobic fermentation in a sequencing batch reactor (SBR) inoculated with undefined mixed culture from a municipal wastewater treatment plant. A statistical design-of-experiments approach was employed with a set of 0.8 L SBRs, using gelatinized PPW at solids contents of 30 to 50 g L(-1) and solids retention times (SRT) of 2-4 days, to optimize yield and productivity. The maximum LA production yield of 0.25 g g(-1) PPW and highest productivity of 125 mg g(-1) d(-1) were achieved. A scale-up SBR trial using neat gelatinized PPW (at 80 g L(-1) solids content) at the 3 L scale was employed and the highest LA yield of 0.14 g g(-1) PPW and a productivity of 138 mg g(-1) d(-1) were achieved with a 1 d SRT. PMID:25708409

  14. Bacterial community compositions in sediment polluted by perfluoroalkyl acids (PFAAs) using Illumina high-throughput sequencing.

    PubMed

    Sun, Yajun; Wang, Tieyu; Peng, Xiawei; Wang, Pei; Lu, Yonglong

    2016-06-01

    The characterization of bacterial community compositions and the change in perfluoroalkyl acids (PFAAs) along a natural river distribution system were explored in the present study. Illumina high-throughput sequencing was used to explore bacterial community diversity and structure in sediment polluted by PFAAs from the Xiaoqing River, the area with concentrated fluorochemical facilities in China. The concentration of PFAAs was in the range of 8.44-465.60 ng/g dry weight (dw) in sediment. Perfluorooctanoic acid (PFOA) was the dominant PFAA in all samples, which accounted for 94.2 % of total PFAAs. High-level PFOA could lead to an obvious increase in relative abundance of Proteobacteria, ε-Proteobacteria, Thiobacillus, and Sulfurimonas and the decrease in relative abundance of other bacteria. Redundancy analysis revealed that PFOA played an important role in the formation of bacterial community, and PFOA at higher concentration could reduce the diversity of bacterial community. When the concentration of PFOA was below 100 ng/g dw in sediment, no significant effect on microbial community structure was observed. Thiobacillus and Sulfurimonas were positively correlated with the concentration of PFOA, suggesting that both genera were resistant to PFOA contamination. PMID:26780047

  15. Mass spectrometric detection of the amino acid sequence polymorphism of the hepatitis C virus antigen.

    PubMed

    Kaysheva, A L; Ivanov, Yu D; Frantsuzov, P A; Krohin, N V; Pavlova, T I; Uchaikin, V F; Konev, V A; Kovalev, O B; Ziborov, V S; Archakov, A I

    2016-03-01

    A method for detection and identification of the hepatitis C virus antigen (HCVcoreAg) in human serum with consideration for possible amino acid substitutions is proposed. The method is based on a combination of biospecific capturing and concentrating of the target protein on the surface of the chip for atomic force microscope (AFM chip) with subsequent protein identification by tandem mass spectrometric (MS/MS) analysis. Biospecific AFM-capturing of viral particles containing HCVcoreAg from serum samples was performed by use of AFM chips with monoclonal antibodies (anti-HCVcore) covalently immobilized on the surface. Biospecific complexes were registered and counted by AFM. Subsequent MS/MS analysis allowed reliable identification of HCVcoreAg in the complexes formed on the AFM chip surface. Analysis of MS/MS spectra, taking into account possible polymorphisms in the amino acid sequence of HCVcoreAg, enabled us to increase the number of identified peptides. PMID:26773170

  16. Accurate prediction of protein secondary structure and solvent accessibility by consensus combiners of sequence and structure information

    PubMed Central

    Pollastri, Gianluca; Martin, Alberto JM; Mooney, Catherine; Vullo, Alessandro

    2007-01-01

    Background Structural properties of proteins such as secondary structure and solvent accessibility contribute to three-dimensional structure prediction, not only in the ab initio case but also when homology information to known structures is available. Structural properties are also routinely used in protein analysis even when homology is available, largely because homology modelling is lower throughput than, say, secondary structure prediction. Nonetheless, predictors of secondary structure and solvent accessibility are virtually always ab initio. Results Here we develop high-throughput machine learning systems for the prediction of protein secondary structure and solvent accessibility that exploit homology to proteins of known structure, where available, in the form of simple structural frequency profiles extracted from sets of PDB templates. We compare these systems to their state-of-the-art ab initio counterparts, and with a number of baselines in which secondary structures and solvent accessibilities are extracted directly from the templates. We show that structural information from templates greatly improves secondary structure and solvent accessibility prediction quality, and that, on average, the systems significantly enrich the information contained in the templates. For sequence similarity exceeding 30%, secondary structure prediction quality is approximately 90%, close to its theoretical maximum, and 2-class solvent accessibility roughly 85%. Gains are robust with respect to template selection noise, and significant for marginal sequence similarity and for short alignments, supporting the claim that these improved predictions may prove beneficial beyond the case in which clear homology is available. Conclusion The predictive systems are publicly available at the address . PMID:17570843

  17. Peptide sequencing by using a combination of partial acid hydrolysis and fast-atom-bombardment mass spectrometry.

    PubMed Central

    De Angelis, F; Botta, M; Ceccarelli, S; Nicoletti, R

    1986-01-01

    To overcome the limit of the intensity of ions carrying sequence information in structural determinations of peptides by fast-atom-bombardment m.s., we have developed a method that consists in taking spectra of the peptide acid hydrolysates at different hydrolysis times. Peaks correspond to the oligomers arising from the peptide partial hydrolysis. The sequence can then be identified from the structurally overlapping fragments. PMID:2428356

  18. Canine preprorelaxin: nucleic acid sequence and localization within the canine placenta.

    PubMed

    Klonisch, T; Hombach-Klonisch, S; Froehlich, C; Kauffold, J; Steger, K; Steinetz, B G; Fischer, B

    1999-03-01

    Employing uteroplacental tissue at Day 35 of gestation, we determined the nucleic acid sequence of canine preprorelaxin using reverse transcription- and rapid amplification of cDNA ends-polymerase chain reaction. Canine preprorelaxin cDNA consisted of 534 base pairs encoding a protein of 177 amino acids with a signal peptide of 25 amino acids (aa), a B domain of 35 aa, a C domain of 93 aa, and an A domain of 24 aa. The putative receptor binding region in the N'-terminal part of the canine relaxin B domain GRDYVR contained two substitutions from the classical motif (E-->D and L-->Y). Canine preprorelaxin shared highest homology with porcine and equine preprorelaxin. Northern analysis revealed a 1-kilobase transcript present in total RNA of canine uteroplacental tissue but not of kidney tissue. Uteroplacental tissue from two bitches each at Days 30 and 35 of gestation were studied by in situ hybridization to localize relaxin mRNA. Immunohistochemistry for relaxin, cytokeratin, vimentin, and von Willebrand factor was performed on uteroplacental tissue at Day 30 of gestation. The basal cell layer at the core of the chorionic villi was devoid of relaxin mRNA and immunoreactive relaxin or vimentin but was immunopositive for cytokeratin and identified as cytotrophoblast cells. The cell layer surrounding the chorionic villi displayed specific hybridization signals for relaxin mRNA and immunoreactivity for relaxin and cytokeratin but not for vimentin, and was identified as syncytiotrophoblast. Those areas of the chorioallantoic tissue with most intense relaxin immunoreactivity were highly vascularized as demonstrated by immunoreactive von Willebrand factor expressed on vascular endothelium. The uterine glands and nonplacental uterine areas of the canine zonary girdle placenta were devoid of relaxin mRNA and relaxin. We conclude that the syncytiotrophoblast is the source of relaxin in the canine placenta. PMID:10026098
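
    The domain lengths quoted above can be cross-checked with a few lines of arithmetic. The sketch assumes that the 534 base pairs refer to the open reading frame including one stop codon; the abstract does not state this explicitly.

```python
# Domain lengths (amino acids) as quoted in the abstract.
domains = {"signal peptide": 25, "B domain": 35, "C domain": 93, "A domain": 24}

total_aa = sum(domains.values())   # residues in the full preprorelaxin
coding_bp = (total_aa + 1) * 3     # assumed: codons for all residues plus one stop codon

print(total_aa, coding_bp)
```

    Under that assumption the four domains add up to the stated 177 amino acids, and 178 codons account for the stated 534 base pairs.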

  19. Purification and partial amino acid sequence of the chloroplast cytochrome b-559.

    PubMed

    Widger, W R; Cramer, W A; Hermodson, M; Meyer, D; Gullifor, M

    1984-03-25

    The hydrophobic cytochrome b-559, purified from unstacked, ethanol-washed spinach thylakoid membranes, using extraction with 2% Triton X-100 in 4 M urea and three chromatographic steps in the presence of protease inhibitors, has a dominant band on sodium dodecyl sulfate-urea gels corresponding to Mr = 10,000. The yield of this preparation is 30-50% (5-10 mg) starting with 600 mg of chlorophyll. The heme content yields a calculated molecular weight of no more than 17,500/heme, and perhaps somewhat smaller after correction for impurities. The Mr = 10,000 band is stained by the tetramethylbenzidine-H2O2 heme reagent on lithium dodecyl sulfate gels run at 0 degrees C. The Mr = 10,000 protein, further separated by high performance liquid chromatography, contains a unique NH2 terminus that is not blocked, and the amino acid sequence for the first 27 residues is NH2-Ser-Gly-Ser-Thr-Gly-Glu-Arg-Ser-Phe-Ala-Asp-Ile-Ile-Thr-Ser-Ile-Arg-Tyr-Trp-Val-Ile-X-Ser-Ile-Thr-Ile-Pro...COOH. Approximately 55% of the amino acids are hydrophobic, based on amino acid analysis of the Mr = 10,000 peptide, which also indicated the presence of at least one histidine. Only one cytochrome b-559 component could be identified, whose yield indicated that it arises from a single b-559 protein in chloroplasts corresponding to the in situ high potential cytochrome of the chloroplast photosystem II. PMID:6706983

  20. Sequence-Specific Electrical Purification of Nucleic Acids with Nanoporous Gold Electrodes.

    PubMed

    Daggumati, Pallavi; Appelt, Sandra; Matharu, Zimple; Marco, Maria L; Seker, Erkin

    2016-06-22

    Nucleic-acid-based biosensors have enabled rapid and sensitive detection of pathogenic targets; however, these devices often require purified nucleic acids for analysis since the constituents of complex biological fluids adversely affect sensor performance. This purification step is typically performed outside the device, thereby increasing sample-to-answer time and introducing contaminants. We report a novel approach using a multifunctional matrix, nanoporous gold (np-Au), which enables both detection of specific target sequences in a complex biological sample and their subsequent purification. The np-Au electrodes modified with 26-mer DNA probes (via thiol-gold chemistry) enabled sensitive detection and capture of complementary DNA targets in the presence of complex media (fetal bovine serum) and other interfering DNA fragments in the range of 50-1500 base pairs. Upon capture, the noncomplementary DNA fragments and serum constituents of varying sizes were washed away. Finally, the surface-bound DNA-DNA hybrids were released by electrochemically cleaving the thiol-gold linkage, and the hybrids were iontophoretically eluted from the nanoporous matrix. The optical and electrophoretic characterization of the analytes before and after the detection-purification process revealed that low target DNA concentrations (80 pg/μL) can be successfully detected in complex biological fluids and subsequently released to yield pure hybrids free of polydisperse digested DNA fragments and serum biomolecules. Taken together, this multifunctional platform is expected to enable seamless integration of detection and purification of nucleic acid biomarkers of pathogens and diseases in miniaturized diagnostic devices. PMID:27244455

  1. Multi-criterial coding sequence prediction. Combination of GeneMark with two novel, coding-character specific quantities.

    PubMed

    Almirantis, Yannis; Nikolaou, Christoforos

    2005-10-01

    This work applies two recently formulated quantities, strongly correlated with the coding character of a sequence, as an additional "module" on GeneMark, in a three-criterial method. The difference in the statistical approaches implicated by the methods combined here, is expected to contribute to an efficient assignment of functionality to unannotated genomic sequences. The developed combined algorithm is used to fractionalize a collection of GeneMark-predicted exons into sub-collections of different expectation to be coding. A further modification of the algorithm allows for the assignment of an improved estimation of the probability to be coding, to GeneMark-predicted exons. This is on the basis of a suitable training set of GeneMark-predicted exons of known functionality. PMID:15809100

  2. Negative Ion In-Source Decay Matrix-Assisted Laser Desorption/Ionization Mass Spectrometry for Sequencing Acidic Peptides

    NASA Astrophysics Data System (ADS)

    McMillen, Chelsea L.; Wright, Patience M.; Cassady, Carolyn J.

    2016-05-01

    Matrix-assisted laser desorption/ionization (MALDI) in-source decay was studied in the negative ion mode on deprotonated peptides to determine its usefulness for obtaining extensive sequence information for acidic peptides. Eight biological acidic peptides, ranging in size from 11 to 33 residues, were studied by negative ion mode ISD (nISD). The matrices 2,5-dihydroxybenzoic acid, 2-aminobenzoic acid, 2-aminobenzamide, 1,5-diaminonaphthalene, 5-amino-1-naphthol, 3-aminoquinoline, and 9-aminoacridine were used with each peptide. Optimal fragmentation was produced with 1,5-diaminonaphthalene (DAN), and extensive sequence informative fragmentation was observed for every peptide except hirudin(54-65). Cleavage at the N-Cα bond of the peptide backbone, producing c' and z' ions, was dominant for all peptides. Cleavage of the N-Cα bond N-terminal to proline residues was not observed. The formation of c and z ions is also found in electron transfer dissociation (ETD), electron capture dissociation (ECD), and positive ion mode ISD, which are considered to be radical-driven techniques. Oxidized insulin chain A, which has four highly acidic oxidized cysteine residues, had less extensive fragmentation. This peptide also exhibited the only charged localized fragmentation, with more pronounced product ion formation adjacent to the highly acidic residues. In addition, spectra were obtained by positive ion mode ISD for each protonated peptide; more sequence informative fragmentation was observed via nISD for all peptides. Three of the peptides studied had no product ion formation in ISD, but extensive sequence informative fragmentation was found in their nISD spectra. The results of this study indicate that nISD can be used to readily obtain sequence information for acidic peptides.

  3. Using electromagnetic induction technology to predict volatile fatty acid, source area differences

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Subsurface sampling techniques have been adapted to measure manure accumulation on feedlot surface. Objectives of this study were to determine if sensor data could be used to predict differences in volatile fatty acids (VFA) and other volatiles produced on the feedlot surface three days following a...

  4. Near-infrared (NIR) prediction of trans-fatty acids in ground cereal foods

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Near infrared (NIR) reflectance spectroscopy was evaluated as a rapid method for prediction of trans-fatty acid content in ground cereal products without the need for oil extraction. NIR spectra (400-2498 nm) of ground cereal products were obtained with a dispersive NIR spectrometer and correlated ...

  5. Homology analyses of the protein sequences of fatty acid synthases from chicken liver, rat mammary gland, and yeast

    SciTech Connect

    Chang, Soo-Ik; Hammes, G.G.

    1989-11-01

    Homology analyses of the protein sequences of chicken liver and rat mammary gland fatty acid synthases were carried out. The amino acid sequences of the chicken and rat enzymes are 67% identical. If conservative substitutions are allowed, 78% of the amino acids are matched. A region of low homology exists between the functional domains, in particular around amino acid residues 1059-1264 of the chicken enzyme. Homologies between the active sites of chicken and rat and of chicken and yeast enzymes have been analyzed by an alignment method. A high degree of homology exists between the active sites of the chicken and rat enzymes. However, the chicken and yeast enzymes show a lower degree of homology. The NADPH-binding dinucleotide folds of the β-ketoacyl reductase and the enoyl reductase sites were identified by comparison with a known consensus sequence for the NADP- and FAD-binding dinucleotide folds. The active sites of all of the enzymes are primarily in hydrophobic regions of the protein. This study suggests that the genes for the functional domains of fatty acid synthase were originally separated, and these genes were connected to each other by using different connecting nucleotide sequences in different species. An alternative explanation for the differences in rat and chicken is a common ancestry and mutations in the joining regions during evolution.
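
    The two figures quoted above (percent identical, and percent matched when conservative substitutions are allowed) can be computed from a pairwise alignment roughly as follows. This is a minimal sketch: the conservative-substitution groups and the toy aligned sequences are illustrative assumptions, not the scheme or data used in the study.

```python
# Illustrative (assumed) groups of residues treated as conservative substitutions.
CONSERVATIVE_GROUPS = ["ILVM", "FWY", "KRH", "DE", "ST", "NQ"]

def aligned_pairs(a, b):
    """Aligned residue pairs, skipping columns where either sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]

def percent_identity(a, b):
    pairs = aligned_pairs(a, b)
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def percent_similarity(a, b):
    """Identity plus conservative substitutions, as a percentage of aligned pairs."""
    def similar(x, y):
        return x == y or any(x in g and y in g for g in CONSERVATIVE_GROUPS)
    pairs = aligned_pairs(a, b)
    return 100.0 * sum(similar(x, y) for x, y in pairs) / len(pairs)

# Hypothetical toy alignment ('-' marks a gap).
seq_a = "MKVLDE-T"
seq_b = "MRILEEAS"
print(percent_identity(seq_a, seq_b), percent_similarity(seq_a, seq_b))
```

    Different choices of substitution groups or of gap handling change the second figure, so values such as 78% are only comparable between studies that use the same convention.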

  6. Predicting membrane protein types by incorporating protein topology, domains, signal peptides, and physicochemical properties into the general form of Chou's pseudo amino acid composition.

    PubMed

    Chen, Yen-Kuang; Li, Kuo-Bin

    2013-02-01

    The type information of un-annotated membrane proteins provides an important hint for their biological functions. The experimental determination of membrane protein types, despite being more accurate and reliable, is not always feasible due to the costly laboratory procedures, thereby creating a need for the development of bioinformatics methods. This article describes a novel computational classifier for the prediction of membrane protein types using proteins' sequences. The classifier, comprising a collection of one-versus-one support vector machines, makes use of the following sequence attributes: (1) the cationic patch sizes, the orientation, and the topology of transmembrane segments; (2) the amino acid physicochemical properties; (3) the presence of signal peptides or anchors; and (4) the specific protein motifs. A new voting scheme was implemented to cope with the multi-class prediction. Both the training and the testing sequences were collected from SwissProt. Homologous proteins were removed such that there is no pair of sequences left in the datasets with a sequence identity higher than 40%. The performance of the classifier was evaluated by jackknife cross-validation and an independent testing experiment. Results show that the proposed classifier outperforms earlier predictors in prediction accuracy in seven of the eight membrane protein types. The overall accuracy was increased from 78.3% to 88.2%. Unlike earlier approaches, which largely depend on position-specific substitution matrices and amino acid compositions, most of the sequence attributes implemented in the proposed classifier are supported by evidence in the literature. The classifier has been deployed as a web server and can be accessed at http://bsaltools.ym.edu.tw/predmpt. PMID:23137835
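
    For readers unfamiliar with one-versus-one classification, the sketch below shows the conventional majority-vote way of combining pairwise decisions into a multi-class prediction. It is not the new voting scheme mentioned in the abstract, which is not described there; the pairwise decision function, class names, and prototype values are hypothetical stand-ins for trained support vector machines.

```python
from collections import Counter
from itertools import combinations

def one_vs_one_predict(classes, pairwise_decide, x):
    """Plain majority vote over all class pairs; ties go to the class listed first."""
    votes = Counter()
    for a, b in combinations(classes, 2):
        votes[pairwise_decide(a, b, x)] += 1
    return max(classes, key=lambda c: votes[c])

# Hypothetical stand-in for trained pairwise SVMs: pick the nearer class prototype.
PROTOTYPES = {"type I": 0.0, "type II": 1.0, "GPI-anchor": 2.0}

def nearest_prototype(a, b, x):
    return a if abs(x - PROTOTYPES[a]) <= abs(x - PROTOTYPES[b]) else b

predicted = one_vs_one_predict(list(PROTOTYPES), nearest_prototype, 1.2)
n_pairwise_for_eight_types = len(list(combinations(range(8), 2)))
print(predicted, n_pairwise_for_eight_types)
```

    With eight membrane protein types, a one-versus-one design needs 28 pairwise classifiers, which is why the way their votes are combined matters for overall accuracy.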

  7. Fusion protein of the paramyxovirus simian virus 5: nucleotide sequence of mRNA predicts a highly hydrophobic glycoprotein.

    PubMed Central

    Paterson, R G; Harris, T J; Lamb, R A

    1984-01-01

    The nucleotide sequence of the mRNA coding for the fusion glycoprotein (F) of the paramyxovirus, simian virus 5, has been obtained. There is a single large open reading frame on the mRNA that encodes a protein of 529 amino acids with a molecular weight of 56,531. The proteolytic cleavage/activation site of F, to yield F2 and F1, contains five arginine residues. Six potential glycosylation sites were identified in the protein, two on F2 and four on F1. The deduced amino acid sequence indicates that F is extensively hydrophobic over the length of the polypeptide chain. Three regions are very hydrophobic and could interact directly with membranes: these are the NH2-terminal putative signal peptide, the COOH-terminal putative membrane anchorage domain, and the NH2-terminal region of F1. PMID:6093114
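
    Hydrophobic regions of the kind described above are commonly located with a sliding-window hydropathy profile. The sketch below uses the Kyte-Doolittle scale as an assumption; the abstract does not say which scale or window the authors used, and the window length of 9 and the toy sequence are illustrative.

```python
# Kyte-Doolittle hydropathy values (assumed scale; not stated in the abstract).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(seq, window=9):
    """Mean hydropathy of every full window along the sequence."""
    values = [KD[r] for r in seq]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

# Hypothetical sequence: a hydrophobic stretch flanked by charged residues.
toy = "KKDEE" + "LIVLAVIILAL" + "RRDEK"
profile = hydropathy_profile(toy)
peak_start = max(range(len(profile)), key=profile.__getitem__)
print(peak_start, round(profile[peak_start], 2))
```

    Peaks in such a profile flag candidate signal peptides or membrane anchors, but they are only suggestive and do not replace experimental evidence.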

  8. Computational Framework for Prediction of Peptide Sequences That May Mediate Multiple Protein Interactions in Cancer-Associated Hub Proteins

    PubMed Central

    Sarkar, Debasree; Patra, Piya; Ghosh, Abhirupa; Saha, Sudipto

    2016-01-01

    A considerable proportion of protein-protein interactions (PPIs) in the cell are estimated to be mediated by very short peptide segments that approximately conform to specific sequence patterns known as linear motifs (LMs), often present in the disordered regions in the eukaryotic proteins. These peptides have been found to interact with low affinity and are able to bind to multiple interactors, thus playing an important role in the PPI networks involving date hubs. In this work, PPI data and a de novo motif identification method (MEME) were used to identify such peptides in three cancer-associated hub proteins—MYC, APC and MDM2. The peptides corresponding to the significant LMs identified for each hub protein were aligned, the overlapping regions across these peptides being termed overlapping linear peptides (OLPs). These OLPs were thus predicted to be responsible for multiple PPIs of the corresponding hub proteins and a scoring system was developed to rank them. We predicted six OLPs in MYC and five OLPs in MDM2 that scored higher than OLP predictions from randomly generated protein sets. Two OLP sequences from the C-terminal of MYC were predicted to bind with FBXW7, a component of an E3 ubiquitin-protein ligase complex involved in proteasomal degradation of MYC. Similarly, we identified peptides in the C-terminal of MDM2 interacting with FKBP3, which has a specific role in auto-ubiquitinylation of MDM2. The peptide sequences predicted in MYC and MDM2 look promising for designing orthosteric inhibitors against possible disease-associated PPIs. Since these OLPs can interact with other proteins as well, these inhibitors should be specific to the targeted interactor to prevent undesired side-effects. This computational framework has been designed to predict and rank the peptide regions that may mediate multiple PPIs and can be applied to other disease-associated date hub proteins for prediction of novel therapeutic targets of small molecule PPI modulators. PMID

  9. The complete amino acid sequence of the A-chain of human plasma alpha 2HS-glycoprotein.

    PubMed

    Yoshioka, Y; Gejyo, F; Marti, T; Rickli, E E; Bürgi, W; Offner, G D; Troxler, R F; Schmid, K

    1986-02-01

    Normal human plasma alpha 2HS-glycoprotein has earlier been shown to be comprised of two polypeptide chains. Recently, the amino acid and carbohydrate sequences of the short chain were elucidated (Gejyo, F., Chang, J.-L., Bürgi, W., Schmid, K., Offner, G. D., Troxler, R.F., van Halbeck, H., Dorland, L., Gerwig, G. J., and Vliegenthart, J.F.G. (1983) J. Biol. Chem. 258, 4966-4971). In the present study, the amino acid sequence of the long chain of this protein, designated A-chain, was determined and found to consist of 282 amino acid residues. Twenty-four amino acid doublets were found; the most abundant of these are Pro-Pro and Ala-Ala which each occur five times. Of particular interest is the presence of three Gly-X-Pro and one Gly-Pro-X sequences that are characteristic of the repeating sequences of collagens. Chou-Fasman evaluation of the secondary structure suggested that the A-chain contains 29% alpha-helix, 24% beta-pleated sheet, and 26% reverse turns and, thus, approximately 80% of the polypeptide chain may display ordered structure. Four glycosylation sites were identified. The two N-glycosidic oligosaccharides were found in the center region (residues 138 and 158), whereas the two O-glycosidic heterosaccharides, both linked to threonine (residues 238 and 252), occur within the carboxyl-terminal region. The N-glycans are linked to Asn residues in beta-turns, while the O-glycans are located in short random segments. Comparison of the sequence of the amino- and carboxyl-terminal 30 residues with protein sequences in a data bank demonstrated that the A-chain is not significantly related to any known proteins. However, the proline-rich carboxyl-terminal region of the A-chain displays some sequence similarity to collagens and the collagen-like domains of complement subcomponent C1q. PMID:3944104
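
    Counts of the kind reported above (doublets such as Pro-Pro, and Gly-X-Pro or Gly-Pro-X triplets) are straightforward to obtain from a one-letter sequence. The sketch assumes that a "doublet" means two identical adjacent residues and counts overlapping occurrences; the toy sequence is hypothetical and is not the A-chain sequence.

```python
import re
from collections import Counter

def count_doublets(seq):
    """Count adjacent identical residue pairs (overlapping), e.g. 'PP' or 'AA'."""
    return Counter(a + b for a, b in zip(seq, seq[1:]) if a == b)

def find_motif(seq, pattern):
    """0-based start positions of a regex motif; the lookahead allows overlaps."""
    return [m.start() for m in re.finditer(f"(?=({pattern}))", seq)]

# Hypothetical toy sequence in one-letter code.
toy = "GAPPAAGPKPPP"
doublets = count_doublets(toy)
gly_x_pro = find_motif(toy, "G.P")   # Gly-X-Pro
gly_pro_x = find_motif(toy, "GP.")   # Gly-Pro-X
print(dict(doublets), gly_x_pro, gly_pro_x)
```

    Whether overlapping runs such as Pro-Pro-Pro count as one doublet or two is a convention; the abstract does not say which was used.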

  10. Molecular cloning and expression of partial cDNAs and deduced amino acid sequence of a carboxyl-terminal fragment of human apolipoprotein B-100.

    PubMed Central

    Wei, C F; Chen, S H; Yang, C Y; Marcel, Y L; Milne, R W; Li, W H; Sparrow, J T; Gotto, A M; Chan, L

    1985-01-01

    Apolipoprotein (apo) B-100 cDNAs were identified in a human liver cDNA library cloned in the expression vector lambda gt11. The beta-galactosidase-apoB-100 fusion protein was detected by two independently produced low density lipoprotein polyclonal antisera and by three apoB-100 monoclonal antibodies that crossreact with apoB-74. It was not recognized by two apoB-100 monoclonal antibodies that crossreact with apoB-26. The longest clone, lambda B8, was completely sequenced. It contains a 2.8-kilobase DNA fragment containing the codons for the carboxyl-terminal 836 amino acid residues of apo-B-100, as well as the 3' untranslated region of apoB-100 mRNA. We have thus mapped apoB-74 to the carboxyl-terminal portion of apoB-100. The deduced amino acid sequence of the cloned DNA matches the sequences of 14 apoB-100 peptides determined in our laboratory. Minor differences in amino acid sequence were noted in three of the peptides, suggesting polymorphism of apoB-100 at the protein and DNA levels. Secondary structure predictions reveal an unusual pattern for apolipoproteins, consisting of beta-structure (24%), alpha-helical content (33%), and random structure (30%). Ten amphipathic helical regions of 10-24 residues were identified. This carboxyl-terminal fragment of apoB-100 is considerably more hydrophobic than other apolipoproteins with known structure. Its lipid binding regions might include stretches of highly hydrophobic beta-sheets as well as amphipathic helices. Our findings on apoB structure might be important for understanding the role of apoB-100-containing lipoproteins in atherosclerosis. PMID:2932736

  11. List of Predicted Simple Sequence Repeats from Sugar Beet GenBank Accessions

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Beta vulgaris ESTs from GenBank as of January 2005 collapsed into 13,618 unique clusters (4,023 Tentative Consensus sequences, 9,595 singletons), and 35% were contributed via work partially supported through the BSDF. These sequences were parsed through SSR-Primer software for discovering potential...

  12. cDNA-derived amino acid sequence of rat mitochondrial 3-oxoacyl-CoA thiolase with no transient presequence: structural relationship with peroxisomal isozyme.

    PubMed Central

    Arakawa, H; Takiguchi, M; Amaya, Y; Nagata, S; Hayashi, H; Mori, M

    1987-01-01

    The sorting of homologous proteins between two separate intracellular organelles is a major unsolved problem. 3-Oxoacyl-CoA thiolase is localized in mitochondria and peroxisomes, and provides a good system for studying this problem. Unlike most mitochondrial matrix proteins, mitochondrial 3-oxoacyl-CoA thiolase in rats is synthesized with no transient presequence and possesses information for mitochondrial targeting and import in the mature protein. Two overlapping cDNA clones contained an open reading frame encoding a polypeptide of 397 amino acid residues (predicted Mr = 41,868), a 5' untranslated sequence of 164 bp, a 3' untranslated sequence of 264 bp and a poly(A) tract. The amino acid sequence of the mitochondrial thiolase is 37% identical with that of the mature portion of rat peroxisomal 3-oxoacyl-CoA thiolase precursor. These results suggest that the two thiolases have a common origin and obtained information for targeting to their respective organelles during evolution. Two portions in the mitochondrial thiolase that may serve as a mitochondrial targeting signal are presented. PMID:3038520

  13. Analysis of the functional domains of biosynthetic threonine deaminase by comparison of the amino acid sequences of three wild-type alleles to the amino acid sequence of biodegradative threonine deaminase.

    PubMed

    Taillon, B E; Little, R; Lawther, R P

    1988-03-31

    The nucleotide sequence of the gene, ilvA, for biosynthetic threonine deaminase (Tda) from Salmonella typhimurium was determined. The deduced amino acid sequence was compared with the deduced amino acid sequences of the biosynthetic Tda from Escherichia coli K-12 (ilvA) and Saccharomyces cerevisiae (ILV1) and the biodegradative Tda from E. coli K-12 (tdc). The comparison indicated the presence of two types of blocks of homologous amino acids. The first type of homology is in the N-terminal portion of all four isozymes of Tda and probably indicates amino acids involved in catalysis. The second type of homology is found in the C-terminal portion of the three biosynthetic isozymes and presumably is involved in either (i) the binding or interaction of the allosteric effector isoleucine with the enzyme, or (ii) subunit interactions. The sites of amino acid changes of two E. coli K-12 ilvA alleles with altered response to isoleucine are consistent with the conclusion that the C-terminal portion of biosynthetic Tda is involved in allosteric regulation. PMID:3290055

  14. Multiscale Reactive Molecular Dynamics for Absolute pKa Predictions and Amino Acid Deprotonation.

    PubMed

    Nelson, J Gard; Peng, Yuxing; Silverstein, Daniel W; Swanson, Jessica M J

    2014-07-01

    Accurately calculating a weak acid's pKa from simulations remains a challenging task. We report a multiscale theoretical approach to calculate the free energy profile for acid ionization, resulting in accurate absolute pKa values in addition to insights into the underlying mechanism. Importantly, our approach minimizes empiricism by mapping electronic structure data (QM/MM forces) into a reactive molecular dynamics model capable of extensive sampling. Consequently, the bulk property of interest (the absolute pKa) is the natural consequence of the model, not a parameter used to fit it. This approach is applied to create reactive models of aspartic and glutamic acids. We show that these models predict the correct pKa values and provide ample statistics to probe the molecular mechanism of dissociation. This analysis shows changes in the solvation structure and Zundel-dominated transitions between the protonated acid, contact ion pair, and bulk solvated excess proton. PMID:25061442
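
    As background to the link between free energy and pKa, the sketch below applies the standard thermodynamic relation pKa = ΔG° / (RT ln 10) to a standard-state deprotonation free energy. This is not the authors' procedure, which derives the pKa from a simulated free energy profile; the function name and the choice of kcal/mol units are assumptions made for illustration.

```python
import math

# Gas constant in kcal/(mol*K).
R_KCAL = 1.987204e-3

def pka_from_free_energy(delta_g_kcal: float, temperature_k: float = 298.15) -> float:
    """Convert a standard deprotonation free energy (kcal/mol) to a pKa.

    Uses pKa = dG / (R * T * ln 10). Assumes delta_g_kcal already refers
    to the 1 M standard state; no simulation-specific corrections are applied.
    """
    if temperature_k <= 0:
        raise ValueError("temperature must be positive")
    return delta_g_kcal / (R_KCAL * temperature_k * math.log(10))
```

    At 298.15 K the factor RT ln 10 is roughly 1.36 kcal/mol, so each pKa unit corresponds to about that much free energy.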

  15. Computational scheme for the prediction of metal ion binding by a soil fulvic acid

    USGS Publications Warehouse

    Marinsky, J.A.; Reddy, M.M.; Ephraim, J.H.; Mathuthu, A.S.

    1995-01-01

    The dissociation and metal ion binding properties of a soil fulvic acid have been characterized. Information thus gained was used to compensate for salt and site heterogeneity effects in metal ion complexation by the fulvic acid. An earlier computational scheme has been modified by incorporating an additional step which improves the accuracy of metal ion speciation estimates. An algorithm is employed for the prediction of metal ion binding by organic acid constituents of natural waters (once the organic acid is characterized in terms of functional group identity and abundance). The approach discussed here, currently used with a spreadsheet program on a personal computer, is conceptually envisaged to be compatible with computer programs available for ion binding by inorganic ligands in natural waters.

  16. Predicting protein subnuclear location with optimized evidence-theoretic K-nearest classifier and pseudo amino acid composition

    SciTech Connect

    Shen Hongbin; Chou Kuochen (e-mail: kchou@san.rr.com)

    2005-11-25

    The nucleus is the brain of eukaryotic cells that guides the life processes of the cell by issuing key instructions. For an in-depth understanding of the biochemical processes of the nucleus, knowledge of the localization of nuclear proteins is very important. With the avalanche of protein sequences generated in the post-genomic era, it is highly desirable to develop an automated method for rapidly annotating the subnuclear locations of the many newly found nuclear protein sequences, so that they can be used in a timely manner for basic research and drug discovery. In view of this, a novel approach is developed for predicting the protein subnuclear location. It is characterized by introducing a powerful classifier, the optimized evidence-theoretic K-nearest classifier, and using the pseudo amino acid composition [K.C. Chou, PROTEINS: Structure, Function, and Genetics, 43 (2001) 246], which can incorporate a considerable amount of sequence-order effects, to represent protein samples. As a demonstration, identifications were performed for 370 nuclear proteins among the following 9 subnuclear locations: (1) Cajal body, (2) chromatin, (3) heterochromatin, (4) nuclear diffuse, (5) nuclear pore, (6) nuclear speckle, (7) nucleolus, (8) PcG body, and (9) PML body. The overall success rates thus obtained by both the re-substitution test and jackknife cross-validation test are significantly higher than those by existing classifiers on the same working dataset. It is anticipated that the powerful approach may also become a useful high throughput vehicle to bridge the huge gap occurring in the post-genomic era between the number of gene sequences in databases and the number of gene products that have been functionally characterized. The OET-KNN classifier will be available at www.pami.sjtu.edu.cn/people/hbshen.
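
    The general shape of such a pipeline (sequence to feature vector, nearest-neighbor classification, jackknife evaluation) can be sketched as follows. This is a deliberately simplified stand-in: it uses the plain 20-component amino acid composition rather than the pseudo amino acid composition, and an ordinary majority-vote K-nearest-neighbor rule rather than the optimized evidence-theoretic classifier (OET-KNN). All function names are hypothetical.

```python
from collections import Counter
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq: str) -> list[float]:
    """Fraction of each of the 20 standard amino acids in the sequence."""
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]

def knn_predict(train, query, k=1):
    """Majority vote among the k training vectors closest to the query.

    train is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def jackknife_accuracy(samples, k=1):
    """Leave-one-out (jackknife) accuracy over (sequence, label) pairs.

    Each sample is predicted from all the others, then scored.
    """
    vectors = [(composition(seq), label) for seq, label in samples]
    correct = 0
    for i, (vec, label) in enumerate(vectors):
        rest = vectors[:i] + vectors[i + 1:]  # hold out sample i
        if knn_predict(rest, vec, k) == label:
            correct += 1
    return correct / len(vectors)
```

    In the jackknife test each protein is held out once and classified using the remaining proteins, so no sample is ever used to predict itself, unlike the re-substitution test.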

  17. The developmental transcriptome landscape of bovine skeletal muscle defined by Ribo-Zero ribonucleic acid sequencing.

    PubMed

    Sun, X; Li, M; Sun, Y; Cai, H; Li, R; Wei, X; Lan, X; Huang, Y; Lei, C; Chen, H

    2015-12-01

    Ribonucleic acid sequencing (RNA-Seq) libraries are normally prepared with oligo(dT) selection of poly(A)+ mRNA, but it depends on intact total RNA samples. Recent studies have described Ribo-Zero technology, a novel method that can capture both poly(A)+ and poly(A)- transcripts from intact or fragmented RNA samples. We report here the first application of Ribo-Zero RNA-Seq for the analysis of the bovine embryonic, neonatal, and adult skeletal muscle whole transcriptome at an unprecedented depth. Overall, 19,893 genes were found to be expressed, with a high correlation of expression levels between the calf and the adult. Hundreds of genes were found to be highly expressed in the embryo and decreased at least 10-fold after birth, indicating their potential roles in embryonic muscle development. In addition, we present for the first time the analysis of global transcript isoform discovery in bovine skeletal muscle and identified 36,694 transcript isoforms. Transcriptomic data were also analyzed to unravel sequence variations; 185,036 putative SNP and 12,428 putative short insertions-deletions (InDel) were detected. Specifically, many stop-gain, stop-loss, and frameshift mutations were identified that probably change the relative protein production and sequentially affect the gene function. Notably, the numbers of stage-specific transcripts, alternative splicing events, SNP, and InDel were greater in the embryo than in the calf and the adult, suggesting that gene expression is most active in the embryo. The resulting view of the transcriptome at a single-base resolution greatly enhances the comprehensive transcript catalog and uncovers the global trends in gene expression during bovine skeletal muscle development. PMID:26641174

  18. Method for the detection of specific nucleic acid sequences by polymerase nucleotide incorporation

    DOEpatents

    Castro, Alonso

    2004-06-01

    A method for rapid and efficient detection of a target DNA or RNA sequence is provided. A primer having a 3'-hydroxyl group at one end and having a sequence of nucleotides sufficiently homologous with an identifying sequence of nucleotides in the target DNA is selected. The primer is hybridized to the identifying sequence of nucleotides on the DNA or RNA sequence and a reporter molecule is synthesized on the target sequence by progressively binding complementary nucleotides to the primer, where the complementary nucleotides include nucleotides labeled with a fluorophore. Fluorescence emitted by fluorophores on single reporter molecules is detected to identify the target DNA or RNA sequence.

  19. UET: a database of evolutionarily-predicted functional determinants of protein sequences that cluster as functional sites in protein structures.

    PubMed

    Lua, Rhonald C; Wilson, Stephen J; Konecki, Daniel M; Wilkins, Angela D; Venner, Eric; Morgan, Daniel H; Lichtarge, Olivier

    2016-01-01

    The structure and function of proteins underlie most aspects of biology and their mutational perturbations often cause disease. To identify the molecular determinants of function as well as targets for drugs, it is central to characterize the important residues and how they cluster to form functional sites. The Evolutionary Trace (ET) achieves this by ranking the functional and structural importance of the protein sequence positions. ET uses evolutionary distances to estimate functional distances and correlates genotype variations with those in the fitness phenotype. Thus, ET ranks are worse for sequence positions that vary among evolutionarily closer homologs but better for positions that vary mostly among distant homologs. This approach identifies functional determinants, predicts function, guides the mutational redesign of functional and allosteric specificity, and interprets the action of coding sequence variations in proteins, people and populations. Now, the UET database offers pre-computed ET analyses for the protein structure databank, and on-the-fly analysis of any protein sequence. A web interface retrieves ET rankings of sequence positions and maps results to a structure to identify functionally important regions. This UET database integrates several ways of viewing the results on the protein sequence or structure and can be found at http://mammoth.bcm.tmc.edu/uet/. PMID:26590254

  20. UET: a database of evolutionarily-predicted functional determinants of protein sequences that cluster as functional sites in protein structures

    PubMed Central

    Lua, Rhonald C.; Wilson, Stephen J.; Konecki, Daniel M.; Wilkins, Angela D.; Venner, Eric; Morgan, Daniel H.; Lichtarge, Olivier

    2016-01-01

    The structure and function of proteins underlie most aspects of biology and their mutational perturbations often cause disease. To identify the molecular determinants of function as well as targets for drugs, it is central to characterize the important residues and how they cluster to form functional sites. The Evolutionary Trace (ET) achieves this by ranking the functional and structural importance of the protein sequence positions. ET uses evolutionary distances to estimate functional distances and correlates genotype variations with those in the fitness phenotype. Thus, ET ranks are worse for sequence positions that vary among evolutionarily closer homologs but better for positions that vary mostly among distant homologs. This approach identifies functional determinants, predicts function, guides the mutational redesign of functional and allosteric specificity, and interprets the action of coding sequence variations in proteins, people and populations. Now, the UET database offers pre-computed ET analyses for the protein structure databank, and on-the-fly analysis of any protein sequence. A web interface retrieves ET rankings of sequence positions and maps results to a structure to identify functionally important regions. This UET database integrates several ways of viewing the results on the protein sequence or structure and can be found at http://mammoth.bcm.tmc.edu/uet/. PMID:26590254

  1. Characterization and cDNA sequence of Bothriechis schlegelii l-amino acid oxidase with antibacterial activity.

    PubMed

    Vargas Muñoz, Leidy Johana; Estrada-Gomez, Sebastian; Núñez, Vitelbina; Sanz, Libia; Calvete, Juan J

    2014-08-01

    Snake venoms are complex mixtures of proteins including l-amino acid oxidase (lAAO). An lAAO (named BslAAO) with a mass of 56 kDa and a theoretical pI of 5.79 was purified from Bothriechis schlegelii venom through size-exclusion, ion exchange and affinity chroma