cOSPREY: A Cloud-Based Distributed Algorithm for Large-Scale Computational Protein Design
Pan, Yuchao; Dong, Yuxi; Zhou, Jingtian; Hallen, Mark; Donald, Bruce R.; Xu, Wei
2016-01-01
Abstract Finding the global minimum energy conformation (GMEC) of a huge combinatorial search space is the key challenge in computational protein design (CPD) problems. Traditional algorithms lack a scalable and efficient distributed design scheme, preventing researchers from taking full advantage of current cloud infrastructures. We design cloud OSPREY (cOSPREY), an extension to a widely used protein design software OSPREY, to allow the original design framework to scale to the commercial cloud infrastructures. We propose several novel designs to integrate both algorithm and system optimizations, such as GMEC-specific pruning, state search partitioning, asynchronous algorithm state sharing, and fault tolerance. We evaluate cOSPREY on three different cloud platforms using different technologies and show that it can solve a number of large-scale protein design problems that have not been possible with previous approaches. PMID:27154509
Parallel Computational Protein Design.
Zhou, Yichao; Donald, Bruce R; Zeng, Jianyang
2017-01-01
Computational structure-based protein design (CSPD) is an important problem in computational biology, which aims to design or improve a prescribed protein function based on a protein structure template. It provides a practical tool for real-world protein engineering applications. A popular CSPD method that guarantees to find the global minimum energy solution (GMEC) is to combine both dead-end elimination (DEE) and A* tree search algorithms. However, in this framework, the A* search algorithm can run in exponential time in the worst case, which may become the computation bottleneck of large-scale computational protein design process. To address this issue, we extend and add a new module to the OSPREY program that was previously developed in the Donald lab (Gainza et al., Methods Enzymol 523:87, 2013) to implement a GPU-based massively parallel A* algorithm for improving protein design pipeline. By exploiting the modern GPU computational framework and optimizing the computation of the heuristic function for A* search, our new program, called gOSPREY, can provide up to four orders of magnitude speedups in large protein design cases with a small memory overhead comparing to the traditional A* search algorithm implementation, while still guaranteeing the optimality. In addition, gOSPREY can be configured to run in a bounded-memory mode to tackle the problems in which the conformation space is too large and the global optimal solution cannot be computed previously. Furthermore, the GPU-based A* algorithm implemented in the gOSPREY program can be combined with the state-of-the-art rotamer pruning algorithms such as iMinDEE (Gainza et al., PLoS Comput Biol 8:e1002335, 2012) and DEEPer (Hallen et al., Proteins 81:18-39, 2013) to also consider continuous backbone and side-chain flexibility.
SHARPEN-systematic hierarchical algorithms for rotamers and proteins on an extended network.
Loksha, Ilya V; Maiolo, James R; Hong, Cheng W; Ng, Albert; Snow, Christopher D
2009-04-30
Algorithms for discrete optimization of proteins play a central role in recent advances in protein structure prediction and design. We wish to improve the resources available for computational biologists to rapidly prototype such algorithms and to easily scale these algorithms to many processors. To that end, we describe the implementation and use of two new open source resources, citing potential benefits over existing software. We discuss CHOMP, a new object-oriented library for macromolecular optimization, and SHARPEN, a framework for scaling CHOMP scripts to many computers. These tools allow users to develop new algorithms for a variety of applications including protein repacking, protein-protein docking, loop rebuilding, or homology model remediation. Particular care was taken to allow modular energy function design; protein conformations may currently be scored using either the OPLSaa molecular mechanical energy function or an all-atom semiempirical energy function employed by Rosetta. (c) 2009 Wiley Periodicals, Inc.
Energy design for protein-protein interactions
Ravikant, D. V. S.; Elber, Ron
2011-01-01
Proteins bind to other proteins efficiently and specifically to carry on many cell functions such as signaling, activation, transport, enzymatic reactions, and more. To determine the geometry and strength of binding of a protein pair, an energy function is required. An algorithm to design an optimal energy function, based on empirical data of protein complexes, is proposed and applied. Emphasis is made on negative design in which incorrect geometries are presented to the algorithm that learns to avoid them. For the docking problem the search for plausible geometries can be performed exhaustively. The possible geometries of the complex are generated on a grid with the help of a fast Fourier transform algorithm. A novel formulation of negative design makes it possible to investigate iteratively hundreds of millions of negative examples while monotonically improving the quality of the potential. Experimental structures for 640 protein complexes are used to generate positive and negative examples for learning parameters. The algorithm designed in this work finds the correct binding structure as the lowest energy minimum in 318 cases of the 640 examples. Further benchmarks on independent sets confirm the significant capacity of the scoring function to recognize correct modes of interactions. PMID:21842951
Benchmarking protein classification algorithms via supervised cross-validation.
Kertész-Farkas, Attila; Dhir, Somdutta; Sonego, Paolo; Pacurar, Mircea; Netoteia, Sergiu; Nijveen, Harm; Kuzniar, Arnold; Leunissen, Jack A M; Kocsor, András; Pongor, Sándor
2008-04-24
Development and testing of protein classification algorithms are hampered by the fact that the protein universe is characterized by groups vastly different in the number of members, in average protein size, similarity within group, etc. Datasets based on traditional cross-validation (k-fold, leave-one-out, etc.) may not give reliable estimates on how an algorithm will generalize to novel, distantly related subtypes of the known protein classes. Supervised cross-validation, i.e., selection of test and train sets according to the known subtypes within a database has been successfully used earlier in conjunction with the SCOP database. Our goal was to extend this principle to other databases and to design standardized benchmark datasets for protein classification. Hierarchical classification trees of protein categories provide a simple and general framework for designing supervised cross-validation strategies for protein classification. Benchmark datasets can be designed at various levels of the concept hierarchy using a simple graph-theoretic distance. A combination of supervised and random sampling was selected to construct reduced size model datasets, suitable for algorithm comparison. Over 3000 new classification tasks were added to our recently established protein classification benchmark collection that currently includes protein sequence (including protein domains and entire proteins), protein structure and reading frame DNA sequence data. We carried out an extensive evaluation based on various machine-learning algorithms such as nearest neighbor, support vector machines, artificial neural networks, random forests and logistic regression, used in conjunction with comparison algorithms, BLAST, Smith-Waterman, Needleman-Wunsch, as well as 3D comparison methods DALI and PRIDE. The resulting datasets provide lower, and in our opinion more realistic estimates of the classifier performance than do random cross-validation schemes. A combination of supervised and random sampling was used to construct model datasets, suitable for algorithm comparison.
OSPREY: protein design with ensembles, flexibility, and provable algorithms.
Gainza, Pablo; Roberts, Kyle E; Georgiev, Ivelin; Lilien, Ryan H; Keedy, Daniel A; Chen, Cheng-Yu; Reza, Faisal; Anderson, Amy C; Richardson, David C; Richardson, Jane S; Donald, Bruce R
2013-01-01
We have developed a suite of protein redesign algorithms that improves realistic in silico modeling of proteins. These algorithms are based on three characteristics that make them unique: (1) improved flexibility of the protein backbone, protein side-chains, and ligand to accurately capture the conformational changes that are induced by mutations to the protein sequence; (2) modeling of proteins and ligands as ensembles of low-energy structures to better approximate binding affinity; and (3) a globally optimal protein design search, guaranteeing that the computational predictions are optimal with respect to the input model. Here, we illustrate the importance of these three characteristics. We then describe OSPREY, a protein redesign suite that implements our protein design algorithms. OSPREY has been used prospectively, with experimental validation, in several biomedically relevant settings. We show in detail how OSPREY has been used to predict resistance mutations and explain why improved flexibility, ensembles, and provability are essential for this application. OSPREY is free and open source under a Lesser GPL license. The latest version is OSPREY 2.0. The program, user manual, and source code are available at www.cs.duke.edu/donaldlab/software.php. osprey@cs.duke.edu. Copyright © 2013 Elsevier Inc. All rights reserved.
Jou, Jonathan D; Jain, Swati; Georgiev, Ivelin S; Donald, Bruce R
2016-06-01
Sparse energy functions that ignore long range interactions between residue pairs are frequently used by protein design algorithms to reduce computational cost. Current dynamic programming algorithms that fully exploit the optimal substructure produced by these energy functions only compute the GMEC. This disproportionately favors the sequence of a single, static conformation and overlooks better binding sequences with multiple low-energy conformations. Provable, ensemble-based algorithms such as A* avoid this problem, but A* cannot guarantee better performance than exhaustive enumeration. We propose a novel, provable, dynamic programming algorithm called Branch-Width Minimization* (BWM*) to enumerate a gap-free ensemble of conformations in order of increasing energy. Given a branch-decomposition of branch-width w for an n-residue protein design with at most q discrete side-chain conformations per residue, BWM* returns the sparse GMEC in O([Formula: see text]) time and enumerates each additional conformation in merely O([Formula: see text]) time. We define a new measure, Total Effective Search Space (TESS), which can be computed efficiently a priori before BWM* or A* is run. We ran BWM* on 67 protein design problems and found that TESS discriminated between BWM*-efficient and A*-efficient cases with 100% accuracy. As predicted by TESS and validated experimentally, BWM* outperforms A* in 73% of the cases and computes the full ensemble or a close approximation faster than A*, enumerating each additional conformation in milliseconds. Unlike A*, the performance of BWM* can be predicted in polynomial time before running the algorithm, which gives protein designers the power to choose the most efficient algorithm for their particular design problem.
Sevy, Alexander M.; Jacobs, Tim M.; Crowe, James E.; Meiler, Jens
2015-01-01
Computational protein design has found great success in engineering proteins for thermodynamic stability, binding specificity, or enzymatic activity in a ‘single state’ design (SSD) paradigm. Multi-specificity design (MSD), on the other hand, involves considering the stability of multiple protein states simultaneously. We have developed a novel MSD algorithm, which we refer to as REstrained CONvergence in multi-specificity design (RECON). The algorithm allows each state to adopt its own sequence throughout the design process rather than enforcing a single sequence on all states. Convergence to a single sequence is encouraged through an incrementally increasing convergence restraint for corresponding positions. Compared to MSD algorithms that enforce (constrain) an identical sequence on all states the energy landscape is simplified, which accelerates the search drastically. As a result, RECON can readily be used in simulations with a flexible protein backbone. We have benchmarked RECON on two design tasks. First, we designed antibodies derived from a common germline gene against their diverse targets to assess recovery of the germline, polyspecific sequence. Second, we design “promiscuous”, polyspecific proteins against all binding partners and measure recovery of the native sequence. We show that RECON is able to efficiently recover native-like, biologically relevant sequences in this diverse set of protein complexes. PMID:26147100
3D Protein structure prediction with genetic tabu search algorithm
2010-01-01
Background Protein structure prediction (PSP) has important applications in different fields, such as drug design, disease prediction, and so on. In protein structure prediction, there are two important issues. The first one is the design of the structure model and the second one is the design of the optimization technology. Because of the complexity of the realistic protein structure, the structure model adopted in this paper is a simplified model, which is called off-lattice AB model. After the structure model is assumed, optimization technology is needed for searching the best conformation of a protein sequence based on the assumed structure model. However, PSP is an NP-hard problem even if the simplest model is assumed. Thus, many algorithms have been developed to solve the global optimization problem. In this paper, a hybrid algorithm, which combines genetic algorithm (GA) and tabu search (TS) algorithm, is developed to complete this task. Results In order to develop an efficient optimization algorithm, several improved strategies are developed for the proposed genetic tabu search algorithm. The combined use of these strategies can improve the efficiency of the algorithm. In these strategies, tabu search introduced into the crossover and mutation operators can improve the local search capability, the adoption of variable population size strategy can maintain the diversity of the population, and the ranking selection strategy can improve the possibility of an individual with low energy value entering into next generation. Experiments are performed with Fibonacci sequences and real protein sequences. Experimental results show that the lowest energy obtained by the proposed GATS algorithm is lower than that obtained by previous methods. Conclusions The hybrid algorithm has the advantages from both genetic algorithm and tabu search algorithm. It makes use of the advantage of multiple search points in genetic algorithm, and can overcome poor hill-climbing capability in the conventional genetic algorithm by using the flexible memory functions of TS. Compared with some previous algorithms, GATS algorithm has better performance in global optimization and can predict 3D protein structure more effectively. PMID:20522256
A critical analysis of computational protein design with sparse residue interaction graphs
Georgiev, Ivelin S.
2017-01-01
Protein design algorithms enumerate a combinatorial number of candidate structures to compute the Global Minimum Energy Conformation (GMEC). To efficiently find the GMEC, protein design algorithms must methodically reduce the conformational search space. By applying distance and energy cutoffs, the protein system to be designed can thus be represented using a sparse residue interaction graph, where the number of interacting residue pairs is less than all pairs of mutable residues, and the corresponding GMEC is called the sparse GMEC. However, ignoring some pairwise residue interactions can lead to a change in the energy, conformation, or sequence of the sparse GMEC vs. the original or the full GMEC. Despite the widespread use of sparse residue interaction graphs in protein design, the above mentioned effects of their use have not been previously analyzed. To analyze the costs and benefits of designing with sparse residue interaction graphs, we computed the GMECs for 136 different protein design problems both with and without distance and energy cutoffs, and compared their energies, conformations, and sequences. Our analysis shows that the differences between the GMECs depend critically on whether or not the design includes core, boundary, or surface residues. Moreover, neglecting long-range interactions can alter local interactions and introduce large sequence differences, both of which can result in significant structural and functional changes. Designs on proteins with experimentally measured thermostability show it is beneficial to compute both the full and the sparse GMEC accurately and efficiently. To this end, we show that a provable, ensemble-based algorithm can efficiently compute both GMECs by enumerating a small number of conformations, usually fewer than 1000. This provides a novel way to combine sparse residue interaction graphs with provable, ensemble-based algorithms to reap the benefits of sparse residue interaction graphs while avoiding their potential inaccuracies. PMID:28358804
Yeh, Chun-Ting; Brunette, T J; Baker, David; McIntosh-Smith, Simon; Parmeggiani, Fabio
2018-02-01
Computational protein design methods have enabled the design of novel protein structures, but they are often still limited to small proteins and symmetric systems. To expand the size of designable proteins while controlling the overall structure, we developed Elfin, a genetic algorithm for the design of novel proteins with custom shapes using structural building blocks derived from experimentally verified repeat proteins. By combining building blocks with compatible interfaces, it is possible to rapidly build non-symmetric large structures (>1000 amino acids) that match three-dimensional geometric descriptions provided by the user. A run time of about 20min on a laptop computer for a 3000 amino acid structure makes Elfin accessible to users with limited computational resources. Protein structures with controlled geometry will allow the systematic study of the effect of spatial arrangement of enzymes and signaling molecules, and provide new scaffolds for functional nanomaterials. Copyright © 2017 Elsevier Inc. All rights reserved.
Projections for fast protein structure retrieval
Bhattacharya, Sourangshu; Bhattacharyya, Chiranjib; Chandra, Nagasuma R
2006-01-01
Background In recent times, there has been an exponential rise in the number of protein structures in databases e.g. PDB. So, design of fast algorithms capable of querying such databases is becoming an increasingly important research issue. This paper reports an algorithm, motivated from spectral graph matching techniques, for retrieving protein structures similar to a query structure from a large protein structure database. Each protein structure is specified by the 3D coordinates of residues of the protein. The algorithm is based on a novel characterization of the residues, called projections, leading to a similarity measure between the residues of the two proteins. This measure is exploited to efficiently compute the optimal equivalences. Results Experimental results show that, the current algorithm outperforms the state of the art on benchmark datasets in terms of speed without losing accuracy. Search results on SCOP 95% nonredundant database, for fold similarity with 5 proteins from different SCOP classes show that the current method performs competitively with the standard algorithm CE. The algorithm is also capable of detecting non-topological similarities between two proteins which is not possible with most of the state of the art tools like Dali. PMID:17254310
Minimalist ensemble algorithms for genome-wide protein localization prediction.
Lin, Jhih-Rong; Mondal, Ananda Mohan; Liu, Rong; Hu, Jianjun
2012-07-03
Computational prediction of protein subcellular localization can greatly help to elucidate its functions. Despite the existence of dozens of protein localization prediction algorithms, the prediction accuracy and coverage are still low. Several ensemble algorithms have been proposed to improve the prediction performance, which usually include as many as 10 or more individual localization algorithms. However, their performance is still limited by the running complexity and redundancy among individual prediction algorithms. This paper proposed a novel method for rational design of minimalist ensemble algorithms for practical genome-wide protein subcellular localization prediction. The algorithm is based on combining a feature selection based filter and a logistic regression classifier. Using a novel concept of contribution scores, we analyzed issues of algorithm redundancy, consensus mistakes, and algorithm complementarity in designing ensemble algorithms. We applied the proposed minimalist logistic regression (LR) ensemble algorithm to two genome-wide datasets of Yeast and Human and compared its performance with current ensemble algorithms. Experimental results showed that the minimalist ensemble algorithm can achieve high prediction accuracy with only 1/3 to 1/2 of individual predictors of current ensemble algorithms, which greatly reduces computational complexity and running time. It was found that the high performance ensemble algorithms are usually composed of the predictors that together cover most of available features. Compared to the best individual predictor, our ensemble algorithm improved the prediction accuracy from AUC score of 0.558 to 0.707 for the Yeast dataset and from 0.628 to 0.646 for the Human dataset. Compared with popular weighted voting based ensemble algorithms, our classifier-based ensemble algorithms achieved much better performance without suffering from inclusion of too many individual predictors. We proposed a method for rational design of minimalist ensemble algorithms using feature selection and classifiers. The proposed minimalist ensemble algorithm based on logistic regression can achieve equal or better prediction performance while using only half or one-third of individual predictors compared to other ensemble algorithms. The results also suggested that meta-predictors that take advantage of a variety of features by combining individual predictors tend to achieve the best performance. The LR ensemble server and related benchmark datasets are available at http://mleg.cse.sc.edu/LRensemble/cgi-bin/predict.cgi.
Minimalist ensemble algorithms for genome-wide protein localization prediction
2012-01-01
Background Computational prediction of protein subcellular localization can greatly help to elucidate its functions. Despite the existence of dozens of protein localization prediction algorithms, the prediction accuracy and coverage are still low. Several ensemble algorithms have been proposed to improve the prediction performance, which usually include as many as 10 or more individual localization algorithms. However, their performance is still limited by the running complexity and redundancy among individual prediction algorithms. Results This paper proposed a novel method for rational design of minimalist ensemble algorithms for practical genome-wide protein subcellular localization prediction. The algorithm is based on combining a feature selection based filter and a logistic regression classifier. Using a novel concept of contribution scores, we analyzed issues of algorithm redundancy, consensus mistakes, and algorithm complementarity in designing ensemble algorithms. We applied the proposed minimalist logistic regression (LR) ensemble algorithm to two genome-wide datasets of Yeast and Human and compared its performance with current ensemble algorithms. Experimental results showed that the minimalist ensemble algorithm can achieve high prediction accuracy with only 1/3 to 1/2 of individual predictors of current ensemble algorithms, which greatly reduces computational complexity and running time. It was found that the high performance ensemble algorithms are usually composed of the predictors that together cover most of available features. Compared to the best individual predictor, our ensemble algorithm improved the prediction accuracy from AUC score of 0.558 to 0.707 for the Yeast dataset and from 0.628 to 0.646 for the Human dataset. Compared with popular weighted voting based ensemble algorithms, our classifier-based ensemble algorithms achieved much better performance without suffering from inclusion of too many individual predictors. Conclusions We proposed a method for rational design of minimalist ensemble algorithms using feature selection and classifiers. The proposed minimalist ensemble algorithm based on logistic regression can achieve equal or better prediction performance while using only half or one-third of individual predictors compared to other ensemble algorithms. The results also suggested that meta-predictors that take advantage of a variety of features by combining individual predictors tend to achieve the best performance. The LR ensemble server and related benchmark datasets are available at http://mleg.cse.sc.edu/LRensemble/cgi-bin/predict.cgi. PMID:22759391
Lim, Hansaim; Gray, Paul; Xie, Lei; Poleksic, Aleksandar
2016-01-01
Conventional one-drug-one-gene approach has been of limited success in modern drug discovery. Polypharmacology, which focuses on searching for multi-targeted drugs to perturb disease-causing networks instead of designing selective ligands to target individual proteins, has emerged as a new drug discovery paradigm. Although many methods for single-target virtual screening have been developed to improve the efficiency of drug discovery, few of these algorithms are designed for polypharmacology. Here, we present a novel theoretical framework and a corresponding algorithm for genome-scale multi-target virtual screening based on the one-class collaborative filtering technique. Our method overcomes the sparseness of the protein-chemical interaction data by means of interaction matrix weighting and dual regularization from both chemicals and proteins. While the statistical foundation behind our method is general enough to encompass genome-wide drug off-target prediction, the program is specifically tailored to find protein targets for new chemicals with little to no available interaction data. We extensively evaluate our method using a number of the most widely accepted gene-specific and cross-gene family benchmarks and demonstrate that our method outperforms other state-of-the-art algorithms for predicting the interaction of new chemicals with multiple proteins. Thus, the proposed algorithm may provide a powerful tool for multi-target drug design. PMID:27958331
Lim, Hansaim; Gray, Paul; Xie, Lei; Poleksic, Aleksandar
2016-12-13
Conventional one-drug-one-gene approach has been of limited success in modern drug discovery. Polypharmacology, which focuses on searching for multi-targeted drugs to perturb disease-causing networks instead of designing selective ligands to target individual proteins, has emerged as a new drug discovery paradigm. Although many methods for single-target virtual screening have been developed to improve the efficiency of drug discovery, few of these algorithms are designed for polypharmacology. Here, we present a novel theoretical framework and a corresponding algorithm for genome-scale multi-target virtual screening based on the one-class collaborative filtering technique. Our method overcomes the sparseness of the protein-chemical interaction data by means of interaction matrix weighting and dual regularization from both chemicals and proteins. While the statistical foundation behind our method is general enough to encompass genome-wide drug off-target prediction, the program is specifically tailored to find protein targets for new chemicals with little to no available interaction data. We extensively evaluate our method using a number of the most widely accepted gene-specific and cross-gene family benchmarks and demonstrate that our method outperforms other state-of-the-art algorithms for predicting the interaction of new chemicals with multiple proteins. Thus, the proposed algorithm may provide a powerful tool for multi-target drug design.
Efficient search, mapping, and optimization of multi-protein genetic systems in diverse bacteria
Farasat, Iman; Kushwaha, Manish; Collens, Jason; Easterbrook, Michael; Guido, Matthew; Salis, Howard M
2014-01-01
Developing predictive models of multi-protein genetic systems to understand and optimize their behavior remains a combinatorial challenge, particularly when measurement throughput is limited. We developed a computational approach to build predictive models and identify optimal sequences and expression levels, while circumventing combinatorial explosion. Maximally informative genetic system variants were first designed by the RBS Library Calculator, an algorithm to design sequences for efficiently searching a multi-protein expression space across a > 10,000-fold range with tailored search parameters and well-predicted translation rates. We validated the algorithm's predictions by characterizing 646 genetic system variants, encoded in plasmids and genomes, expressed in six gram-positive and gram-negative bacterial hosts. We then combined the search algorithm with system-level kinetic modeling, requiring the construction and characterization of 73 variants to build a sequence-expression-activity map (SEAMAP) for a biosynthesis pathway. Using model predictions, we designed and characterized 47 additional pathway variants to navigate its activity space, find optimal expression regions with desired activity response curves, and relieve rate-limiting steps in metabolism. Creating sequence-expression-activity maps accelerates the optimization of many protein systems and allows previous measurements to quantitatively inform future designs. PMID:24952589
Zhou, Ruhong
2004-05-01
A highly parallel replica exchange method (REM) that couples with a newly developed molecular dynamics algorithm particle-particle particle-mesh Ewald (P3ME)/RESPA has been proposed for efficient sampling of protein folding free energy landscape. The algorithm is then applied to two separate protein systems, beta-hairpin and a designed protein Trp-cage. The all-atom OPLSAA force field with an explicit solvent model is used for both protein folding simulations. Up to 64 replicas of solvated protein systems are simulated in parallel over a wide range of temperatures. The combined trajectories in temperature and configurational space allow a replica to overcome free energy barriers present at low temperatures. These large scale simulations reveal detailed results on folding mechanisms, intermediate state structures, thermodynamic properties and the temperature dependences for both protein systems.
High performance transcription factor-DNA docking with GPU computing
2012-01-01
Background Protein-DNA docking is a very challenging problem in structural bioinformatics and has important implications in a number of applications, such as structure-based prediction of transcription factor binding sites and rational drug design. Protein-DNA docking is very computational demanding due to the high cost of energy calculation and the statistical nature of conformational sampling algorithms. More importantly, experiments show that the docking quality depends on the coverage of the conformational sampling space. It is therefore desirable to accelerate the computation of the docking algorithm, not only to reduce computing time, but also to improve docking quality. Methods In an attempt to accelerate the sampling process and to improve the docking performance, we developed a graphics processing unit (GPU)-based protein-DNA docking algorithm. The algorithm employs a potential-based energy function to describe the binding affinity of a protein-DNA pair, and integrates Monte-Carlo simulation and a simulated annealing method to search through the conformational space. Algorithmic techniques were developed to improve the computation efficiency and scalability on GPU-based high performance computing systems. Results The effectiveness of our approach is tested on a non-redundant set of 75 TF-DNA complexes and a newly developed TF-DNA docking benchmark. We demonstrated that the GPU-based docking algorithm can significantly accelerate the simulation process and thereby improving the chance of finding near-native TF-DNA complex structures. This study also suggests that further improvement in protein-DNA docking research would require efforts from two integral aspects: improvement in computation efficiency and energy function design. Conclusions We present a high performance computing approach for improving the prediction accuracy of protein-DNA docking. The GPU-based docking algorithm accelerates the search of the conformational space and thus increases the chance of finding more near-native structures. To the best of our knowledge, this is the first ad hoc effort of applying GPU or GPU clusters to the protein-DNA docking problem. PMID:22759575
Generic framework for mining cellular automata models on protein-folding simulations.
Diaz, N; Tischer, I
2016-05-13
Cellular automata model identification is an important way of building simplified simulation models. In this study, we describe a generic architectural framework to ease the development process of new metaheuristic-based algorithms for cellular automata model identification in protein-folding trajectories. Our framework was developed by a methodology based on design patterns that allow an improved experience for new algorithms development. The usefulness of the proposed framework is demonstrated by the implementation of four algorithms, able to obtain extremely precise cellular automata models of the protein-folding process with a protein contact map representation. Dynamic rules obtained by the proposed approach are discussed, and future use for the new tool is outlined.
A High Performance Cloud-Based Protein-Ligand Docking Prediction Algorithm
Chen, Jui-Le; Yang, Chu-Sing
2013-01-01
The potential of predicting druggability for a particular disease by integrating biological and computer science technologies has witnessed success in recent years. Although the computer science technologies can be used to reduce the costs of the pharmaceutical research, the computation time of the structure-based protein-ligand docking prediction is still unsatisfied until now. Hence, in this paper, a novel docking prediction algorithm, named fast cloud-based protein-ligand docking prediction algorithm (FCPLDPA), is presented to accelerate the docking prediction algorithm. The proposed algorithm works by leveraging two high-performance operators: (1) the novel migration (information exchange) operator is designed specially for cloud-based environments to reduce the computation time; (2) the efficient operator is aimed at filtering out the worst search directions. Our simulation results illustrate that the proposed method outperforms the other docking algorithms compared in this paper in terms of both the computation time and the quality of the end result. PMID:23762864
Computational protein design with backbone plasticity
MacDonald, James T.; Freemont, Paul S.
2016-01-01
The computational algorithms used in the design of artificial proteins have become increasingly sophisticated in recent years, producing a series of remarkable successes. The most dramatic of these is the de novo design of artificial enzymes. The majority of these designs have reused naturally occurring protein structures as ‘scaffolds’ onto which novel functionality can be grafted without having to redesign the backbone structure. The incorporation of backbone flexibility into protein design is a much more computationally challenging problem due to the greatly increased search space, but promises to remove the limitations of reusing natural protein scaffolds. In this review, we outline the principles of computational protein design methods and discuss recent efforts to consider backbone plasticity in the design process. PMID:27911735
Protein Structure Prediction with Evolutionary Algorithms
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hart, W.E.; Krasnogor, N.; Pelta, D.A.
1999-02-08
Evolutionary algorithms have been successfully applied to a variety of molecular structure prediction problems. In this paper we reconsider the design of genetic algorithms that have been applied to a simple protein structure prediction problem. Our analysis considers the impact of several algorithmic factors for this problem: the confirmational representation, the energy formulation and the way in which infeasible conformations are penalized, Further we empirically evaluated the impact of these factors on a small set of polymer sequences. Our analysis leads to specific recommendations for both GAs as well as other heuristic methods for solving PSP on the HP model.
Ghirlanda, G; Lear, J D; Lombardi, A; DeGrado, W F
1998-08-14
A series of synthetic receptors capable of binding to the calmodulin-binding domain of calcineurin (CN393-414) was designed, synthesized and characterized. The design was accomplished by docking CN393-414 against a two-helix receptor, using an idealized three-stranded coiled coil as a starting geometry. The sequence of the receptor was chosen using a side-chain re-packing program, which employed a genetic algorithm to select potential binders from a total of 7.5x10(6) possible sequences. A total of 25 receptors were prepared, representing 13 sequences predicted by the algorithm as well as 12 related sequences that were not predicted. The receptors were characterized by CD spectroscopy, analytical ultracentrifugation, and binding assays. The receptors predicted by the algorithm bound CN393-414 with apparent dissociation constants ranging from 0.2 microM to >50 microM. Many of the receptors that were not predicted by the algorithm also bound to CN393-414. Methods to circumvent this problem and to improve the automated design of functional proteins are discussed. Copyright 1998 Academic Press
Petitjean, Michel
2017-10-01
Some major proteins families, such as carbonic anhydrases (CAs), have a conical cavity at the active site. No algorithm was available to compute conical cavities, so we needed to design one. The fast algorithm we designed let us show on a set of 717 CAs extracted from the PDB database that γ-CAs are characterized by active site cavity cone angles significantly larger than those of α-CAs and β-CAs: the generatrix-axis angles are greater than 60° for the γ-CAs while they are smaller than 50° for the other CAs. Free binaries of the CONICA software implementing the algorithm are available through a software repository at http://petitjeanmichel.free.fr/itoweb.petitjean.freeware.html. © 2017 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim.
Computational approaches for rational design of proteins with novel functionalities
Tiwari, Manish Kumar; Singh, Ranjitha; Singh, Raushan Kumar; Kim, In-Won; Lee, Jung-Kul
2012-01-01
Proteins are the most multifaceted macromolecules in living systems and have various important functions, including structural, catalytic, sensory, and regulatory functions. Rational design of enzymes is a great challenge to our understanding of protein structure and physical chemistry and has numerous potential applications. Protein design algorithms have been applied to design or engineer proteins that fold, fold faster, catalyze, catalyze faster, signal, and adopt preferred conformational states. The field of de novo protein design, although only a few decades old, is beginning to produce exciting results. Developments in this field are already having a significant impact on biotechnology and chemical biology. The application of powerful computational methods for functional protein designing has recently succeeded at engineering target activities. Here, we review recently reported de novo functional proteins that were developed using various protein design approaches, including rational design, computational optimization, and selection from combinatorial libraries, highlighting recent advances and successes. PMID:24688643
Identification of Conserved Water Sites in Protein Structures for Drug Design.
Jukič, Marko; Konc, Janez; Gobec, Stanislav; Janežič, Dušanka
2017-12-26
Identification of conserved waters in protein structures is a challenging task with applications in molecular docking and protein stability prediction. As an alternative to computationally demanding simulations of proteins in water, experimental cocrystallized waters in the Protein Data Bank (PDB) in combination with a local structure alignment algorithm can be used for reliable prediction of conserved water sites. We developed the ProBiS H2O approach based on the previously developed ProBiS algorithm, which enables identification of conserved water sites in proteins using experimental protein structures from the PDB or a set of custom protein structures available to the user. With a protein structure, a binding site, or an individual water molecule as a query, ProBiS H2O collects similar proteins from the PDB and performs local or binding site-specific superimpositions of the query structure with similar proteins using the ProBiS algorithm. It collects the experimental water molecules from the similar proteins and transposes them to the query protein. Transposed waters are clustered by their mutual proximity, which enables identification of discrete sites in the query protein with high water conservation. ProBiS H2O is a robust and fast new approach that uses existing experimental structural data to identify conserved water sites on the interfaces of protein complexes, for example protein-small molecule interfaces, and elsewhere on the protein structures. It has been successfully validated in several reported proteins in which conserved water molecules were found to play an important role in ligand binding with applications in drug design.
Computational intelligence techniques for biological data mining: An overview
NASA Astrophysics Data System (ADS)
Faye, Ibrahima; Iqbal, Muhammad Javed; Said, Abas Md; Samir, Brahim Belhaouari
2014-10-01
Computational techniques have been successfully utilized for a highly accurate analysis and modeling of multifaceted and raw biological data gathered from various genome sequencing projects. These techniques are proving much more effective to overcome the limitations of the traditional in-vitro experiments on the constantly increasing sequence data. However, most critical problems that caught the attention of the researchers may include, but not limited to these: accurate structure and function prediction of unknown proteins, protein subcellular localization prediction, finding protein-protein interactions, protein fold recognition, analysis of microarray gene expression data, etc. To solve these problems, various classification and clustering techniques using machine learning have been extensively used in the published literature. These techniques include neural network algorithms, genetic algorithms, fuzzy ARTMAP, K-Means, K-NN, SVM, Rough set classifiers, decision tree and HMM based algorithms. Major difficulties in applying the above algorithms include the limitations found in the previous feature encoding and selection methods while extracting the best features, increasing classification accuracy and decreasing the running time overheads of the learning algorithms. The application of this research would be potentially useful in the drug design and in the diagnosis of some diseases. This paper presents a concise overview of the well-known protein classification techniques.
PyRosetta: a script-based interface for implementing molecular modeling algorithms using Rosetta.
Chaudhury, Sidhartha; Lyskov, Sergey; Gray, Jeffrey J
2010-03-01
PyRosetta is a stand-alone Python-based implementation of the Rosetta molecular modeling package that allows users to write custom structure prediction and design algorithms using the major Rosetta sampling and scoring functions. PyRosetta contains Python bindings to libraries that define Rosetta functions including those for accessing and manipulating protein structure, calculating energies and running Monte Carlo-based simulations. PyRosetta can be used in two ways: (i) interactively, using iPython and (ii) script-based, using Python scripting. Interactive mode contains a number of help features and is ideal for beginners while script-mode is best suited for algorithm development. PyRosetta has similar computational performance to Rosetta, can be easily scaled up for cluster applications and has been implemented for algorithms demonstrating protein docking, protein folding, loop modeling and design. PyRosetta is a stand-alone package available at http://www.pyrosetta.org under the Rosetta license which is free for academic and non-profit users. A tutorial, user's manual and sample scripts demonstrating usage are also available on the web site.
PyRosetta: a script-based interface for implementing molecular modeling algorithms using Rosetta
Chaudhury, Sidhartha; Lyskov, Sergey; Gray, Jeffrey J.
2010-01-01
Summary: PyRosetta is a stand-alone Python-based implementation of the Rosetta molecular modeling package that allows users to write custom structure prediction and design algorithms using the major Rosetta sampling and scoring functions. PyRosetta contains Python bindings to libraries that define Rosetta functions including those for accessing and manipulating protein structure, calculating energies and running Monte Carlo-based simulations. PyRosetta can be used in two ways: (i) interactively, using iPython and (ii) script-based, using Python scripting. Interactive mode contains a number of help features and is ideal for beginners while script-mode is best suited for algorithm development. PyRosetta has similar computational performance to Rosetta, can be easily scaled up for cluster applications and has been implemented for algorithms demonstrating protein docking, protein folding, loop modeling and design. Availability: PyRosetta is a stand-alone package available at http://www.pyrosetta.org under the Rosetta license which is free for academic and non-profit users. A tutorial, user's manual and sample scripts demonstrating usage are also available on the web site. Contact: pyrosetta@graylab.jhu.edu PMID:20061306
An Integrated Framework Advancing Membrane Protein Modeling and Design
Weitzner, Brian D.; Duran, Amanda M.; Tilley, Drew C.; Elazar, Assaf; Gray, Jeffrey J.
2015-01-01
Membrane proteins are critical functional molecules in the human body, constituting more than 30% of open reading frames in the human genome. Unfortunately, a myriad of difficulties in overexpression and reconstitution into membrane mimetics severely limit our ability to determine their structures. Computational tools are therefore instrumental to membrane protein structure prediction, consequently increasing our understanding of membrane protein function and their role in disease. Here, we describe a general framework facilitating membrane protein modeling and design that combines the scientific principles for membrane protein modeling with the flexible software architecture of Rosetta3. This new framework, called RosettaMP, provides a general membrane representation that interfaces with scoring, conformational sampling, and mutation routines that can be easily combined to create new protocols. To demonstrate the capabilities of this implementation, we developed four proof-of-concept applications for (1) prediction of free energy changes upon mutation; (2) high-resolution structural refinement; (3) protein-protein docking; and (4) assembly of symmetric protein complexes, all in the membrane environment. Preliminary data show that these algorithms can produce meaningful scores and structures. The data also suggest needed improvements to both sampling routines and score functions. Importantly, the applications collectively demonstrate the potential of combining the flexible nature of RosettaMP with the power of Rosetta algorithms to facilitate membrane protein modeling and design. PMID:26325167
Cytoprophet: a Cytoscape plug-in for protein and domain interaction networks inference.
Morcos, Faruck; Lamanna, Charles; Sikora, Marcin; Izaguirre, Jesús
2008-10-01
Cytoprophet is a software tool that allows prediction and visualization of protein and domain interaction networks. It is implemented as a plug-in of Cytoscape, an open source software framework for analysis and visualization of molecular networks. Cytoprophet implements three algorithms that predict new potential physical interactions using the domain composition of proteins and experimental assays. The algorithms for protein and domain interaction inference include maximum likelihood estimation (MLE) using expectation maximization (EM); the set cover approach maximum specificity set cover (MSSC) and the sum-product algorithm (SPA). After accepting an input set of proteins with Uniprot ID/Accession numbers and a selected prediction algorithm, Cytoprophet draws a network of potential interactions with probability scores and GO distances as edge attributes. A network of domain interactions between the domains of the initial protein list can also be generated. Cytoprophet was designed to take advantage of the visual capabilities of Cytoscape and be simple to use. An example of inference in a signaling network of myxobacterium Myxococcus xanthus is presented and available at Cytoprophet's website. http://cytoprophet.cse.nd.edu.
PinaColada: peptide-inhibitor ant colony ad-hoc design algorithm.
Zaidman, Daniel; Wolfson, Haim J
2016-08-01
Design of protein-protein interaction (PPI) inhibitors is a major challenge in Structural Bioinformatics. Peptides, especially short ones (5-15 amino acid long), are natural candidates for inhibition of protein-protein complexes due to several attractive features such as high structural compatibility with the protein binding site (mimicking the surface of one of the proteins), small size and the ability to form strong hotspot binding connections with the protein surface. Efficient rational peptide design is still a major challenge in computer aided drug design, due to the huge space of possible sequences, which is exponential in the length of the peptide, and the high flexibility of peptide conformations. In this article we present PinaColada, a novel computational method for the design of peptide inhibitors for protein-protein interactions. We employ a version of the ant colony optimization heuristic, which is used to explore the exponential space ([Formula: see text]) of length n peptide sequences, in combination with our fast robotics motivated PepCrawler algorithm, which explores the conformational space for each candidate sequence. PinaColada is being run in parallel, on a DELL PowerEdge 2.8 GHZ computer with 20 cores and 256 GB memory, and takes up to 24 h to design a peptide of 5-15 amino acids length. An online server available at: http://bioinfo3d.cs.tau.ac.il/PinaColada/. danielza@post.tau.ac.il; wolfson@tau.ac.il. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Raymond, Amy; Lovell, Scott; Lorimer, Don
2009-12-01
With the goal of improving yield and success rates of heterologous protein production for structural studies we have developed the database and algorithm software package Gene Composer. This freely available electronic tool facilitates the information-rich design of protein constructs and their engineered synthetic gene sequences, as detailed in the accompanying manuscript. In this report, we compare heterologous protein expression levels from native sequences to that of codon engineered synthetic gene constructs designed by Gene Composer. A test set of proteins including a human kinase (P38{alpha}), viral polymerase (HCV NS5B), and bacterial structural protein (FtsZ) were expressed in both E. colimore » and a cell-free wheat germ translation system. We also compare the protein expression levels in E. coli for a set of 11 different proteins with greatly varied G:C content and codon bias. The results consistently demonstrate that protein yields from codon engineered Gene Composer designs are as good as or better than those achieved from the synonymous native genes. Moreover, structure guided N- and C-terminal deletion constructs designed with the aid of Gene Composer can lead to greater success in gene to structure work as exemplified by the X-ray crystallographic structure determination of FtsZ from Bacillus subtilis. These results validate the Gene Composer algorithms, and suggest that using a combination of synthetic gene and protein construct engineering tools can improve the economics of gene to structure research.« less
Tandem Repeats in Proteins: Prediction Algorithms and Biological Role.
Pellegrini, Marco
2015-01-01
Tandem repetitions in protein sequence and structure is a fascinating subject of research which has been a focus of study since the late 1990s. In this survey, we give an overview on the multi-faceted aspects of research on protein tandem repeats (PTR for short), including prediction algorithms, databases, early classification efforts, mechanisms of PTR formation and evolution, and synthetic PTR design. We also touch on the rather open issue of the relationship between PTR and flexibility (or disorder) in proteins. Detection of PTR either from protein sequence or structure data is challenging due to inherent high (biological) signal-to-noise ratio that is a key feature of this problem. As early in silico analytic tools have been key enablers for starting this field of study, we expect that current and future algorithmic and statistical breakthroughs will have a high impact on the investigations of the biological role of PTR.
Gene Composer: database software for protein construct design, codon engineering, and gene synthesis
Lorimer, Don; Raymond, Amy; Walchli, John; Mixon, Mark; Barrow, Adrienne; Wallace, Ellen; Grice, Rena; Burgin, Alex; Stewart, Lance
2009-01-01
Background To improve efficiency in high throughput protein structure determination, we have developed a database software package, Gene Composer, which facilitates the information-rich design of protein constructs and their codon engineered synthetic gene sequences. With its modular workflow design and numerous graphical user interfaces, Gene Composer enables researchers to perform all common bio-informatics steps used in modern structure guided protein engineering and synthetic gene engineering. Results An interactive Alignment Viewer allows the researcher to simultaneously visualize sequence conservation in the context of known protein secondary structure, ligand contacts, water contacts, crystal contacts, B-factors, solvent accessible area, residue property type and several other useful property views. The Construct Design Module enables the facile design of novel protein constructs with altered N- and C-termini, internal insertions or deletions, point mutations, and desired affinity tags. The modifications can be combined and permuted into multiple protein constructs, and then virtually cloned in silico into defined expression vectors. The Gene Design Module uses a protein-to-gene algorithm that automates the back-translation of a protein amino acid sequence into a codon engineered nucleic acid gene sequence according to a selected codon usage table with minimal codon usage threshold, defined G:C% content, and desired sequence features achieved through synonymous codon selection that is optimized for the intended expression system. The gene-to-oligo algorithm of the Gene Design Module plans out all of the required overlapping oligonucleotides and mutagenic primers needed to synthesize the desired gene constructs by PCR, and for physically cloning them into selected vectors by the most popular subcloning strategies. Conclusion We present a complete description of Gene Composer functionality, and an efficient PCR-based synthetic gene assembly procedure with mis-match specific endonuclease error correction in combination with PIPE cloning. In a sister manuscript we present data on how Gene Composer designed genes and protein constructs can result in improved protein production for structural studies. PMID:19383142
Lorimer, Don; Raymond, Amy; Walchli, John; Mixon, Mark; Barrow, Adrienne; Wallace, Ellen; Grice, Rena; Burgin, Alex; Stewart, Lance
2009-04-21
To improve efficiency in high throughput protein structure determination, we have developed a database software package, Gene Composer, which facilitates the information-rich design of protein constructs and their codon engineered synthetic gene sequences. With its modular workflow design and numerous graphical user interfaces, Gene Composer enables researchers to perform all common bio-informatics steps used in modern structure guided protein engineering and synthetic gene engineering. An interactive Alignment Viewer allows the researcher to simultaneously visualize sequence conservation in the context of known protein secondary structure, ligand contacts, water contacts, crystal contacts, B-factors, solvent accessible area, residue property type and several other useful property views. The Construct Design Module enables the facile design of novel protein constructs with altered N- and C-termini, internal insertions or deletions, point mutations, and desired affinity tags. The modifications can be combined and permuted into multiple protein constructs, and then virtually cloned in silico into defined expression vectors. The Gene Design Module uses a protein-to-gene algorithm that automates the back-translation of a protein amino acid sequence into a codon engineered nucleic acid gene sequence according to a selected codon usage table with minimal codon usage threshold, defined G:C% content, and desired sequence features achieved through synonymous codon selection that is optimized for the intended expression system. The gene-to-oligo algorithm of the Gene Design Module plans out all of the required overlapping oligonucleotides and mutagenic primers needed to synthesize the desired gene constructs by PCR, and for physically cloning them into selected vectors by the most popular subcloning strategies. We present a complete description of Gene Composer functionality, and an efficient PCR-based synthetic gene assembly procedure with mis-match specific endonuclease error correction in combination with PIPE cloning. In a sister manuscript we present data on how Gene Composer designed genes and protein constructs can result in improved protein production for structural studies.
Computational Modeling of Proteins based on Cellular Automata: A Method of HP Folding Approximation.
Madain, Alia; Abu Dalhoum, Abdel Latif; Sleit, Azzam
2018-06-01
The design of a protein folding approximation algorithm is not straightforward even when a simplified model is used. The folding problem is a combinatorial problem, where approximation and heuristic algorithms are usually used to find near optimal folds of proteins primary structures. Approximation algorithms provide guarantees on the distance to the optimal solution. The folding approximation approach proposed here depends on two-dimensional cellular automata to fold proteins presented in a well-studied simplified model called the hydrophobic-hydrophilic model. Cellular automata are discrete computational models that rely on local rules to produce some overall global behavior. One-third and one-fourth approximation algorithms choose a subset of the hydrophobic amino acids to form H-H contacts. Those algorithms start with finding a point to fold the protein sequence into two sides where one side ignores H's at even positions and the other side ignores H's at odd positions. In addition, blocks or groups of amino acids fold the same way according to a predefined normal form. We intend to improve approximation algorithms by considering all hydrophobic amino acids and folding based on the local neighborhood instead of using normal forms. The CA does not assume a fixed folding point. The proposed approach guarantees one half approximation minus the H-H endpoints. This lower bound guaranteed applies to short sequences only. This is proved as the core and the folds of the protein will have two identical sides for all short sequences.
Li, Guo-Zhong; Vissers, Johannes P C; Silva, Jeffrey C; Golick, Dan; Gorenstein, Marc V; Geromanos, Scott J
2009-03-01
A novel database search algorithm is presented for the qualitative identification of proteins over a wide dynamic range, both in simple and complex biological samples. The algorithm has been designed for the analysis of data originating from data independent acquisitions, whereby multiple precursor ions are fragmented simultaneously. Measurements used by the algorithm include retention time, ion intensities, charge state, and accurate masses on both precursor and product ions from LC-MS data. The search algorithm uses an iterative process whereby each iteration incrementally increases the selectivity, specificity, and sensitivity of the overall strategy. Increased specificity is obtained by utilizing a subset database search approach, whereby for each subsequent stage of the search, only those peptides from securely identified proteins are queried. Tentative peptide and protein identifications are ranked and scored by their relative correlation to a number of models of known and empirically derived physicochemical attributes of proteins and peptides. In addition, the algorithm utilizes decoy database techniques for automatically determining the false positive identification rates. The search algorithm has been tested by comparing the search results from a four-protein mixture, the same four-protein mixture spiked into a complex biological background, and a variety of other "system" type protein digest mixtures. The method was validated independently by data dependent methods, while concurrently relying on replication and selectivity. Comparisons were also performed with other commercially and publicly available peptide fragmentation search algorithms. The presented results demonstrate the ability to correctly identify peptides and proteins from data independent acquisition strategies with high sensitivity and specificity. They also illustrate a more comprehensive analysis of the samples studied; providing approximately 20% more protein identifications, compared to a more conventional data directed approach using the same identification criteria, with a concurrent increase in both sequence coverage and the number of modified peptides.
Courcelles, Mathieu; Coulombe-Huntington, Jasmin; Cossette, Émilie; Gingras, Anne-Claude; Thibault, Pierre; Tyers, Mike
2017-07-07
Protein cross-linking mass spectrometry (CL-MS) enables the sensitive detection of protein interactions and the inference of protein complex topology. The detection of chemical cross-links between protein residues can identify intra- and interprotein contact sites or provide physical constraints for molecular modeling of protein structure. Recent innovations in cross-linker design, sample preparation, mass spectrometry, and software tools have significantly improved CL-MS approaches. Although a number of algorithms now exist for the identification of cross-linked peptides from mass spectral data, a dearth of user-friendly analysis tools represent a practical bottleneck to the broad adoption of the approach. To facilitate the analysis of CL-MS data, we developed CLMSVault, a software suite designed to leverage existing CL-MS algorithms and provide intuitive and flexible tools for cross-platform data interpretation. CLMSVault stores and combines complementary information obtained from different cross-linkers and search algorithms. CLMSVault provides filtering, comparison, and visualization tools to support CL-MS analyses and includes a workflow for label-free quantification of cross-linked peptides. An embedded 3D viewer enables the visualization of quantitative data and the mapping of cross-linked sites onto PDB structural models. We demonstrate the application of CLMSVault for the analysis of a noncovalent Cdc34-ubiquitin protein complex cross-linked under different conditions. CLMSVault is open-source software (available at https://gitlab.com/courcelm/clmsvault.git ), and a live demo is available at http://democlmsvault.tyerslab.com/ .
Computational design of chimeric protein libraries for directed evolution.
Silberg, Jonathan J; Nguyen, Peter Q; Stevenson, Taylor
2010-01-01
The best approach for creating libraries of functional proteins with large numbers of nondisruptive amino acid substitutions is protein recombination, in which structurally related polypeptides are swapped among homologous proteins. Unfortunately, as more distantly related proteins are recombined, the fraction of variants having a disrupted structure increases. One way to enrich the fraction of folded and potentially interesting chimeras in these libraries is to use computational algorithms to anticipate which structural elements can be swapped without disturbing the integrity of a protein's structure. Herein, we describe how the algorithm Schema uses the sequences and structures of the parent proteins recombined to predict the structural disruption of chimeras, and we outline how dynamic programming can be used to find libraries with a range of amino acid substitution levels that are enriched in variants with low Schema disruption.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Murphy, Grant S.; Mills, Jeffrey L.; Miley, Michael J.
2015-10-15
Protein design tests our understanding of protein stability and structure. Successful design methods should allow the exploration of sequence space not found in nature. However, when redesigning naturally occurring protein structures, most fixed backbone design algorithms return amino acid sequences that share strong sequence identity with wild-type sequences, especially in the protein core. This behavior places a restriction on functional space that can be explored and is not consistent with observations from nature, where sequences of low identity have similar structures. Here, we allow backbone flexibility during design to mutate every position in the core (38 residues) of a four-helixmore » bundle protein. Only small perturbations to the backbone, 12 {angstrom}, were needed to entirely mutate the core. The redesigned protein, DRNN, is exceptionally stable (melting point >140C). An NMR and X-ray crystal structure show that the side chains and backbone were accurately modeled (all-atom RMSD = 1.3 {angstrom}).« less
Computational design of environmental sensors for the potent opioid fentanyl
Bick, Matthew J.; Greisen, Per J.; Morey, Kevin J.; ...
2017-09-19
Here, we describe the computational design of proteins that bind the potent analgesic fentanyl. Our approach employs a fast docking algorithm to find shape complementary ligand placement in protein scaffolds, followed by design of the surrounding residues to optimize binding affinity. Co-crystal structures of the highest affinity binder reveal a highly preorganized binding site, and an overall architecture and ligand placement in close agreement with the design model. We also use the designs to generate plant sensors for fentanyl by coupling ligand binding to design stability. The method should be generally useful for detecting toxic hydrophobic compounds in the environment.
Computational design of environmental sensors for the potent opioid fentanyl
Morey, Kevin J; Antunes, Mauricio S; La, David; Sankaran, Banumathi; Reymond, Luc; Johnsson, Kai; Medford, June I
2017-01-01
We describe the computational design of proteins that bind the potent analgesic fentanyl. Our approach employs a fast docking algorithm to find shape complementary ligand placement in protein scaffolds, followed by design of the surrounding residues to optimize binding affinity. Co-crystal structures of the highest affinity binder reveal a highly preorganized binding site, and an overall architecture and ligand placement in close agreement with the design model. We use the designs to generate plant sensors for fentanyl by coupling ligand binding to design stability. The method should be generally useful for detecting toxic hydrophobic compounds in the environment. PMID:28925919
Computational design of environmental sensors for the potent opioid fentanyl
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bick, Matthew J.; Greisen, Per J.; Morey, Kevin J.
Here, we describe the computational design of proteins that bind the potent analgesic fentanyl. Our approach employs a fast docking algorithm to find shape complementary ligand placement in protein scaffolds, followed by design of the surrounding residues to optimize binding affinity. Co-crystal structures of the highest affinity binder reveal a highly preorganized binding site, and an overall architecture and ligand placement in close agreement with the design model. We also use the designs to generate plant sensors for fentanyl by coupling ligand binding to design stability. The method should be generally useful for detecting toxic hydrophobic compounds in the environment.
Memetic algorithms for de novo motif-finding in biomedical sequences.
Bi, Chengpeng
2012-09-01
The objectives of this study are to design and implement a new memetic algorithm for de novo motif discovery, which is then applied to detect important signals hidden in various biomedical molecular sequences. In this paper, memetic algorithms are developed and tested in de novo motif-finding problems. Several strategies in the algorithm design are employed that are to not only efficiently explore the multiple sequence local alignment space, but also effectively uncover the molecular signals. As a result, there are a number of key features in the implementation of the memetic motif-finding algorithm (MaMotif), including a chromosome replacement operator, a chromosome alteration-aware local search operator, a truncated local search strategy, and a stochastic operation of local search imposed on individual learning. To test the new algorithm, we compare MaMotif with a few of other similar algorithms using simulated and experimental data including genomic DNA, primary microRNA sequences (let-7 family), and transmembrane protein sequences. The new memetic motif-finding algorithm is successfully implemented in C++, and exhaustively tested with various simulated and real biological sequences. In the simulation, it shows that MaMotif is the most time-efficient algorithm compared with others, that is, it runs 2 times faster than the expectation maximization (EM) method and 16 times faster than the genetic algorithm-based EM hybrid. In both simulated and experimental testing, results show that the new algorithm is compared favorably or superior to other algorithms. Notably, MaMotif is able to successfully discover the transcription factors' binding sites in the chromatin immunoprecipitation followed by massively parallel sequencing (ChIP-Seq) data, correctly uncover the RNA splicing signals in gene expression, and precisely find the highly conserved helix motif in the transmembrane protein sequences, as well as rightly detect the palindromic segments in the primary microRNA sequences. The memetic motif-finding algorithm is effectively designed and implemented, and its applications demonstrate it is not only time-efficient, but also exhibits excellent performance while compared with other popular algorithms. Copyright © 2012 Elsevier B.V. All rights reserved.
Choi, Yoonjoo; Verma, Deeptak; Griswold, Karl E; Bailey-Kellogg, Chris
2017-01-01
Therapeutic proteins are yielding ever more advanced and efficacious new drugs, but the biological origins of these highly effective therapeutics render them subject to immune surveillance within the patient's body. When recognized by the immune system as a foreign agent, protein drugs elicit a coordinated response that can manifest a range of clinical complications including rapid drug clearance, loss of functionality and efficacy, delayed infusion-like allergic reactions, more serious anaphylactic shock, and even induced auto-immunity. It is thus often necessary to deimmunize an exogenous protein in order to enable its clinical application; critically, the deimmunization process must also maintain the desired therapeutic activity.To meet the growing need for effective, efficient, and broadly applicable protein deimmunization technologies, we have developed the EpiSweep suite of protein design algorithms. EpiSweep seamlessly integrates computational prediction of immunogenic T cell epitopes with sequence- or structure-based assessment of the impacts of mutations on protein stability and function, in order to select combinations of mutations that make Pareto optimal trade-offs between the competing goals of low immunogenicity and high-level function. The methods are applicable both to the design of individual functionally deimmunized variants as well as the design of combinatorial libraries enriched in functionally deimmunized variants. After validating EpiSweep in a series of retrospective case studies providing comparisons to conventional approaches to T cell epitope deletion, we have experimentally demonstrated it to be highly effective in prospective application to deimmunization of a number of different therapeutic candidates. We conclude that our broadly applicable computational protein design algorithms guide the engineer towards the most promising deimmunized therapeutic candidates, and thereby have the potential to accelerate development of new protein drugs by shortening time frames and improving hit rates.
Choi, Yoonjoo; Verma, Deeptak; Griswold, Karl E.; Bailey-Kellogg, Chris
2016-01-01
Therapeutic proteins are yielding ever more advanced and efficacious new drugs, but the biological origins of these highly effective therapeutics renders them subject to immune surveillance within the patient’s body. When recognized by the immune system as a foreign agent, protein drugs elicit a coordinated response that can manifest a range of clinical complications including rapid drug clearance, loss of functionality and efficacy, delayed infusion-like allergic reactions, more serious anaphylactic shock, and even induced auto-immunity. It is thus often necessary to deimmunize an exogenous protein in order to enable its clinical application; critically, the deimmunization process must also maintain the desired therapeutic activity. To meet the growing need for effective, efficient, and broadly applicable protein deimmunization technologies, we have developed the EpiSweep suite of protein design algorithms. EpiSweep seamlessly integrates computational prediction of immunogenic T cell epitopes with sequence- or structure- based assessment of the impacts of mutations on protein stability and function, in order to select combinations of mutations that make Pareto optimal trade-offs between the competing goals of low immunogenicity and high-level function. The methods are applicable both to the design of individual functionally deimmunized variants as well as the design of combinatorial libraries enriched in functionally deimmunized variants. After validating EpiSweep in a series of retrospective case studies providing comparisons to conventional approaches to T cell epitope deletion, we have experimentally demonstrated it to be highly effective in prospective application to deimmunization of a number of different therapeutic candidates. We conclude that our broadly applicable computational protein design algorithms guide the engineer towards the most promising deimmunized therapeutic candidates, and thereby have the potential to accelerate development of new protein drugs by shortening time frames and improving hit rates. PMID:27914063
Accelerating large-scale protein structure alignments with graphics processing units
2012-01-01
Background Large-scale protein structure alignment, an indispensable tool to structural bioinformatics, poses a tremendous challenge on computational resources. To ensure structure alignment accuracy and efficiency, efforts have been made to parallelize traditional alignment algorithms in grid environments. However, these solutions are costly and of limited accessibility. Others trade alignment quality for speedup by using high-level characteristics of structure fragments for structure comparisons. Findings We present ppsAlign, a parallel protein structure Alignment framework designed and optimized to exploit the parallelism of Graphics Processing Units (GPUs). As a general-purpose GPU platform, ppsAlign could take many concurrent methods, such as TM-align and Fr-TM-align, into the parallelized algorithm design. We evaluated ppsAlign on an NVIDIA Tesla C2050 GPU card, and compared it with existing software solutions running on an AMD dual-core CPU. We observed a 36-fold speedup over TM-align, a 65-fold speedup over Fr-TM-align, and a 40-fold speedup over MAMMOTH. Conclusions ppsAlign is a high-performance protein structure alignment tool designed to tackle the computational complexity issues from protein structural data. The solution presented in this paper allows large-scale structure comparisons to be performed using massive parallel computing power of GPU. PMID:22357132
ProteMiner-SSM: a web server for efficient analysis of similar protein tertiary substructures.
Chang, Darby Tien-Hau; Chen, Chien-Yu; Chung, Wen-Chin; Oyang, Yen-Jen; Juan, Hsueh-Fen; Huang, Hsuan-Cheng
2004-07-01
Analysis of protein-ligand interactions is a fundamental issue in drug design. As the detailed and accurate analysis of protein-ligand interactions involves calculation of binding free energy based on thermodynamics and even quantum mechanics, which is highly expensive in terms of computing time, conformational and structural analysis of proteins and ligands has been widely employed as a screening process in computer-aided drug design. In this paper, a web server called ProteMiner-SSM designed for efficient analysis of similar protein tertiary substructures is presented. In one experiment reported in this paper, the web server has been exploited to obtain some clues about a biochemical hypothesis. The main distinction in the software design of the web server is the filtering process incorporated to expedite the analysis. The filtering process extracts the residues located in the caves of the protein tertiary structure for analysis and operates with O(nlogn) time complexity, where n is the number of residues in the protein. In comparison, the alpha-hull algorithm, which is a widely used algorithm in computer graphics for identifying those instances that are on the contour of a three-dimensional object, features O(n2) time complexity. Experimental results show that the filtering process presented in this paper is able to speed up the analysis by a factor ranging from 3.15 to 9.37 times. The ProteMiner-SSM web server can be found at http://proteminer.csie.ntu.edu.tw/. There is a mirror site at http://p4.sbl.bc.sinica.edu.tw/proteminer/.
Exploring the Energy Landscapes of Protein Folding Simulations with Bayesian Computation
Burkoff, Nikolas S.; Várnai, Csilla; Wells, Stephen A.; Wild, David L.
2012-01-01
Nested sampling is a Bayesian sampling technique developed to explore probability distributions localized in an exponentially small area of the parameter space. The algorithm provides both posterior samples and an estimate of the evidence (marginal likelihood) of the model. The nested sampling algorithm also provides an efficient way to calculate free energies and the expectation value of thermodynamic observables at any temperature, through a simple post processing of the output. Previous applications of the algorithm have yielded large efficiency gains over other sampling techniques, including parallel tempering. In this article, we describe a parallel implementation of the nested sampling algorithm and its application to the problem of protein folding in a Gō-like force field of empirical potentials that were designed to stabilize secondary structure elements in room-temperature simulations. We demonstrate the method by conducting folding simulations on a number of small proteins that are commonly used for testing protein-folding procedures. A topological analysis of the posterior samples is performed to produce energy landscape charts, which give a high-level description of the potential energy surface for the protein folding simulations. These charts provide qualitative insights into both the folding process and the nature of the model and force field used. PMID:22385859
Exploring the energy landscapes of protein folding simulations with Bayesian computation.
Burkoff, Nikolas S; Várnai, Csilla; Wells, Stephen A; Wild, David L
2012-02-22
Nested sampling is a Bayesian sampling technique developed to explore probability distributions localized in an exponentially small area of the parameter space. The algorithm provides both posterior samples and an estimate of the evidence (marginal likelihood) of the model. The nested sampling algorithm also provides an efficient way to calculate free energies and the expectation value of thermodynamic observables at any temperature, through a simple post processing of the output. Previous applications of the algorithm have yielded large efficiency gains over other sampling techniques, including parallel tempering. In this article, we describe a parallel implementation of the nested sampling algorithm and its application to the problem of protein folding in a Gō-like force field of empirical potentials that were designed to stabilize secondary structure elements in room-temperature simulations. We demonstrate the method by conducting folding simulations on a number of small proteins that are commonly used for testing protein-folding procedures. A topological analysis of the posterior samples is performed to produce energy landscape charts, which give a high-level description of the potential energy surface for the protein folding simulations. These charts provide qualitative insights into both the folding process and the nature of the model and force field used. Copyright © 2012 Biophysical Society. Published by Elsevier Inc. All rights reserved.
Wang, LiQiang; Li, CuiFeng
2014-10-01
A genetic algorithm (GA) coupled with multiple linear regression (MLR) was used to extract useful features from amino acids and g-gap dipeptides for distinguishing between thermophilic and non-thermophilic proteins. The method was trained by a benchmark dataset of 915 thermophilic and 793 non-thermophilic proteins. The method reached an overall accuracy of 95.4 % in a Jackknife test using nine amino acids, 38 0-gap dipeptides and 29 1-gap dipeptides. The accuracy as a function of protein size ranged between 85.8 and 96.9 %. The overall accuracies of three independent tests were 93, 93.4 and 91.8 %. The observed results of detecting thermophilic proteins suggest that the GA-MLR approach described herein should be a powerful method for selecting features that describe thermostabile machines and be an aid in the design of more stable proteins.
NASA Astrophysics Data System (ADS)
Lestari, D.; Raharjo, D.; Bustamam, A.; Abdillah, B.; Widhianto, W.
2017-07-01
Dengue virus consists of 10 different constituent proteins and are classified into 4 major serotypes (DEN 1 - DEN 4). This study was designed to perform clustering against 30 protein sequences of dengue virus taken from Virus Pathogen Database and Analysis Resource (VIPR) using Regularized Markov Clustering (R-MCL) algorithm and then we analyze the result. By using Python program 3.4, R-MCL algorithm produces 8 clusters with more than one centroid in several clusters. The number of centroid shows the density level of interaction. Protein interactions that are connected in a tissue, form a complex protein that serves as a specific biological process unit. The analysis of result shows the R-MCL clustering produces clusters of dengue virus family based on the similarity role of their constituent protein, regardless of serotypes.
Armean, Irina M; Lilley, Kathryn S; Trotter, Matthew W B; Pilkington, Nicholas C V; Holden, Sean B
2018-06-01
Protein-protein interactions (PPI) play a crucial role in our understanding of protein function and biological processes. The standardization and recording of experimental findings is increasingly stored in ontologies, with the Gene Ontology (GO) being one of the most successful projects. Several PPI evaluation algorithms have been based on the application of probabilistic frameworks or machine learning algorithms to GO properties. Here, we introduce a new training set design and machine learning based approach that combines dependent heterogeneous protein annotations from the entire ontology to evaluate putative co-complex protein interactions determined by empirical studies. PPI annotations are built combinatorically using corresponding GO terms and InterPro annotation. We use a S.cerevisiae high-confidence complex dataset as a positive training set. A series of classifiers based on Maximum Entropy and support vector machines (SVMs), each with a composite counterpart algorithm, are trained on a series of training sets. These achieve a high performance area under the ROC curve of ≤0.97, outperforming go2ppi-a previously established prediction tool for protein-protein interactions (PPI) based on Gene Ontology (GO) annotations. https://github.com/ima23/maxent-ppi. sbh11@cl.cam.ac.uk. Supplementary data are available at Bioinformatics online.
Automated design evolution of stereochemically randomized protein foldamers
NASA Astrophysics Data System (ADS)
Ranbhor, Ranjit; Kumar, Anil; Patel, Kirti; Ramakrishnan, Vibin; Durani, Susheel
2018-05-01
Diversification of chain stereochemistry opens up the possibilities of an ‘in principle’ increase in the design space of proteins. This huge increase in the sequence and consequent structural variation is aimed at the generation of smart materials. To diversify protein structure stereochemically, we introduced L- and D-α-amino acids as the design alphabet. With a sequence design algorithm, we explored the usage of specific variables such as chirality and the sequence of this alphabet in independent steps. With molecular dynamics, we folded stereochemically diverse homopolypeptides and evaluated their ‘fitness’ for possible design as protein-like foldamers. We propose a fitness function to prune the most optimal fold among 1000 structures simulated with an automated repetitive simulated annealing molecular dynamics (AR-SAMD) approach. The highly scored poly-leucine fold with sequence lengths of 24 and 30 amino acids were later sequence-optimized using a Dead End Elimination cum Monte Carlo based optimization tool. This paper demonstrates a novel approach for the de novo design of protein-like foldamers.
Rosetta:MSF: a modular framework for multi-state computational protein design.
Löffler, Patrick; Schmitz, Samuel; Hupfeld, Enrico; Sterner, Reinhard; Merkl, Rainer
2017-06-01
Computational protein design (CPD) is a powerful technique to engineer existing proteins or to design novel ones that display desired properties. Rosetta is a software suite including algorithms for computational modeling and analysis of protein structures and offers many elaborate protocols created to solve highly specific tasks of protein engineering. Most of Rosetta's protocols optimize sequences based on a single conformation (i. e. design state). However, challenging CPD objectives like multi-specificity design or the concurrent consideration of positive and negative design goals demand the simultaneous assessment of multiple states. This is why we have developed the multi-state framework MSF that facilitates the implementation of Rosetta's single-state protocols in a multi-state environment and made available two frequently used protocols. Utilizing MSF, we demonstrated for one of these protocols that multi-state design yields a 15% higher performance than single-state design on a ligand-binding benchmark consisting of structural conformations. With this protocol, we designed de novo nine retro-aldolases on a conformational ensemble deduced from a (βα)8-barrel protein. All variants displayed measurable catalytic activity, testifying to a high success rate for this concept of multi-state enzyme design.
Rosetta:MSF: a modular framework for multi-state computational protein design
Hupfeld, Enrico; Sterner, Reinhard
2017-01-01
Computational protein design (CPD) is a powerful technique to engineer existing proteins or to design novel ones that display desired properties. Rosetta is a software suite including algorithms for computational modeling and analysis of protein structures and offers many elaborate protocols created to solve highly specific tasks of protein engineering. Most of Rosetta’s protocols optimize sequences based on a single conformation (i. e. design state). However, challenging CPD objectives like multi-specificity design or the concurrent consideration of positive and negative design goals demand the simultaneous assessment of multiple states. This is why we have developed the multi-state framework MSF that facilitates the implementation of Rosetta’s single-state protocols in a multi-state environment and made available two frequently used protocols. Utilizing MSF, we demonstrated for one of these protocols that multi-state design yields a 15% higher performance than single-state design on a ligand-binding benchmark consisting of structural conformations. With this protocol, we designed de novo nine retro-aldolases on a conformational ensemble deduced from a (βα)8-barrel protein. All variants displayed measurable catalytic activity, testifying to a high success rate for this concept of multi-state enzyme design. PMID:28604768
Protein Engineering Approaches in the Post-Genomic Era.
Singh, Raushan K; Lee, Jung-Kul; Selvaraj, Chandrabose; Singh, Ranjitha; Li, Jinglin; Kim, Sang-Yong; Kalia, Vipin C
2018-01-01
Proteins are one of the most multifaceted macromolecules in living systems. Proteins have evolved to function under physiological conditions and, therefore, are not usually tolerant of harsh experimental and environmental conditions. The growing use of proteins in industrial processes as a greener alternative to chemical catalysts often demands constant innovation to improve their performance. Protein engineering aims to design new proteins or modify the sequence of a protein to create proteins with new or desirable functions. With the emergence of structural and functional genomics, protein engineering has been invigorated in the post-genomic era. The three-dimensional structures of proteins with known functions facilitate protein engineering approaches to design variants with desired properties. There are three major approaches of protein engineering research, namely, directed evolution, rational design, and de novo design. Rational design is an effective method of protein engineering when the threedimensional structure and mechanism of the protein is well known. In contrast, directed evolution does not require extensive information and a three-dimensional structure of the protein of interest. Instead, it involves random mutagenesis and selection to screen enzymes with desired properties. De novo design uses computational protein design algorithms to tailor synthetic proteins by using the three-dimensional structures of natural proteins and their folding rules. The present review highlights and summarizes recent protein engineering approaches, and their challenges and limitations in the post-genomic era. Copyright© Bentham Science Publishers; For any queries, please email at epub@benthamscience.org.
Playing biology's name game: identifying protein names in scientific text.
Hanisch, Daniel; Fluck, Juliane; Mevissen, Heinz-Theodor; Zimmer, Ralf
2003-01-01
A growing body of work is devoted to the extraction of protein or gene interaction information from the scientific literature. Yet, the basis for most extraction algorithms, i.e. the specific and sensitive recognition of protein and gene names and their numerous synonyms, has not been adequately addressed. Here we describe the construction of a comprehensive general purpose name dictionary and an accompanying automatic curation procedure based on a simple token model of protein names. We designed an efficient search algorithm to analyze all abstracts in MEDLINE in a reasonable amount of time on standard computers. The parameters of our method are optimized using machine learning techniques. Used in conjunction, these ingredients lead to good search performance. A supplementary web page is available at http://cartan.gmd.de/ProMiner/.
Direct Calculation of Protein Fitness Landscapes through Computational Protein Design
Au, Loretta; Green, David F.
2016-01-01
Naturally selected amino-acid sequences or experimentally derived ones are often the basis for understanding how protein three-dimensional conformation and function are determined by primary structure. Such sequences for a protein family comprise only a small fraction of all possible variants, however, representing the fitness landscape with limited scope. Explicitly sampling and characterizing alternative, unexplored protein sequences would directly identify fundamental reasons for sequence robustness (or variability), and we demonstrate that computational methods offer an efficient mechanism toward this end, on a large scale. The dead-end elimination and A∗ search algorithms were used here to find all low-energy single mutant variants, and corresponding structures of a G-protein heterotrimer, to measure changes in structural stability and binding interactions to define a protein fitness landscape. We established consistency between these algorithms with known biophysical and evolutionary trends for amino-acid substitutions, and could thus recapitulate known protein side-chain interactions and predict novel ones. PMID:26745411
Viricel, Clément; de Givry, Simon; Schiex, Thomas; Barbe, Sophie
2018-02-20
Accurate and economic methods to predict change in protein binding free energy upon mutation are imperative to accelerate the design of proteins for a wide range of applications. Free energy is defined by enthalpic and entropic contributions. Following the recent progresses of Artificial Intelligence-based algorithms for guaranteed NP-hard energy optimization and partition function computation, it becomes possible to quickly compute minimum energy conformations and to reliably estimate the entropic contribution of side-chains in the change of free energy of large protein interfaces. Using guaranteed Cost Function Network algorithms, Rosetta energy functions and Dunbrack's rotamer library, we developed and assessed EasyE and JayZ, two methods for binding affinity estimation that ignore or include conformational entropic contributions on a large benchmark of binding affinity experimental measures. If both approaches outperform most established tools, we observe that side-chain conformational entropy brings little or no improvement on most systems but becomes crucial in some rare cases. as open-source Python/C ++ code at sourcesup.renater.fr/projects/easy-jayz. thomas.schiex@inra.fr and sophie.barbe@insa-toulouse.fr. Supplementary data are available at Bioinformatics online.
Chira, Camelia; Horvath, Dragos; Dumitrescu, D
2011-07-30
Proteins are complex structures made of amino acids having a fundamental role in the correct functioning of living cells. The structure of a protein is the result of the protein folding process. However, the general principles that govern the folding of natural proteins into a native structure are unknown. The problem of predicting a protein structure with minimum-energy starting from the unfolded amino acid sequence is a highly complex and important task in molecular and computational biology. Protein structure prediction has important applications in fields such as drug design and disease prediction. The protein structure prediction problem is NP-hard even in simplified lattice protein models. An evolutionary model based on hill-climbing genetic operators is proposed for protein structure prediction in the hydrophobic - polar (HP) model. Problem-specific search operators are implemented and applied using a steepest-ascent hill-climbing approach. Furthermore, the proposed model enforces an explicit diversification stage during the evolution in order to avoid local optimum. The main features of the resulting evolutionary algorithm - hill-climbing mechanism and diversification strategy - are evaluated in a set of numerical experiments for the protein structure prediction problem to assess their impact to the efficiency of the search process. Furthermore, the emerging consolidated model is compared to relevant algorithms from the literature for a set of difficult bidimensional instances from lattice protein models. The results obtained by the proposed algorithm are promising and competitive with those of related methods.
De Novo Computational Design of Retro-Aldol Enzymes
Jiang, Lin; Althoff, Eric A.; Clemente, Fernando R.; Doyle, Lindsey; Röthlisberger, Daniela; Zanghellini, Alexandre; Gallaher, Jasmine L.; Betker, Jamie L.; Tanaka, Fujie; Barbas, Carlos F.; Hilvert, Donald; Houk, Kendall N.; Stoddard, Barry L.; Baker, David
2012-01-01
The creation of enzymes capable of catalyzing any desired chemical reaction is a grand challenge for computational protein design. Using new algorithms that rely on hashing techniques to construct active sites for multistep reactions, we designed retro-aldolases that use four different catalytic motifs to catalyze the breaking of a carbon-carbon bond in a nonnatural substrate. Of the 72 designs that were experimentally characterized, 32, spanning a range of protein folds, had detectable retro-aldolase activity. Designs that used an explicit water molecule to mediate proton shuffling were significantly more successful, with rate accelerations of up to four orders of magnitude and multiple turnovers, than those involving charged side-chain networks. The atomic accuracy of the design process was confirmed by the x-ray crystal structure of active designs embedded in two protein scaffolds, both of which were nearly superimposable on the design model. PMID:18323453
A Particle Swarm Optimization-Based Approach with Local Search for Predicting Protein Folding.
Yang, Cheng-Hong; Lin, Yu-Shiun; Chuang, Li-Yeh; Chang, Hsueh-Wei
2017-10-01
The hydrophobic-polar (HP) model is commonly used for predicting protein folding structures and hydrophobic interactions. This study developed a particle swarm optimization (PSO)-based algorithm combined with local search algorithms; specifically, the high exploration PSO (HEPSO) algorithm (which can execute global search processes) was combined with three local search algorithms (hill-climbing algorithm, greedy algorithm, and Tabu table), yielding the proposed HE-L-PSO algorithm. By using 20 known protein structures, we evaluated the performance of the HE-L-PSO algorithm in predicting protein folding in the HP model. The proposed HE-L-PSO algorithm exhibited favorable performance in predicting both short and long amino acid sequences with high reproducibility and stability, compared with seven reported algorithms. The HE-L-PSO algorithm yielded optimal solutions for all predicted protein folding structures. All HE-L-PSO-predicted protein folding structures possessed a hydrophobic core that is similar to normal protein folding.
Li, Dong; Pan, Zhisong; Hu, Guyu; Zhu, Zexuan; He, Shan
2017-03-14
Active modules are connected regions in biological network which show significant changes in expression over particular conditions. The identification of such modules is important since it may reveal the regulatory and signaling mechanisms that associate with a given cellular response. In this paper, we propose a novel active module identification algorithm based on a memetic algorithm. We propose a novel encoding/decoding scheme to ensure the connectedness of the identified active modules. Based on the scheme, we also design and incorporate a local search operator into the memetic algorithm to improve its performance. The effectiveness of proposed algorithm is validated on both small and large protein interaction networks.
AUTOBA: automation of backbone assignment from HN(C)N suite of experiments.
Borkar, Aditi; Kumar, Dinesh; Hosur, Ramakrishna V
2011-07-01
Development of efficient strategies and automation represent important milestones of progress in rapid structure determination efforts in proteomics research. In this context, we present here an efficient algorithm named as AUTOBA (Automatic Backbone Assignment) designed to automate the assignment protocol based on HN(C)N suite of experiments. Depending upon the spectral dispersion, the user can record 2D or 3D versions of the experiments for assignment. The algorithm uses as inputs: (i) protein primary sequence and (ii) peak-lists from user defined HN(C)N suite of experiments. In the end, one gets H(N), (15)N, C(α) and C' assignments (in common BMRB format) for the individual residues along the polypeptide chain. The success of the algorithm has been demonstrated, not only with experimental spectra recorded on two small globular proteins: ubiquitin (76 aa) and M-crystallin (85 aa), but also with simulated spectra of 27 other proteins using assignment data from the BMRB.
DNA-binding specificity prediction with FoldX.
Nadra, Alejandro D; Serrano, Luis; Alibés, Andreu
2011-01-01
With the advent of Synthetic Biology, a field between basic science and applied engineering, new computational tools are needed to help scientists reach their goal, their design, optimizing resources. In this chapter, we present a simple and powerful method to either know the DNA specificity of a wild-type protein or design new specificities by using the protein design algorithm FoldX. The only basic requirement is having a good resolution structure of the complex. Protein-DNA interaction design may aid the development of new parts designed to be orthogonal, decoupled, and precise in its target. Further, it could help to fine-tune the systems in terms of specificity, discrimination, and binding constants. In the age of newly developed devices and invented systems, computer-aided engineering promises to be an invaluable tool. Copyright © 2011 Elsevier Inc. All rights reserved.
Impact of computational structure-based methods on drug discovery.
Reynolds, Charles H
2014-01-01
Structure-based drug design has become an indispensible tool in drug discovery. The emergence of structure-based design is due to gains in structural biology that have provided exponential growth in the number of protein crystal structures, new computational algorithms and approaches for modeling protein-ligand interactions, and the tremendous growth of raw computer power in the last 30 years. Computer modeling and simulation have made major contributions to the discovery of many groundbreaking drugs in recent years. Examples are presented that highlight the evolution of computational structure-based design methodology, and the impact of that methodology on drug discovery.
Jian, Jhih-Wei; Elumalai, Pavadai; Pitti, Thejkiran; Wu, Chih Yuan; Tsai, Keng-Chang; Chang, Jeng-Yih; Peng, Hung-Pin; Yang, An-Suei
2016-01-01
Predicting ligand binding sites (LBSs) on protein structures, which are obtained either from experimental or computational methods, is a useful first step in functional annotation or structure-based drug design for the protein structures. In this work, the structure-based machine learning algorithm ISMBLab-LIG was developed to predict LBSs on protein surfaces with input attributes derived from the three-dimensional probability density maps of interacting atoms, which were reconstructed on the query protein surfaces and were relatively insensitive to local conformational variations of the tentative ligand binding sites. The prediction accuracy of the ISMBLab-LIG predictors is comparable to that of the best LBS predictors benchmarked on several well-established testing datasets. More importantly, the ISMBLab-LIG algorithm has substantial tolerance to the prediction uncertainties of computationally derived protein structure models. As such, the method is particularly useful for predicting LBSs not only on experimental protein structures without known LBS templates in the database but also on computationally predicted model protein structures with structural uncertainties in the tentative ligand binding sites. PMID:27513851
Layers: A molecular surface peeling algorithm and its applications to analyze protein structures
Karampudi, Naga Bhushana Rao; Bahadur, Ranjit Prasad
2015-01-01
We present an algorithm ‘Layers’ to peel the atoms of proteins as layers. Using Layers we show an efficient way to transform protein structures into 2D pattern, named residue transition pattern (RTP), which is independent of molecular orientations. RTP explains the folding patterns of proteins and hence identification of similarity between proteins is simple and reliable using RTP than with the standard sequence or structure based methods. Moreover, Layers generates a fine-tunable coarse model for the molecular surface by using non-random sampling. The coarse model can be used for shape comparison, protein recognition and ligand design. Additionally, Layers can be used to develop biased initial configuration of molecules for protein folding simulations. We have developed a random forest classifier to predict the RTP of a given polypeptide sequence. Layers is a standalone application; however, it can be merged with other applications to reduce the computational load when working with large datasets of protein structures. Layers is available freely at http://www.csb.iitkgp.ernet.in/applications/mol_layers/main. PMID:26553411
Sable, Rushikesh; Jois, Seetharama
2015-06-23
Blocking protein-protein interactions (PPI) using small molecules or peptides modulates biochemical pathways and has therapeutic significance. PPI inhibition for designing drug-like molecules is a new area that has been explored extensively during the last decade. Considering the number of available PPI inhibitor databases and the limited number of 3D structures available for proteins, docking and scoring methods play a major role in designing PPI inhibitors as well as stabilizers. Docking methods are used in the design of PPI inhibitors at several stages of finding a lead compound, including modeling the protein complex, screening for hot spots on the protein-protein interaction interface and screening small molecules or peptides that bind to the PPI interface. There are three major challenges to the use of docking on the relatively flat surfaces of PPI. In this review we will provide some examples of the use of docking in PPI inhibitor design as well as its limitations. The combination of experimental and docking methods with improved scoring function has thus far resulted in few success stories of PPI inhibitors for therapeutic purposes. Docking algorithms used for PPI are in the early stages, however, and as more data are available docking will become a highly promising area in the design of PPI inhibitors or stabilizers.
Automated design of degenerate codon libraries.
Mena, Marco A; Daugherty, Patrick S
2005-12-01
Degenerate codon libraries are frequently used in protein engineering and evolution studies but are often limited to targeting a small number of positions to adequately limit the search space. To mitigate this, codon degeneracy can be limited using heuristics or previous knowledge of the targeted positions. To automate design of libraries given a set of amino acid sequences, an algorithm (LibDesign) was developed that generates a set of possible degenerate codon libraries, their resulting size, and their score relative to a user-defined scoring function. A gene library of a specified size can then be constructed that is representative of the given amino acid distribution or that includes specific sequences or combinations thereof. LibDesign provides a new tool for automated design of high-quality protein libraries that more effectively harness existing sequence-structure information derived from multiple sequence alignment or computational protein design data.
Interaction sorting method for molecular dynamics on multi-core SIMD CPU architecture.
Matvienko, Sergey; Alemasov, Nikolay; Fomin, Eduard
2015-02-01
Molecular dynamics (MD) is widely used in computational biology for studying binding mechanisms of molecules, molecular transport, conformational transitions, protein folding, etc. The method is computationally expensive; thus, the demand for the development of novel, much more efficient algorithms is still high. Therefore, the new algorithm designed in 2007 and called interaction sorting (IS) clearly attracted interest, as it outperformed the most efficient MD algorithms. In this work, a new IS modification is proposed which allows the algorithm to utilize SIMD processor instructions. This paper shows that the improvement provides an additional gain in performance, 9% to 45% in comparison to the original IS method.
Lu, Stephen M.; Lu, Wuyuan; Qasim, M. A.; Anderson, Stephen; Apostol, Izydor; Ardelt, Wojciech; Bigler, Theresa; Chiang, Yi Wen; Cook, James; James, Michael N. G.; Kato, Ikunoshin; Kelly, Clyde; Kohr, William; Komiyama, Tomoko; Lin, Tiao-Yin; Ogawa, Michio; Otlewski, Jacek; Park, Soon-Jae; Qasim, Sabiha; Ranjbar, Michael; Tashiro, Misao; Warne, Nicholas; Whatley, Harry; Wieczorek, Anna; Wieczorek, Maciej; Wilusz, Tadeusz; Wynn, Richard; Zhang, Wenlei; Laskowski, Michael
2001-01-01
An additivity-based sequence to reactivity algorithm for the interaction of members of the Kazal family of protein inhibitors with six selected serine proteinases is described. Ten consensus variable contact positions in the inhibitor were identified, and the 19 possible variants at each of these positions were expressed. The free energies of interaction of these variants and the wild type were measured. For an additive system, this data set allows for the calculation of all possible sequences, subject to some restrictions. The algorithm was extensively tested. It is exceptionally fast so that all possible sequences can be predicted. The strongest, the most specific possible, and the least specific inhibitors were designed, and an evolutionary problem was solved. PMID:11171964
Strop, P.; Marinescu, A. M.; Mayo, S. L.
2000-01-01
Six helix surface positions of protein G (Gbeta1) were redesigned using a computational protein design algorithm, resulting in the five fold mutant Gbeta1m2. Gbeta1m2 is well folded with a circular dichroism spectrum nearly identical to that of Gbeta1, and a melting temperature of 91 degrees C, approximately 6 degrees C higher than that of Gbeta1. The crystal structure of Gbeta1m2 was solved to 2.0 A resolution by molecular replacement. The absence of hydrogen bond or salt bridge interactions between the designed residues in Gbeta1m2 suggests that the increased stability of Gbeta1m2 is due to increased helix propensity and more favorable helix dipole interactions. PMID:10933505
Shotgun Protein Sequencing with Meta-contig Assembly*
Guthals, Adrian; Clauser, Karl R.; Bandeira, Nuno
2012-01-01
Full-length de novo sequencing from tandem mass (MS/MS) spectra of unknown proteins such as antibodies or proteins from organisms with unsequenced genomes remains a challenging open problem. Conventional algorithms designed to individually sequence each MS/MS spectrum are limited by incomplete peptide fragmentation or low signal to noise ratios and tend to result in short de novo sequences at low sequencing accuracy. Our shotgun protein sequencing (SPS) approach was developed to ameliorate these limitations by first finding groups of unidentified spectra from the same peptides (contigs) and then deriving a consensus de novo sequence for each assembled set of spectra (contig sequences). But whereas SPS enables much more accurate reconstruction of de novo sequences longer than can be recovered from individual MS/MS spectra, it still requires error-tolerant matching to homologous proteins to group smaller contig sequences into full-length protein sequences, thus limiting its effectiveness on sequences from poorly annotated proteins. Using low and high resolution CID and high resolution HCD MS/MS spectra, we address this limitation with a Meta-SPS algorithm designed to overlap and further assemble SPS contigs into Meta-SPS de novo contig sequences extending as long as 100 amino acids at over 97% accuracy without requiring any knowledge of homologous protein sequences. We demonstrate Meta-SPS using distinct MS/MS data sets obtained with separate enzymatic digestions and discuss how the remaining de novo sequencing limitations relate to MS/MS acquisition settings. PMID:22798278
Shotgun protein sequencing with meta-contig assembly.
Guthals, Adrian; Clauser, Karl R; Bandeira, Nuno
2012-10-01
Full-length de novo sequencing from tandem mass (MS/MS) spectra of unknown proteins such as antibodies or proteins from organisms with unsequenced genomes remains a challenging open problem. Conventional algorithms designed to individually sequence each MS/MS spectrum are limited by incomplete peptide fragmentation or low signal to noise ratios and tend to result in short de novo sequences at low sequencing accuracy. Our shotgun protein sequencing (SPS) approach was developed to ameliorate these limitations by first finding groups of unidentified spectra from the same peptides (contigs) and then deriving a consensus de novo sequence for each assembled set of spectra (contig sequences). But whereas SPS enables much more accurate reconstruction of de novo sequences longer than can be recovered from individual MS/MS spectra, it still requires error-tolerant matching to homologous proteins to group smaller contig sequences into full-length protein sequences, thus limiting its effectiveness on sequences from poorly annotated proteins. Using low and high resolution CID and high resolution HCD MS/MS spectra, we address this limitation with a Meta-SPS algorithm designed to overlap and further assemble SPS contigs into Meta-SPS de novo contig sequences extending as long as 100 amino acids at over 97% accuracy without requiring any knowledge of homologous protein sequences. We demonstrate Meta-SPS using distinct MS/MS data sets obtained with separate enzymatic digestions and discuss how the remaining de novo sequencing limitations relate to MS/MS acquisition settings.
Zeng, Jianyang; Zhou, Pei; Donald, Bruce Randall
2011-01-01
One bottleneck in NMR structure determination lies in the laborious and time-consuming process of side-chain resonance and NOE assignments. Compared to the well-studied backbone resonance assignment problem, automated side-chain resonance and NOE assignments are relatively less explored. Most NOE assignment algorithms require nearly complete side-chain resonance assignments from a series of through-bond experiments such as HCCH-TOCSY or HCCCONH. Unfortunately, these TOCSY experiments perform poorly on large proteins. To overcome this deficiency, we present a novel algorithm, called NASCA (NOE Assignment and Side-Chain Assignment), to automate both side-chain resonance and NOE assignments and to perform high-resolution protein structure determination in the absence of any explicit through-bond experiment to facilitate side-chain resonance assignment, such as HCCH-TOCSY. After casting the assignment problem into a Markov Random Field (MRF), NASCA extends and applies combinatorial protein design algorithms to compute optimal assignments that best interpret the NMR data. The MRF captures the contact map information of the protein derived from NOESY spectra, exploits the backbone structural information determined by RDCs, and considers all possible side-chain rotamers. The complexity of the combinatorial search is reduced by using a dead-end elimination (DEE) algorithm, which prunes side-chain resonance assignments that are provably not part of the optimal solution. Then an A* search algorithm is employed to find a set of optimal side-chain resonance assignments that best fit the NMR data. These side-chain resonance assignments are then used to resolve the NOE assignment ambiguity and compute high-resolution protein structures. Tests on five proteins show that NASCA assigns resonances for more than 90% of side-chain protons, and achieves about 80% correct assignments. The final structures computed using the NOE distance restraints assigned by NASCA have backbone RMSD 0.8 – 1.5 Å from the reference structures determined by traditional NMR approaches. PMID:21706248
Goldenzweig, Adi; Goldsmith, Moshe; Hill, Shannon E; Gertman, Or; Laurino, Paola; Ashani, Yacov; Dym, Orly; Unger, Tamar; Albeck, Shira; Prilusky, Jaime; Lieberman, Raquel L; Aharoni, Amir; Silman, Israel; Sussman, Joel L; Tawfik, Dan S; Fleishman, Sarel J
2016-07-21
Upon heterologous overexpression, many proteins misfold or aggregate, thus resulting in low functional yields. Human acetylcholinesterase (hAChE), an enzyme mediating synaptic transmission, is a typical case of a human protein that necessitates mammalian systems to obtain functional expression. We developed a computational strategy and designed an AChE variant bearing 51 mutations that improved core packing, surface polarity, and backbone rigidity. This variant expressed at ∼2,000-fold higher levels in E. coli compared to wild-type hAChE and exhibited 20°C higher thermostability with no change in enzymatic properties or in the active-site configuration as determined by crystallography. To demonstrate broad utility, we similarly designed four other human and bacterial proteins. Testing at most three designs per protein, we obtained enhanced stability and/or higher yields of soluble and active protein in E. coli. Our algorithm requires only a 3D structure and several dozen sequences of naturally occurring homologs, and is available at http://pross.weizmann.ac.il. Copyright © 2016 The Author(s). Published by Elsevier Inc. All rights reserved.
Pandini, Alessandro; Fraccalvieri, Domenico; Bonati, Laura
2013-01-01
The biological function of proteins is strictly related to their molecular flexibility and dynamics: enzymatic activity, protein-protein interactions, ligand binding and allosteric regulation are important mechanisms involving protein motions. Computational approaches, such as Molecular Dynamics (MD) simulations, are now routinely used to study the intrinsic dynamics of target proteins as well as to complement molecular docking approaches. These methods have also successfully supported the process of rational design and discovery of new drugs. Identification of functionally relevant conformations is a key step in these studies. This is generally done by cluster analysis of the ensemble of structures in the MD trajectory. Recently Artificial Neural Network (ANN) approaches, in particular methods based on Self-Organising Maps (SOMs), have been reported performing more accurately and providing more consistent results than traditional clustering algorithms in various data-mining problems. In the specific case of conformational analysis, SOMs have been successfully used to compare multiple ensembles of protein conformations demonstrating a potential in efficiently detecting the dynamic signatures central to biological function. Moreover, examples of the use of SOMs to address problems relevant to other stages of the drug-design process, including clustering of docking poses, have been reported. In this contribution we review recent applications of ANN algorithms in analysing conformational and structural ensembles and we discuss their potential in computer-based approaches for medicinal chemistry.
In-Vivo Real-Time Control of Protein Expression from Endogenous and Synthetic Gene Networks
Orabona, Emanuele; De Stefano, Luca; Ferry, Mike; Hasty, Jeff; di Bernardo, Mario; di Bernardo, Diego
2014-01-01
We describe an innovative experimental and computational approach to control the expression of a protein in a population of yeast cells. We designed a simple control algorithm to automatically regulate the administration of inducer molecules to the cells by comparing the actual protein expression level in the cell population with the desired expression level. We then built an automated platform based on a microfluidic device, a time-lapse microscopy apparatus, and a set of motorized syringes, all controlled by a computer. We tested the platform to force yeast cells to express a desired fixed, or time-varying, amount of a reporter protein over thousands of minutes. The computer automatically switched the type of sugar administered to the cells, its concentration and its duration, according to the control algorithm. Our approach can be used to control expression of any protein, fused to a fluorescent reporter, provided that an external molecule known to (indirectly) affect its promoter activity is available. PMID:24831205
Discriminative structural approaches for enzyme active-site prediction.
Kato, Tsuyoshi; Nagano, Nozomi
2011-02-15
Predicting enzyme active-sites in proteins is an important issue not only for protein sciences but also for a variety of practical applications such as drug design. Because enzyme reaction mechanisms are based on the local structures of enzyme active-sites, various template-based methods that compare local structures in proteins have been developed to date. In comparing such local sites, a simple measurement, RMSD, has been used so far. This paper introduces new machine learning algorithms that refine the similarity/deviation for comparison of local structures. The similarity/deviation is applied to two types of applications, single template analysis and multiple template analysis. In the single template analysis, a single template is used as a query to search proteins for active sites, whereas a protein structure is examined as a query to discover the possible active-sites using a set of templates in the multiple template analysis. This paper experimentally illustrates that the machine learning algorithms effectively improve the similarity/deviation measurements for both the analyses.
NASA Astrophysics Data System (ADS)
Lerner, Michael G.; Meagher, Kristin L.; Carlson, Heather A.
2008-10-01
Use of solvent mapping, based on multiple-copy minimization (MCM) techniques, is common in structure-based drug discovery. The minima of small-molecule probes define locations for complementary interactions within a binding pocket. Here, we present improved methods for MCM. In particular, a Jarvis-Patrick (JP) method is outlined for grouping the final locations of minimized probes into physical clusters. This algorithm has been tested through a study of protein-protein interfaces, showing the process to be robust, deterministic, and fast in the mapping of protein "hot spots." Improvements in the initial placement of probe molecules are also described. A final application to HIV-1 protease shows how our automated technique can be used to partition data too complicated to analyze by hand. These new automated methods may be easily and quickly extended to other protein systems, and our clustering methodology may be readily incorporated into other clustering packages.
A traveling salesman approach for predicting protein functions.
Johnson, Olin; Liu, Jing
2006-10-12
Protein-protein interaction information can be used to predict unknown protein functions and to help study biological pathways. Here we present a new approach utilizing the classic Traveling Salesman Problem to study the protein-protein interactions and to predict protein functions in budding yeast Saccharomyces cerevisiae. We apply the global optimization tool from combinatorial optimization algorithms to cluster the yeast proteins based on the global protein interaction information. We then use this clustering information to help us predict protein functions. We use our algorithm together with the direct neighbor algorithm 1 on characterized proteins and compare the prediction accuracy of the two methods. We show our algorithm can produce better predictions than the direct neighbor algorithm, which only considers the immediate neighbors of the query protein. Our method is a promising one to be used as a general tool to predict functions of uncharacterized proteins and a successful sample of using computer science knowledge and algorithms to study biological problems.
A traveling salesman approach for predicting protein functions
Johnson, Olin; Liu, Jing
2006-01-01
Background Protein-protein interaction information can be used to predict unknown protein functions and to help study biological pathways. Results Here we present a new approach utilizing the classic Traveling Salesman Problem to study the protein-protein interactions and to predict protein functions in budding yeast Saccharomyces cerevisiae. We apply the global optimization tool from combinatorial optimization algorithms to cluster the yeast proteins based on the global protein interaction information. We then use this clustering information to help us predict protein functions. We use our algorithm together with the direct neighbor algorithm [1] on characterized proteins and compare the prediction accuracy of the two methods. We show our algorithm can produce better predictions than the direct neighbor algorithm, which only considers the immediate neighbors of the query protein. Conclusion Our method is a promising one to be used as a general tool to predict functions of uncharacterized proteins and a successful sample of using computer science knowledge and algorithms to study biological problems. PMID:17147783
Improved packing of protein side chains with parallel ant colonies.
Quan, Lijun; Lü, Qiang; Li, Haiou; Xia, Xiaoyan; Wu, Hongjie
2014-01-01
The accurate packing of protein side chains is important for many computational biology problems, such as ab initio protein structure prediction, homology modelling, and protein design and ligand docking applications. Many of existing solutions are modelled as a computational optimisation problem. As well as the design of search algorithms, most solutions suffer from an inaccurate energy function for judging whether a prediction is good or bad. Even if the search has found the lowest energy, there is no certainty of obtaining the protein structures with correct side chains. We present a side-chain modelling method, pacoPacker, which uses a parallel ant colony optimisation strategy based on sharing a single pheromone matrix. This parallel approach combines different sources of energy functions and generates protein side-chain conformations with the lowest energies jointly determined by the various energy functions. We further optimised the selected rotamers to construct subrotamer by rotamer minimisation, which reasonably improved the discreteness of the rotamer library. We focused on improving the accuracy of side-chain conformation prediction. For a testing set of 442 proteins, 87.19% of X1 and 77.11% of X12 angles were predicted correctly within 40° of the X-ray positions. We compared the accuracy of pacoPacker with state-of-the-art methods, such as CIS-RR and SCWRL4. We analysed the results from different perspectives, in terms of protein chain and individual residues. In this comprehensive benchmark testing, 51.5% of proteins within a length of 400 amino acids predicted by pacoPacker were superior to the results of CIS-RR and SCWRL4 simultaneously. Finally, we also showed the advantage of using the subrotamers strategy. All results confirmed that our parallel approach is competitive to state-of-the-art solutions for packing side chains. This parallel approach combines various sources of searching intelligence and energy functions to pack protein side chains. It provides a frame-work for combining different inaccuracy/usefulness objective functions by designing parallel heuristic search algorithms.
Structure based re-design of the binding specificity of anti-apoptotic Bcl-xL
Chen, T. Scott; Palacios, Hector; Keating, Amy E.
2012-01-01
Many native proteins are multi-specific and interact with numerous partners, which can confound analysis of their functions. Protein design provides a potential route to generating synthetic variants of native proteins with more selective binding profiles. Re-designed proteins could be used as research tools, diagnostics or therapeutics. In this work, we used a library screening approach to re-engineer the multi-specific anti-apoptotic protein Bcl-xL to remove its interactions with many of its binding partners, making it a high affinity and selective binder of the BH3 region of pro-apoptotic protein Bad. To overcome the enormity of the potential Bcl-xL sequence space, we developed and applied a computational/experimental framework that used protein structure information to generate focused combinatorial libraries. Sequence features were identified using structure-based modeling, and an optimization algorithm based on integer programming was used to select degenerate codons that maximally covered these features. A constraint on library size was used to ensure thorough sampling. Using yeast surface display to screen a designed library of Bcl-xL variants, we successfully identified a protein with ~1,000-fold improvement in binding specificity for the BH3 region of Bad over the BH3 region of Bim. Although negative design was targeted only against the BH3 region of Bim, the best re-designed protein was globally specific against binding to 10 other peptides corresponding to native BH3 motifs. Our design framework demonstrates an efficient route to highly specific protein binders and may readily be adapted for application to other design problems. PMID:23154169
A sequence-based hybrid predictor for identifying conformationally ambivalent regions in proteins.
Liu, Yu-Cheng; Yang, Meng-Han; Lin, Win-Li; Huang, Chien-Kang; Oyang, Yen-Jen
2009-12-03
Proteins are dynamic macromolecules which may undergo conformational transitions upon changes in environment. As it has been observed in laboratories that protein flexibility is correlated to essential biological functions, scientists have been designing various types of predictors for identifying structurally flexible regions in proteins. In this respect, there are two major categories of predictors. One category of predictors attempts to identify conformationally flexible regions through analysis of protein tertiary structures. Another category of predictors works completely based on analysis of the polypeptide sequences. As the availability of protein tertiary structures is generally limited, the design of predictors that work completely based on sequence information is crucial for advances of molecular biology research. In this article, we propose a novel approach to design a sequence-based predictor for identifying conformationally ambivalent regions in proteins. The novelty in the design stems from incorporating two classifiers based on two distinctive supervised learning algorithms that provide complementary prediction powers. Experimental results show that the overall performance delivered by the hybrid predictor proposed in this article is superior to the performance delivered by the existing predictors. Furthermore, the case study presented in this article demonstrates that the proposed hybrid predictor is capable of providing the biologists with valuable clues about the functional sites in a protein chain. The proposed hybrid predictor provides the users with two optional modes, namely, the high-sensitivity mode and the high-specificity mode. The experimental results with an independent testing data set show that the proposed hybrid predictor is capable of delivering sensitivity of 0.710 and specificity of 0.608 under the high-sensitivity mode, while delivering sensitivity of 0.451 and specificity of 0.787 under the high-specificity mode. Though experimental results show that the hybrid approach designed to exploit the complementary prediction powers of distinctive supervised learning algorithms works more effectively than conventional approaches, there exists a large room for further improvement with respect to the achieved performance. In this respect, it is of interest to investigate the effects of exploiting additional physiochemical properties that are related to conformational ambivalence. Furthermore, it is of interest to investigate the effects of incorporating lately-developed machine learning approaches, e.g. the random forest design and the multi-stage design. As conformational transition plays a key role in carrying out several essential types of biological functions, the design of more advanced predictors for identifying conformationally ambivalent regions in proteins deserves our continuous attention.
ProtaBank: A repository for protein design and engineering data.
Wang, Connie Y; Chang, Paul M; Ary, Marie L; Allen, Benjamin D; Chica, Roberto A; Mayo, Stephen L; Olafson, Barry D
2018-03-25
We present ProtaBank, a repository for storing, querying, analyzing, and sharing protein design and engineering data in an actively maintained and updated database. ProtaBank provides a format to describe and compare all types of protein mutational data, spanning a wide range of properties and techniques. It features a user-friendly web interface and programming layer that streamlines data deposition and allows for batch input and queries. The database schema design incorporates a standard format for reporting protein sequences and experimental data that facilitates comparison of results across different data sets. A suite of analysis and visualization tools are provided to facilitate discovery, to guide future designs, and to benchmark and train new predictive tools and algorithms. ProtaBank will provide a valuable resource to the protein engineering community by storing and safeguarding newly generated data, allowing for fast searching and identification of relevant data from the existing literature, and exploring correlations between disparate data sets. ProtaBank invites researchers to contribute data to the database to make it accessible for search and analysis. ProtaBank is available at https://protabank.org. © 2018 The Authors Protein Science published by Wiley Periodicals, Inc. on behalf of The Protein Society.
Automatic design of decision-tree induction algorithms tailored to flexible-receptor docking data.
Barros, Rodrigo C; Winck, Ana T; Machado, Karina S; Basgalupp, Márcio P; de Carvalho, André C P L F; Ruiz, Duncan D; de Souza, Osmar Norberto
2012-11-21
This paper addresses the prediction of the free energy of binding of a drug candidate with enzyme InhA associated with Mycobacterium tuberculosis. This problem is found within rational drug design, where interactions between drug candidates and target proteins are verified through molecular docking simulations. In this application, it is important not only to correctly predict the free energy of binding, but also to provide a comprehensible model that could be validated by a domain specialist. Decision-tree induction algorithms have been successfully used in drug-design related applications, specially considering that decision trees are simple to understand, interpret, and validate. There are several decision-tree induction algorithms available for general-use, but each one has a bias that makes it more suitable for a particular data distribution. In this article, we propose and investigate the automatic design of decision-tree induction algorithms tailored to particular drug-enzyme binding data sets. We investigate the performance of our new method for evaluating binding conformations of different drug candidates to InhA, and we analyze our findings with respect to decision tree accuracy, comprehensibility, and biological relevance. The empirical analysis indicates that our method is capable of automatically generating decision-tree induction algorithms that significantly outperform the traditional C4.5 algorithm with respect to both accuracy and comprehensibility. In addition, we provide the biological interpretation of the rules generated by our approach, reinforcing the importance of comprehensible predictive models in this particular bioinformatics application. We conclude that automatically designing a decision-tree algorithm tailored to molecular docking data is a promising alternative for the prediction of the free energy from the binding of a drug candidate with a flexible-receptor.
Automatic design of decision-tree induction algorithms tailored to flexible-receptor docking data
2012-01-01
Background This paper addresses the prediction of the free energy of binding of a drug candidate with enzyme InhA associated with Mycobacterium tuberculosis. This problem is found within rational drug design, where interactions between drug candidates and target proteins are verified through molecular docking simulations. In this application, it is important not only to correctly predict the free energy of binding, but also to provide a comprehensible model that could be validated by a domain specialist. Decision-tree induction algorithms have been successfully used in drug-design related applications, specially considering that decision trees are simple to understand, interpret, and validate. There are several decision-tree induction algorithms available for general-use, but each one has a bias that makes it more suitable for a particular data distribution. In this article, we propose and investigate the automatic design of decision-tree induction algorithms tailored to particular drug-enzyme binding data sets. We investigate the performance of our new method for evaluating binding conformations of different drug candidates to InhA, and we analyze our findings with respect to decision tree accuracy, comprehensibility, and biological relevance. Results The empirical analysis indicates that our method is capable of automatically generating decision-tree induction algorithms that significantly outperform the traditional C4.5 algorithm with respect to both accuracy and comprehensibility. In addition, we provide the biological interpretation of the rules generated by our approach, reinforcing the importance of comprehensible predictive models in this particular bioinformatics application. Conclusions We conclude that automatically designing a decision-tree algorithm tailored to molecular docking data is a promising alternative for the prediction of the free energy from the binding of a drug candidate with a flexible-receptor. PMID:23171000
F2Dock: Fast Fourier Protein-Protein Docking
Bajaj, Chandrajit; Chowdhury, Rezaul; Siddavanahalli, Vinay
2009-01-01
The functions of proteins is often realized through their mutual interactions. Determining a relative transformation for a pair of proteins and their conformations which form a stable complex, reproducible in nature, is known as docking. It is an important step in drug design, structure determination and understanding function and structure relationships. In this paper we extend our non-uniform fast Fourier transform docking algorithm to include an adaptive search phase (both translational and rotational) and thereby speed up its execution. We have also implemented a multithreaded version of the adaptive docking algorithm for even faster execution on multicore machines. We call this protein-protein docking code F2Dock (F2 = Fast Fourier). We have calibrated F2Dock based on an extensive experimental study on a list of benchmark complexes and conclude that F2Dock works very well in practice. Though all docking results reported in this paper use shape complementarity and Coulombic potential based scores only, F2Dock is structured to incorporate Lennard-Jones potential and re-ranking docking solutions based on desolvation energy. PMID:21071796
Graph-based optimization of epitope coverage for vaccine antigen design
Theiler, James Patrick; Korber, Bette Tina Marie
2017-01-29
Epigraph is a recently developed algorithm that enables the computationally efficient design of single or multi-antigen vaccines to maximize the potential epitope coverage for a diverse pathogen population. Potential epitopes are defined as short contiguous stretches of proteins, comparable in length to T-cell epitopes. This optimal coverage problem can be formulated in terms of a directed graph, with candidate antigens represented as paths that traverse this graph. Epigraph protein sequences can also be used as the basis for designing peptides for experimental evaluation of immune responses in natural infections to highly variable proteins. The epigraph tool suite also enables rapidmore » characterization of populations of diverse sequences from an immunological perspective. Fundamental distance measures are based on immunologically relevant shared potential epitope frequencies, rather than simple Hamming or phylogenetic distances. Here, we provide a mathematical description of the epigraph algorithm, include a comparison of different heuristics that can be used when graphs are not acyclic, and we describe an additional tool we have added to the web-based epigraph tool suite that provides frequency summaries of all distinct potential epitopes in a population. Lastly, we also show examples of the graphical output and summary tables that can be generated using the epigraph tool suite and explain their content and applications.« less
Graph-based optimization of epitope coverage for vaccine antigen design
DOE Office of Scientific and Technical Information (OSTI.GOV)
Theiler, James Patrick; Korber, Bette Tina Marie
Epigraph is a recently developed algorithm that enables the computationally efficient design of single or multi-antigen vaccines to maximize the potential epitope coverage for a diverse pathogen population. Potential epitopes are defined as short contiguous stretches of proteins, comparable in length to T-cell epitopes. This optimal coverage problem can be formulated in terms of a directed graph, with candidate antigens represented as paths that traverse this graph. Epigraph protein sequences can also be used as the basis for designing peptides for experimental evaluation of immune responses in natural infections to highly variable proteins. The epigraph tool suite also enables rapidmore » characterization of populations of diverse sequences from an immunological perspective. Fundamental distance measures are based on immunologically relevant shared potential epitope frequencies, rather than simple Hamming or phylogenetic distances. Here, we provide a mathematical description of the epigraph algorithm, include a comparison of different heuristics that can be used when graphs are not acyclic, and we describe an additional tool we have added to the web-based epigraph tool suite that provides frequency summaries of all distinct potential epitopes in a population. Lastly, we also show examples of the graphical output and summary tables that can be generated using the epigraph tool suite and explain their content and applications.« less
DeepSite: protein-binding site predictor using 3D-convolutional neural networks.
Jiménez, J; Doerr, S; Martínez-Rosell, G; Rose, A S; De Fabritiis, G
2017-10-01
An important step in structure-based drug design consists in the prediction of druggable binding sites. Several algorithms for detecting binding cavities, those likely to bind to a small drug compound, have been developed over the years by clever exploitation of geometric, chemical and evolutionary features of the protein. Here we present a novel knowledge-based approach that uses state-of-the-art convolutional neural networks, where the algorithm is learned by examples. In total, 7622 proteins from the scPDB database of binding sites have been evaluated using both a distance and a volumetric overlap approach. Our machine-learning based method demonstrates superior performance to two other competitive algorithmic strategies. DeepSite is freely available at www.playmolecule.org. Users can submit either a PDB ID or PDB file for pocket detection to our NVIDIA GPU-equipped servers through a WebGL graphical interface. gianni.defabritiis@upf.edu. Supplementary data are available at Bioinformatics online. © The Author (2017). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
Conformational stability as a design target to control protein aggregation.
Costanzo, Joseph A; O'Brien, Christopher J; Tiller, Kathryn; Tamargo, Erin; Robinson, Anne Skaja; Roberts, Christopher J; Fernandez, Erik J
2014-05-01
Non-native protein aggregation is a prevalent problem occurring in many biotechnological manufacturing processes and can compromise the biological activity of the target molecule or induce an undesired immune response. Additionally, some non-native aggregation mechanisms lead to amyloid fibril formation, which can be associated with debilitating diseases. For natively folded proteins, partial or complete unfolding is often required to populate aggregation-prone conformational states, and therefore one proposed strategy to mitigate aggregation is to increase the free energy for unfolding (ΔGunf) prior to aggregation. A computational design approach was tested using human γD crystallin (γD-crys) as a model multi-domain protein. Two mutational strategies were tested for their ability to reduce/increase aggregation rates by increasing/decreasing ΔGunf: stabilizing the less stable domain and stabilizing the domain-domain interface. The computational protein design algorithm, RosettaDesign, was implemented to identify point variants. The results showed that although the predicted free energies were only weakly correlated with the experimental ΔGunf values, increased/decreased aggregation rates for γD-crys correlated reasonably well with decreases/increases in experimental ΔGunf, illustrating improved conformational stability as a possible design target to mitigate aggregation. However, the results also illustrate that conformational stability is not the sole design factor controlling aggregation rates of natively folded proteins.
Genetically modified proteins: functional improvement and chimeragenesis
Balabanova, Larissa; Golotin, Vasily; Podvolotskaya, Anna; Rasskazov, Valery
2015-01-01
This review focuses on the emerging role of site-specific mutagenesis and chimeragenesis for the functional improvement of proteins in areas where traditional protein engineering methods have been extensively used and practically exhausted. The novel path for the creation of the novel proteins has been created on the farther development of the new structure and sequence optimization algorithms for generating and designing the accurate structure models in result of x-ray crystallography studies of a lot of proteins and their mutant forms. Artificial genetic modifications aim to expand nature's repertoire of biomolecules. One of the most exciting potential results of mutagenesis or chimeragenesis finding could be design of effective diagnostics, bio-therapeutics and biocatalysts. A sampling of recent examples is listed below for the in vivo and in vitro genetically improvement of various binding protein and enzyme functions, with references for more in-depth study provided for the reader's benefit. PMID:26211369
Parallel, stochastic measurement of molecular surface area.
Juba, Derek; Varshney, Amitabh
2008-08-01
Biochemists often wish to compute surface areas of proteins. A variety of algorithms have been developed for this task, but they are designed for traditional single-processor architectures. The current trend in computer hardware is towards increasingly parallel architectures for which these algorithms are not well suited. We describe a parallel, stochastic algorithm for molecular surface area computation that maps well to the emerging multi-core architectures. Our algorithm is also progressive, providing a rough estimate of surface area immediately and refining this estimate as time goes on. Furthermore, the algorithm generates points on the molecular surface which can be used for point-based rendering. We demonstrate a GPU implementation of our algorithm and show that it compares favorably with several existing molecular surface computation programs, giving fast estimates of the molecular surface area with good accuracy.
FPGA accelerator for protein secondary structure prediction based on the GOR algorithm
2011-01-01
Background Protein is an important molecule that performs a wide range of functions in biological systems. Recently, the protein folding attracts much more attention since the function of protein can be generally derived from its molecular structure. The GOR algorithm is one of the most successful computational methods and has been widely used as an efficient analysis tool to predict secondary structure from protein sequence. However, the execution time is still intolerable with the steep growth in protein database. Recently, FPGA chips have emerged as one promising application accelerator to accelerate bioinformatics algorithms by exploiting fine-grained custom design. Results In this paper, we propose a complete fine-grained parallel hardware implementation on FPGA to accelerate the GOR-IV package for 2D protein structure prediction. To improve computing efficiency, we partition the parameter table into small segments and access them in parallel. We aggressively exploit data reuse schemes to minimize the need for loading data from external memory. The whole computation structure is carefully pipelined to overlap the sequence loading, computing and back-writing operations as much as possible. We implemented a complete GOR desktop system based on an FPGA chip XC5VLX330. Conclusions The experimental results show a speedup factor of more than 430x over the original GOR-IV version and 110x speedup over the optimized version with multi-thread SIMD implementation running on a PC platform with AMD Phenom 9650 Quad CPU for 2D protein structure prediction. However, the power consumption is only about 30% of that of current general-propose CPUs. PMID:21342582
RNA design using simulated SHAPE data.
Lotfi, Mohadeseh; Zare-Mirakabad, Fatemeh; Montaseri, Soheila
2018-05-03
It has long been established that in addition to being involved in protein translation, RNA plays essential roles in numerous other cellular processes, including gene regulation and DNA replication. Such roles are known to be dictated by higher-order structures of RNA molecules. It is therefore of prime importance to find an RNA sequence that can fold to acquire a particular function that is desirable for use in pharmaceuticals and basic research. The challenge of finding an RNA sequence for a given structure is known as the RNA design problem. Although there are several algorithms to solve this problem, they mainly consider hard constraints, such as minimum free energy, to evaluate the predicted sequences. Recently, SHAPE data has emerged as a new soft constraint for RNA secondary structure prediction. To take advantage of this new experimental constraint, we report here a new method for accurate design of RNA sequences based on their secondary structures using SHAPE data as pseudo-free energy. We then compare our algorithm with four others: INFO-RNA, ERD, MODENA and RNAifold 2.0. Our algorithm precisely predicts 26 out of 29 new sequences for the structures extracted from the Rfam dataset, while the other four algorithms predict no more than 22 out of 29. The proposed algorithm is comparable to the above algorithms on RNA-SSD datasets, where they can predict up to 33 appropriate sequences for RNA secondary structures out of 34.
Ye, Kai; Kosters, Walter A; Ijzerman, Adriaan P
2007-03-15
Pattern discovery in protein sequences is often based on multiple sequence alignments (MSA). The procedure can be computationally intensive and often requires manual adjustment, which may be particularly difficult for a set of deviating sequences. In contrast, two algorithms, PRATT2 (http//www.ebi.ac.uk/pratt/) and TEIRESIAS (http://cbcsrv.watson.ibm.com/) are used to directly identify frequent patterns from unaligned biological sequences without an attempt to align them. Here we propose a new algorithm with more efficiency and more functionality than both PRATT2 and TEIRESIAS, and discuss some of its applications to G protein-coupled receptors, a protein family of important drug targets. In this study, we designed and implemented six algorithms to mine three different pattern types from either one or two datasets using a pattern growth approach. We compared our approach to PRATT2 and TEIRESIAS in efficiency, completeness and the diversity of pattern types. Compared to PRATT2, our approach is faster, capable of processing large datasets and able to identify the so-called type III patterns. Our approach is comparable to TEIRESIAS in the discovery of the so-called type I patterns but has additional functionality such as mining the so-called type II and type III patterns and finding discriminating patterns between two datasets. The source code for pattern growth algorithms and their pseudo-code are available at http://www.liacs.nl/home/kosters/pg/.
MINE: Module Identification in Networks
2011-01-01
Background Graphical models of network associations are useful for both visualizing and integrating multiple types of association data. Identifying modules, or groups of functionally related gene products, is an important challenge in analyzing biological networks. However, existing tools to identify modules are insufficient when applied to dense networks of experimentally derived interaction data. To address this problem, we have developed an agglomerative clustering method that is able to identify highly modular sets of gene products within highly interconnected molecular interaction networks. Results MINE outperforms MCODE, CFinder, NEMO, SPICi, and MCL in identifying non-exclusive, high modularity clusters when applied to the C. elegans protein-protein interaction network. The algorithm generally achieves superior geometric accuracy and modularity for annotated functional categories. In comparison with the most closely related algorithm, MCODE, the top clusters identified by MINE are consistently of higher density and MINE is less likely to designate overlapping modules as a single unit. MINE offers a high level of granularity with a small number of adjustable parameters, enabling users to fine-tune cluster results for input networks with differing topological properties. Conclusions MINE was created in response to the challenge of discovering high quality modules of gene products within highly interconnected biological networks. The algorithm allows a high degree of flexibility and user-customisation of results with few adjustable parameters. MINE outperforms several popular clustering algorithms in identifying modules with high modularity and obtains good overall recall and precision of functional annotations in protein-protein interaction networks from both S. cerevisiae and C. elegans. PMID:21605434
Protein-Protein Docking in Drug Design and Discovery.
Kaczor, Agnieszka A; Bartuzi, Damian; Stępniewski, Tomasz Maciej; Matosiuk, Dariusz; Selent, Jana
2018-01-01
Protein-protein interactions (PPIs) are responsible for a number of key physiological processes in the living cells and underlie the pathomechanism of many diseases. Nowadays, along with the concept of so-called "hot spots" in protein-protein interactions, which are well-defined interface regions responsible for most of the binding energy, these interfaces can be targeted with modulators. In order to apply structure-based design techniques to design PPIs modulators, a three-dimensional structure of protein complex has to be available. In this context in silico approaches, in particular protein-protein docking, are a valuable complement to experimental methods for elucidating 3D structure of protein complexes. Protein-protein docking is easy to use and does not require significant computer resources and time (in contrast to molecular dynamics) and it results in 3D structure of a protein complex (in contrast to sequence-based methods of predicting binding interfaces). However, protein-protein docking cannot address all the aspects of protein dynamics, in particular the global conformational changes during protein complex formation. In spite of this fact, protein-protein docking is widely used to model complexes of water-soluble proteins and less commonly to predict structures of transmembrane protein assemblies, including dimers and oligomers of G protein-coupled receptors (GPCRs). In this chapter we review the principles of protein-protein docking, available algorithms and software and discuss the recent examples, benefits, and drawbacks of protein-protein docking application to water-soluble proteins, membrane anchoring and transmembrane proteins, including GPCRs.
Arnold, Roland; Goldenberg, Florian; Mewes, Hans-Werner; Rattei, Thomas
2014-01-01
The Similarity Matrix of Proteins (SIMAP, http://mips.gsf.de/simap/) database has been designed to massively accelerate computationally expensive protein sequence analysis tasks in bioinformatics. It provides pre-calculated sequence similarities interconnecting the entire known protein sequence universe, complemented by pre-calculated protein features and domains, similarity clusters and functional annotations. SIMAP covers all major public protein databases as well as many consistently re-annotated metagenomes from different repositories. As of September 2013, SIMAP contains >163 million proteins corresponding to ∼70 million non-redundant sequences. SIMAP uses the sensitive FASTA search heuristics, the Smith–Waterman alignment algorithm, the InterPro database of protein domain models and the BLAST2GO functional annotation algorithm. SIMAP assists biologists by facilitating the interactive exploration of the protein sequence universe. Web-Service and DAS interfaces allow connecting SIMAP with any other bioinformatic tool and resource. All-against-all protein sequence similarity matrices of project-specific protein collections are generated on request. Recent improvements allow SIMAP to cover the rapidly growing sequenced protein sequence universe. New Web-Service interfaces enhance the connectivity of SIMAP. Novel tools for interactive extraction of protein similarity networks have been added. Open access to SIMAP is provided through the web portal; the portal also contains instructions and links for software access and flat file downloads. PMID:24165881
ASPeak: an abundance sensitive peak detection algorithm for RIP-Seq.
Kucukural, Alper; Özadam, Hakan; Singh, Guramrit; Moore, Melissa J; Cenik, Can
2013-10-01
Unlike DNA, RNA abundances can vary over several orders of magnitude. Thus, identification of RNA-protein binding sites from high-throughput sequencing data presents unique challenges. Although peak identification in ChIP-Seq data has been extensively explored, there are few bioinformatics tools tailored for peak calling on analogous datasets for RNA-binding proteins. Here we describe ASPeak (abundance sensitive peak detection algorithm), an implementation of an algorithm that we previously applied to detect peaks in exon junction complex RNA immunoprecipitation in tandem experiments. Our peak detection algorithm yields stringent and robust target sets enabling sensitive motif finding and downstream functional analyses. ASPeak is implemented in Perl as a complete pipeline that takes bedGraph files as input. ASPeak implementation is freely available at https://sourceforge.net/projects/as-peak under the GNU General Public License. ASPeak can be run on a personal computer, yet is designed to be easily parallelizable. ASPeak can also run on high performance computing clusters providing efficient speedup. The documentation and user manual can be obtained from http://master.dl.sourceforge.net/project/as-peak/manual.pdf.
Second-order Poisson Nernst-Planck solver for ion channel transport
Zheng, Qiong; Chen, Duan; Wei, Guo-Wei
2010-01-01
The Poisson Nernst-Planck (PNP) theory is a simplified continuum model for a wide variety of chemical, physical and biological applications. Its ability of providing quantitative explanation and increasingly qualitative predictions of experimental measurements has earned itself much recognition in the research community. Numerous computational algorithms have been constructed for the solution of the PNP equations. However, in the realistic ion-channel context, no second order convergent PNP algorithm has ever been reported in the literature, due to many numerical obstacles, including discontinuous coefficients, singular charges, geometric singularities, and nonlinear couplings. The present work introduces a number of numerical algorithms to overcome the abovementioned numerical challenges and constructs the first second-order convergent PNP solver in the ion-channel context. First, a Dirichlet to Neumann mapping (DNM) algorithm is designed to alleviate the charge singularity due to the protein structure. Additionally, the matched interface and boundary (MIB) method is reformulated for solving the PNP equations. The MIB method systematically enforces the interface jump conditions and achieves the second order accuracy in the presence of complex geometry and geometric singularities of molecular surfaces. Moreover, two iterative schemes are utilized to deal with the coupled nonlinear equations. Furthermore, extensive and rigorous numerical validations are carried out over a number of geometries, including a sphere, two proteins and an ion channel, to examine the numerical accuracy and convergence order of the present numerical algorithms. Finally, application is considered to a real transmembrane protein, the Gramicidin A channel protein. The performance of the proposed numerical techniques is tested against a number of factors, including mesh sizes, diffusion coefficient profiles, iterative schemes, ion concentrations, and applied voltages. Numerical predictions are compared with experimental measurements. PMID:21552336
A Markov Random Field Framework for Protein Side-Chain Resonance Assignment
NASA Astrophysics Data System (ADS)
Zeng, Jianyang; Zhou, Pei; Donald, Bruce Randall
Nuclear magnetic resonance (NMR) spectroscopy plays a critical role in structural genomics, and serves as a primary tool for determining protein structures, dynamics and interactions in physiologically-relevant solution conditions. The current speed of protein structure determination via NMR is limited by the lengthy time required in resonance assignment, which maps spectral peaks to specific atoms and residues in the primary sequence. Although numerous algorithms have been developed to address the backbone resonance assignment problem [68,2,10,37,14,64,1,31,60], little work has been done to automate side-chain resonance assignment [43, 48, 5]. Most previous attempts in assigning side-chain resonances depend on a set of NMR experiments that record through-bond interactions with side-chain protons for each residue. Unfortunately, these NMR experiments have low sensitivity and limited performance on large proteins, which makes it difficult to obtain enough side-chain resonance assignments. On the other hand, it is essential to obtain almost all of the side-chain resonance assignments as a prerequisite for high-resolution structure determination. To overcome this deficiency, we present a novel side-chain resonance assignment algorithm based on alternative NMR experiments measuring through-space interactions between protons in the protein, which also provide crucial distance restraints and are normally required in high-resolution structure determination. We cast the side-chain resonance assignment problem into a Markov Random Field (MRF) framework, and extend and apply combinatorial protein design algorithms to compute the optimal solution that best interprets the NMR data. Our MRF framework captures the contact map information of the protein derived from NMR spectra, and exploits the structural information available from the backbone conformations determined by orientational restraints and a set of discretized side-chain conformations (i.e., rotamers). A Hausdorff-based computation is employed in the scoring function to evaluate the probability of side-chain resonance assignments to generate the observed NMR spectra. The complexity of the assignment problem is first reduced by using a dead-end elimination (DEE) algorithm, which prunes side-chain resonance assignments that are provably not part of the optimal solution. Then an A* search algorithm is used to find a set of optimal side-chain resonance assignments that best fit the NMR data. We have tested our algorithm on NMR data for five proteins, including the FF Domain 2 of human transcription elongation factor CA150 (FF2), the B1 domain of Protein G (GB1), human ubiquitin, the ubiquitin-binding zinc finger domain of the human Y-family DNA polymerase Eta (pol η UBZ), and the human Set2-Rpb1 interacting domain (hSRI). Our algorithm assigns resonances for more than 90% of the protons in the proteins, and achieves about 80% correct side-chain resonance assignments. The final structures computed using distance restraints resulting from the set of assigned side-chain resonances have backbone RMSD 0.5 - 1.4 Å and all-heavy-atom RMSD 1.0 - 2.2 Å from the reference structures that were determined by X-ray crystallography or traditional NMR approaches. These results demonstrate that our algorithm can be successfully applied to automate side-chain resonance assignment and high-quality protein structure determination. Since our algorithm does not require any specific NMR experiments for measuring the through-bond interactions with side-chain protons, it can save a significant amount of both experimental cost and spectrometer time, and hence accelerate the NMR structure determination process.
KISS for STRAP: user extensions for a protein alignment editor.
Gille, Christoph; Lorenzen, Stephan; Michalsky, Elke; Frömmel, Cornelius
2003-12-12
The Structural Alignment Program STRAP is a comfortable comprehensive editor and analyzing tool for protein alignments. A wide range of functions related to protein sequences and protein structures are accessible with an intuitive graphical interface. Recent features include mapping of mutations and polymorphisms onto structures and production of high quality figures for publication. Here we address the general problem of multi-purpose program packages to keep up with the rapid development of bioinformatical methods and the demand for specific program functions. STRAP was remade implementing a novel design which aims at Keeping Interfaces in STRAP Simple (KISS). KISS renders STRAP extendable to bio-scientists as well as to bio-informaticians. Scientists with basic computer skills are capable of implementing statistical methods or embedding existing bioinformatical tools in STRAP themselves. For bio-informaticians STRAP may serve as an environment for rapid prototyping and testing of complex algorithms such as automatic alignment algorithms or phylogenetic methods. Further, STRAP can be applied as an interactive web applet to present data related to a particular protein family and as a teaching tool. JAVA-1.4 or higher. http://www.charite.de/bioinf/strap/
Implementation of a parallel protein structure alignment service on cloud.
Hung, Che-Lun; Lin, Yaw-Ling
2013-01-01
Protein structure alignment has become an important strategy by which to identify evolutionary relationships between protein sequences. Several alignment tools are currently available for online comparison of protein structures. In this paper, we propose a parallel protein structure alignment service based on the Hadoop distribution framework. This service includes a protein structure alignment algorithm, a refinement algorithm, and a MapReduce programming model. The refinement algorithm refines the result of alignment. To process vast numbers of protein structures in parallel, the alignment and refinement algorithms are implemented using MapReduce. We analyzed and compared the structure alignments produced by different methods using a dataset randomly selected from the PDB database. The experimental results verify that the proposed algorithm refines the resulting alignments more accurately than existing algorithms. Meanwhile, the computational performance of the proposed service is proportional to the number of processors used in our cloud platform.
Implementation of a Parallel Protein Structure Alignment Service on Cloud
Hung, Che-Lun; Lin, Yaw-Ling
2013-01-01
Protein structure alignment has become an important strategy by which to identify evolutionary relationships between protein sequences. Several alignment tools are currently available for online comparison of protein structures. In this paper, we propose a parallel protein structure alignment service based on the Hadoop distribution framework. This service includes a protein structure alignment algorithm, a refinement algorithm, and a MapReduce programming model. The refinement algorithm refines the result of alignment. To process vast numbers of protein structures in parallel, the alignment and refinement algorithms are implemented using MapReduce. We analyzed and compared the structure alignments produced by different methods using a dataset randomly selected from the PDB database. The experimental results verify that the proposed algorithm refines the resulting alignments more accurately than existing algorithms. Meanwhile, the computational performance of the proposed service is proportional to the number of processors used in our cloud platform. PMID:23671842
Identification of structural domains in proteins by a graph heuristic.
Wernisch, L; Hunting, M; Wodak, S J
1999-05-15
A novel automatic procedure for identifying domains from protein atomic coordinates is presented. The procedure, termed STRUDL (STRUctural Domain Limits), does not take into account information on secondary structures and handles any number of domains made up of contiguous or non-contiguous chain segments. The core algorithm uses the Kernighan-Lin graph heuristic to partition the protein into residue sets which display minimum interactions between them. These interactions are deduced from the weighted Voronoi diagram. The generated partitions are accepted or rejected on the basis of optimized criteria, representing basic expected physical properties of structural domains. The graph heuristic approach is shown to be very effective, it approximates closely the exact solution provided by a branch and bound algorithm for a number of test proteins. In addition, the overall performance of STRUDL is assessed on a set of 787 representative proteins from the Protein Data Bank by comparison to domain definitions in the CATH protein classification. The domains assigned by STRUDL agree with the CATH assignments in at least 81% of the tested proteins. This result is comparable to that obtained previously using PUU (Holm and Sander, Proteins 1994;9:256-268), the only other available algorithm designed to identify domains with any number of non-contiguous chain segments. A detailed discussion of the structures for which our assignments differ from those in CATH brings to light some clear inconsistencies between the concept of structural domains based on minimizing inter-domain interactions and that of delimiting structural motifs that represent acceptable folding topologies or architectures. Considering both concepts as complementary and combining them in a layered approach might be the way forward.
Ho, Michelle L; Adler, Benjamin A; Torre, Michael L; Silberg, Jonathan J; Suh, Junghae
2013-12-20
Adeno-associated virus (AAV) recombination can result in chimeric capsid protein subunits whose ability to assemble into an oligomeric capsid, package a genome, and transduce cells depends on the inheritance of sequence from different AAV parents. To develop quantitative design principles for guiding site-directed recombination of AAV capsids, we have examined how capsid structural perturbations predicted by the SCHEMA algorithm correlate with experimental measurements of disruption in seventeen chimeric capsid proteins. In our small chimera population, created by recombining AAV serotypes 2 and 4, we found that protection of viral genomes and cellular transduction were inversely related to calculated disruption of the capsid structure. Interestingly, however, we did not observe a correlation between genome packaging and calculated structural disruption; a majority of the chimeric capsid proteins formed at least partially assembled capsids and more than half packaged genomes, including those with the highest SCHEMA disruption. These results suggest that the sequence space accessed by recombination of divergent AAV serotypes is rich in capsid chimeras that assemble into 60-mer capsids and package viral genomes. Overall, the SCHEMA algorithm may be useful for delineating quantitative design principles to guide the creation of libraries enriched in genome-protecting virus nanoparticles that can effectively transduce cells. Such improvements to the virus design process may help advance not only gene therapy applications but also other bionanotechnologies dependent upon the development of viruses with new sequences and functions.
Ho, Michelle L.; Adler, Benjamin A.; Torre, Michael L.; Silberg, Jonathan J.; Suh, Junghae
2013-01-01
Adeno-associated virus (AAV) recombination can result in chimeric capsid protein subunits whose ability to assemble into an oligomeric capsid, package a genome, and transduce cells depends on the inheritance of sequence from different AAV parents. To develop quantitative design principles for guiding site-directed recombination of AAV capsids, we have examined how capsid structural perturbations predicted by the SCHEMA algorithm correlate with experimental measurements of disruption in seventeen chimeric capsid proteins. In our small chimera population, created by recombining AAV serotypes 2 and 4, we found that protection of viral genomes and cellular transduction were inversely related to calculated disruption of the capsid structure. Interestingly, however, we did not observe a correlation between genome packaging and calculated structural disruption; a majority of the chimeric capsid proteins formed at least partially assembled capsids and more than half packaged genomes, including those with the highest SCHEMA disruption. These results suggest that the sequence space accessed by recombination of divergent AAV serotypes is rich in capsid chimeras that assemble into 60-mer capsids and package viral genomes. Overall, the SCHEMA algorithm may be useful for delineating quantitative design principles to guide the creation of libraries enriched in genome-protecting virus nanoparticles that can effectively transduce cells. Such improvements to the virus design process may help advance not only gene therapy applications, but also other bionanotechnologies dependent upon the development of viruses with new sequences and functions. PMID:23899192
Improved hybrid optimization algorithm for 3D protein structure prediction.
Zhou, Changjun; Hou, Caixia; Wei, Xiaopeng; Zhang, Qiang
2014-07-01
A new improved hybrid optimization algorithm - PGATS algorithm, which is based on toy off-lattice model, is presented for dealing with three-dimensional protein structure prediction problems. The algorithm combines the particle swarm optimization (PSO), genetic algorithm (GA), and tabu search (TS) algorithms. Otherwise, we also take some different improved strategies. The factor of stochastic disturbance is joined in the particle swarm optimization to improve the search ability; the operations of crossover and mutation that are in the genetic algorithm are changed to a kind of random liner method; at last tabu search algorithm is improved by appending a mutation operator. Through the combination of a variety of strategies and algorithms, the protein structure prediction (PSP) in a 3D off-lattice model is achieved. The PSP problem is an NP-hard problem, but the problem can be attributed to a global optimization problem of multi-extremum and multi-parameters. This is the theoretical principle of the hybrid optimization algorithm that is proposed in this paper. The algorithm combines local search and global search, which overcomes the shortcoming of a single algorithm, giving full play to the advantage of each algorithm. In the current universal standard sequences, Fibonacci sequences and real protein sequences are certified. Experiments show that the proposed new method outperforms single algorithms on the accuracy of calculating the protein sequence energy value, which is proved to be an effective way to predict the structure of proteins.
Development of sensor-based nitrogen recommendation algorithms for cereal crops
NASA Astrophysics Data System (ADS)
Asebedo, Antonio Ray
Nitrogen (N) management is one of the most recognizable components of farming both within and outside the world of agriculture. Interest over the past decade has greatly increased in improving N management systems in corn (Zea mays) and winter wheat (Triticum aestivum ) to have high NUE, high yield, and be environmentally sustainable. Nine winter wheat experiments were conducted across seven locations from 2011 through 2013. The objectives of this study were to evaluate the impacts of fall-winter, Feekes 4, Feekes 7, and Feekes 9 N applications on winter wheat grain yield, grain protein, and total grain N uptake. Nitrogen treatments were applied as single or split applications in the fall-winter, and top-dressed in the spring at Feekes 4, Feekes 7, and Feekes 9 with applied N rates ranging from 0 to 134 kg ha-1. Results indicate that Feekes 7 and 9 N applications provide more optimal combinations of grain yield, grain protein levels, and fertilizer N recovered in the grain when compared to comparable rates of N applied in the fall-winter or at Feekes 4. Winter wheat N management studies from 2006 through 2013 were utilized to develop sensor-based N recommendation algorithms for winter wheat in Kansas. Algorithm RosieKat v.2.6 was designed for multiple N application strategies and utilized N reference strips for establishing N response potential. Algorithm NRS v1.5 addressed single top-dress N applications and does not require a N reference strip. In 2013, field validations of both algorithms were conducted at eight locations across Kansas. Results show algorithm RK v2.6 consistently provided highly efficient N recommendations for improving NUE, while achieving high grain yield and grain protein. Without the use of the N reference strip, NRS v1.5 performed statistically equal to the KSU soil test N recommendation in regards to grain yield but with lower applied N rates. Six corn N fertigation experiments were conducted at KSU irrigated experiment fields from 2012 through 2014 to evaluate the previously developed KSU sensor-based N recommendation algorithm in corn N fertigation systems. Results indicate that the current KSU corn algorithm was effective at achieving high yields, but has the tendency to overestimate N requirements. To optimize sensor-based N recommendations for N fertigation systems, algorithms must be specifically designed for these systems to take advantage of their full capabilities, thus allowing implementation of high NUE N management systems.
Optimizing doped libraries by using genetic algorithms
NASA Astrophysics Data System (ADS)
Tomandl, Dirk; Schober, Andreas; Schwienhorst, Andreas
1997-01-01
The insertion of random sequences into protein-encoding genes in combination with biologicalselection techniques has become a valuable tool in the design of molecules that have usefuland possibly novel properties. By employing highly effective screening protocols, a functionaland unique structure that had not been anticipated can be distinguished among a hugecollection of inactive molecules that together represent all possible amino acid combinations.This technique is severely limited by its restriction to a library of manageable size. Oneapproach for limiting the size of a mutant library relies on `doping schemes', where subsetsof amino acids are generated that reveal only certain combinations of amino acids in a proteinsequence. Three mononucleotide mixtures for each codon concerned must be designed, suchthat the resulting codons that are assembled during chemical gene synthesis represent thedesired amino acid mixture on the level of the translated protein. In this paper we present adoping algorithm that `reverse translates' a desired mixture of certain amino acids into threemixtures of mononucleotides. The algorithm is designed to optimally bias these mixturestowards the codons of choice. This approach combines a genetic algorithm with localoptimization strategies based on the downhill simplex method. Disparate relativerepresentations of all amino acids (and stop codons) within a target set can be generated.Optional weighing factors are employed to emphasize the frequencies of certain amino acidsand their codon usage, and to compensate for reaction rates of different mononucleotidebuilding blocks (synthons) during chemical DNA synthesis. The effect of statistical errors thataccompany an experimental realization of calculated nucleotide mixtures on the generatedmixtures of amino acids is simulated. These simulations show that the robustness of differentoptima with respect to small deviations from calculated values depends on their concomitantfitness. Furthermore, the calculations probe the fitness landscape locally and allow apreliminary assessment of its structure.
Wallace, A. C.; Borkakoti, N.; Thornton, J. M.
1997-01-01
It is well established that sequence templates such as those in the PROSITE and PRINTS databases are powerful tools for predicting the biological function and tertiary structure for newly derived protein sequences. The number of X-ray and NMR protein structures is increasing rapidly and it is apparent that a 3D equivalent of the sequence templates is needed. Here, we describe an algorithm called TESS that automatically derives 3D templates from structures deposited in the Brookhaven Protein Data Bank. While a new sequence can be searched for sequence patterns, a new structure can be scanned against these 3D templates to identify functional sites. As examples, 3D templates are derived for enzymes with an O-His-O "catalytic triad" and for the ribonucleases and lysozymes. When these 3D templates are applied to a large data set of nonidentical proteins, several interesting hits are located. This suggests that the development of a 3D template database may help to identify the function of new protein structures, if unknown, as well as to design proteins with specific functions. PMID:9385633
An efficient algorithm for pairwise local alignment of protein interaction networks
Chen, Wenbin; Schmidt, Matthew; Tian, Wenhong; ...
2015-04-01
Recently, researchers seeking to understand, modify, and create beneficial traits in organisms have looked for evolutionarily conserved patterns of protein interactions. Their conservation likely means that the proteins of these conserved functional modules are important to the trait's expression. In this paper, we formulate the problem of identifying these conserved patterns as a graph optimization problem, and develop a fast heuristic algorithm for this problem. We compare the performance of our network alignment algorithm to that of the MaWISh algorithm [Koyuturk M, Kim Y, Topkara U, Subramaniam S, Szpankowski W, Grama A, Pairwise alignment of protein interaction networks, J Computmore » Biol 13(2): 182-199, 2006.], which bases its search algorithm on a related decision problem formulation. We find that our algorithm discovers conserved modules with a larger number of proteins in an order of magnitude less time. In conclusion, the protein sets found by our algorithm correspond to known conserved functional modules at comparable precision and recall rates as those produced by the MaWISh algorithm.« less
Improved packing of protein side chains with parallel ant colonies
2014-01-01
Introduction The accurate packing of protein side chains is important for many computational biology problems, such as ab initio protein structure prediction, homology modelling, and protein design and ligand docking applications. Many of existing solutions are modelled as a computational optimisation problem. As well as the design of search algorithms, most solutions suffer from an inaccurate energy function for judging whether a prediction is good or bad. Even if the search has found the lowest energy, there is no certainty of obtaining the protein structures with correct side chains. Methods We present a side-chain modelling method, pacoPacker, which uses a parallel ant colony optimisation strategy based on sharing a single pheromone matrix. This parallel approach combines different sources of energy functions and generates protein side-chain conformations with the lowest energies jointly determined by the various energy functions. We further optimised the selected rotamers to construct subrotamer by rotamer minimisation, which reasonably improved the discreteness of the rotamer library. Results We focused on improving the accuracy of side-chain conformation prediction. For a testing set of 442 proteins, 87.19% of X1 and 77.11% of X12 angles were predicted correctly within 40° of the X-ray positions. We compared the accuracy of pacoPacker with state-of-the-art methods, such as CIS-RR and SCWRL4. We analysed the results from different perspectives, in terms of protein chain and individual residues. In this comprehensive benchmark testing, 51.5% of proteins within a length of 400 amino acids predicted by pacoPacker were superior to the results of CIS-RR and SCWRL4 simultaneously. Finally, we also showed the advantage of using the subrotamers strategy. All results confirmed that our parallel approach is competitive to state-of-the-art solutions for packing side chains. Conclusions This parallel approach combines various sources of searching intelligence and energy functions to pack protein side chains. It provides a frame-work for combining different inaccuracy/usefulness objective functions by designing parallel heuristic search algorithms. PMID:25474164
Wang, Nanyi; Wang, Lirong; Xie, Xiang-Qun
2017-11-27
Molecular docking is widely applied to computer-aided drug design and has become relatively mature in the recent decades. Application of docking in modeling varies from single lead compound optimization to large-scale virtual screening. The performance of molecular docking is highly dependent on the protein structures selected. It is especially challenging for large-scale target prediction research when multiple structures are available for a single target. Therefore, we have established ProSelection, a docking preferred-protein selection algorithm, in order to generate the proper structure subset(s). By the ProSelection algorithm, protein structures of "weak selectors" are filtered out whereas structures of "strong selectors" are kept. Specifically, the structure which has a good statistical performance of distinguishing active ligands from inactive ligands is defined as a strong selector. In this study, 249 protein structures of 14 autophagy-related targets are investigated. Surflex-dock was used as the docking engine to distinguish active and inactive compounds against these protein structures. Both t test and Mann-Whitney U test were used to distinguish the strong from the weak selectors based on the normality of the docking score distribution. The suggested docking score threshold for active ligands (SDA) was generated for each strong selector structure according to the receiver operating characteristic (ROC) curve. The performance of ProSelection was further validated by predicting the potential off-targets of 43 U.S. Federal Drug Administration approved small molecule antineoplastic drugs. Overall, ProSelection will accelerate the computational work in protein structure selection and could be a useful tool for molecular docking, target prediction, and protein-chemical database establishment research.
Genetic algorithms for protein threading.
Yadgari, J; Amir, A; Unger, R
1998-01-01
Despite many years of efforts, a direct prediction of protein structure from sequence is still not possible. As a result, in the last few years researchers have started to address the "inverse folding problem": Identifying and aligning a sequence to the fold with which it is most compatible, a process known as "threading". In two meetings in which protein folding predictions were objectively evaluated, it became clear that threading as a concept promises a real breakthrough, but that much improvement is still needed in the technique itself. Threading is a NP-hard problem, and thus no general polynomial solution can be expected. Still a practical approach with demonstrated ability to find optimal solutions in many cases, and acceptable solutions in other cases, is needed. We applied the technique of Genetic Algorithms in order to significantly improve the ability of threading algorithms to find the optimal alignment of a sequence to a structure, i.e. the alignment with the minimum free energy. A major progress reported here is the design of a representation of the threading alignment as a string of fixed length. With this representation validation of alignments and genetic operators are effectively implemented. Appropriate data structure and parameters have been selected. It is shown that Genetic Algorithm threading is effective and is able to find the optimal alignment in a few test cases. Furthermore, the described algorithm is shown to perform well even without pre-definition of core elements. Existing threading methods are dependent on such constraints to make their calculations feasible. But the concept of core elements is inherently arbitrary and should be avoided if possible. While a rigorous proof is hard to submit yet an, we present indications that indeed Genetic Algorithm threading is capable of finding consistently good solutions of full alignments in search spaces of size up to 10(70).
He, Lu; Friedman, Alan M; Bailey-Kellogg, Chris
2012-03-01
In developing improved protein variants by site-directed mutagenesis or recombination, there are often competing objectives that must be considered in designing an experiment (selecting mutations or breakpoints): stability versus novelty, affinity versus specificity, activity versus immunogenicity, and so forth. Pareto optimal experimental designs make the best trade-offs between competing objectives. Such designs are not "dominated"; that is, no other design is better than a Pareto optimal design for one objective without being worse for another objective. Our goal is to produce all the Pareto optimal designs (the Pareto frontier), to characterize the trade-offs and suggest designs most worth considering, but to avoid explicitly considering the large number of dominated designs. To do so, we develop a divide-and-conquer algorithm, Protein Engineering Pareto FRontier (PEPFR), that hierarchically subdivides the objective space, using appropriate dynamic programming or integer programming methods to optimize designs in different regions. This divide-and-conquer approach is efficient in that the number of divisions (and thus calls to the optimizer) is directly proportional to the number of Pareto optimal designs. We demonstrate PEPFR with three protein engineering case studies: site-directed recombination for stability and diversity via dynamic programming, site-directed mutagenesis of interacting proteins for affinity and specificity via integer programming, and site-directed mutagenesis of a therapeutic protein for activity and immunogenicity via integer programming. We show that PEPFR is able to effectively produce all the Pareto optimal designs, discovering many more designs than previous methods. The characterization of the Pareto frontier provides additional insights into the local stability of design choices as well as global trends leading to trade-offs between competing criteria. Copyright © 2011 Wiley Periodicals, Inc.
NASA Astrophysics Data System (ADS)
Diaz, K. S.; Kim, E. H.; Jones, R. M.; de Leon, K. C.; Woodcroft, B. J.; Tyson, G. W.; Rich, V. I.
2014-12-01
The growing field of metaproteomics links microbial communities to their expressed functions by using mass spectrometry methods to characterize community proteins. Comparison of mass spectrometry protein search algorithms and their biases is crucial for maximizing the quality and amount of protein identifications in mass spectral data. Available algorithms employ different approaches when mapping mass spectra to peptides against a database. We compared mass spectra from four microbial proteomes derived from high-organic content soils searched with two search algorithms: 1) Sequest HT as packaged within Proteome Discoverer (v.1.4) and 2) X!Tandem as packaged in TransProteomicPipeline (v.4.7.1). Searches used matched metagenomes, and results were filtered to allow identification of high probability proteins. There was little overlap in proteins identified by both algorithms, on average just ~24% of the total. However, when adjusted for spectral abundance, the overlap improved to ~70%. Proteome Discoverer generally outperformed X!Tandem, identifying an average of 12.5% more proteins than X!Tandem, with X!Tandem identifying more proteins only in the first two proteomes. For spectrally-adjusted results, the algorithms were similar, with X!Tandem marginally outperforming Proteome Discoverer by an average of ~4%. We then assessed differences in heat shock proteins (HSP) identification by the two algorithms by BLASTing identified proteins against the Heat Shock Protein Information Resource, because HSP hits typically account for the majority signal in proteomes, due to extraction protocols. Total HSP identifications for each of the 4 proteomes were approximately ~15%, ~11%, ~17%, and ~19%, with ~14% for total HSPs with redundancies removed. Of the ~15% average of proteins from the 4 proteomes identified as HSPs, ~10% of proteins and spectra were identified by both algorithms. On average, Proteome Discoverer identified ~9% more HSPs than X!Tandem.
Identifying protein complexes based on brainstorming strategy.
Shen, Xianjun; Zhou, Jin; Yi, Li; Hu, Xiaohua; He, Tingting; Yang, Jincai
2016-11-01
Protein complexes comprising of interacting proteins in protein-protein interaction network (PPI network) play a central role in driving biological processes within cells. Recently, more and more swarm intelligence based algorithms to detect protein complexes have been emerging, which have become the research hotspot in proteomics field. In this paper, we propose a novel algorithm for identifying protein complexes based on brainstorming strategy (IPC-BSS), which is integrated into the main idea of swarm intelligence optimization and the improved K-means algorithm. Distance between the nodes in PPI network is defined by combining the network topology and gene ontology (GO) information. Inspired by human brainstorming process, IPC-BSS algorithm firstly selects the clustering center nodes, and then they are separately consolidated with the other nodes with short distance to form initial clusters. Finally, we put forward two ways of updating the initial clusters to search optimal results. Experimental results show that our IPC-BSS algorithm outperforms the other classic algorithms on yeast and human PPI networks, and it obtains many predicted protein complexes with biological significance. Copyright © 2016 Elsevier Inc. All rights reserved.
Guaranteed Discrete Energy Optimization on Large Protein Design Problems.
Simoncini, David; Allouche, David; de Givry, Simon; Delmas, Céline; Barbe, Sophie; Schiex, Thomas
2015-12-08
In Computational Protein Design (CPD), assuming a rigid backbone and amino-acid rotamer library, the problem of finding a sequence with an optimal conformation is NP-hard. In this paper, using Dunbrack's rotamer library and Talaris2014 decomposable energy function, we use an exact deterministic method combining branch and bound, arc consistency, and tree-decomposition to provenly identify the global minimum energy sequence-conformation on full-redesign problems, defining search spaces of size up to 10(234). This is achieved on a single core of a standard computing server, requiring a maximum of 66GB RAM. A variant of the algorithm is able to exhaustively enumerate all sequence-conformations within an energy threshold of the optimum. These proven optimal solutions are then used to evaluate the frequencies and amplitudes, in energy and sequence, at which an existing CPD-dedicated simulated annealing implementation may miss the optimum on these full redesign problems. The probability of finding an optimum drops close to 0 very quickly. In the worst case, despite 1,000 repeats, the annealing algorithm remained more than 1 Rosetta unit away from the optimum, leading to design sequences that could differ from the optimal sequence by more than 30% of their amino acids.
A General, Adaptive, Roadmap-Based Algorithm for Protein Motion Computation.
Molloy, Kevin; Shehu, Amarda
2016-03-01
Precious information on protein function can be extracted from a detailed characterization of protein equilibrium dynamics. This remains elusive in wet and dry laboratories, as function-modulating transitions of a protein between functionally-relevant, thermodynamically-stable and meta-stable structural states often span disparate time scales. In this paper we propose a novel, robotics-inspired algorithm that circumvents time-scale challenges by drawing analogies between protein motion and robot motion. The algorithm adapts the popular roadmap-based framework in robot motion computation to handle the more complex protein conformation space and its underlying rugged energy surface. Given known structures representing stable and meta-stable states of a protein, the algorithm yields a time- and energy-prioritized list of transition paths between the structures, with each path represented as a series of conformations. The algorithm balances computational resources between a global search aimed at obtaining a global view of the network of protein conformations and their connectivity and a detailed local search focused on realizing such connections with physically-realistic models. Promising results are presented on a variety of proteins that demonstrate the general utility of the algorithm and its capability to improve the state of the art without employing system-specific insight.
Kurkcuoglu, Zeynep; Doruker, Pemra
2016-01-01
Incorporating receptor flexibility in small ligand-protein docking still poses a challenge for proteins undergoing large conformational changes. In the absence of bound structures, sampling conformers that are accessible by apo state may facilitate docking and drug design studies. For this aim, we developed an unbiased conformational search algorithm, by integrating global modes from elastic network model, clustering and energy minimization with implicit solvation. Our dataset consists of five diverse proteins with apo to complex RMSDs 4.7–15 Å. Applying this iterative algorithm on apo structures, conformers close to the bound-state (RMSD 1.4–3.8 Å), as well as the intermediate states were generated. Dockings to a sequence of conformers consisting of a closed structure and its “parents” up to the apo were performed to compare binding poses on different states of the receptor. For two periplasmic binding proteins and biotin carboxylase that exhibit hinge-type closure of two dynamics domains, the best pose was obtained for the conformer closest to the bound structure (ligand RMSDs 1.5–2 Å). In contrast, the best pose for adenylate kinase corresponded to an intermediate state with partially closed LID domain and open NMP domain, in line with recent studies (ligand RMSD 2.9 Å). The docking of a helical peptide to calmodulin was the most challenging case due to the complexity of its 15 Å transition, for which a two-stage procedure was necessary. The technique was first applied on the extended calmodulin to generate intermediate conformers; then peptide docking and a second generation stage on the complex were performed, which in turn yielded a final peptide RMSD of 2.9 Å. Our algorithm is effective in producing conformational states based on the apo state. This study underlines the importance of such intermediate states for ligand docking to proteins undergoing large transitions. PMID:27348230
Prior knowledge based mining functional modules from Yeast PPI networks with gene ontology
2010-01-01
Background In the literature, there are fruitful algorithmic approaches for identification functional modules in protein-protein interactions (PPI) networks. Because of accumulation of large-scale interaction data on multiple organisms and non-recording interaction data in the existing PPI database, it is still emergent to design novel computational techniques that can be able to correctly and scalably analyze interaction data sets. Indeed there are a number of large scale biological data sets providing indirect evidence for protein-protein interaction relationships. Results The main aim of this paper is to present a prior knowledge based mining strategy to identify functional modules from PPI networks with the aid of Gene Ontology. Higher similarity value in Gene Ontology means that two gene products are more functionally related to each other, so it is better to group such gene products into one functional module. We study (i) to encode the functional pairs into the existing PPI networks; and (ii) to use these functional pairs as pairwise constraints to supervise the existing functional module identification algorithms. Topology-based modularity metric and complex annotation in MIPs will be used to evaluate the identified functional modules by these two approaches. Conclusions The experimental results on Yeast PPI networks and GO have shown that the prior knowledge based learning methods perform better than the existing algorithms. PMID:21172053
A simple and fast heuristic for protein structure comparison.
Pelta, David A; González, Juan R; Moreno Vega, Marcos
2008-03-25
Protein structure comparison is a key problem in bioinformatics. There exist several methods for doing protein comparison, being the solution of the Maximum Contact Map Overlap problem (MAX-CMO) one of the alternatives available. Although this problem may be solved using exact algorithms, researchers require approximate algorithms that obtain good quality solutions using less computational resources than the formers. We propose a variable neighborhood search metaheuristic for solving MAX-CMO. We analyze this strategy in two aspects: 1) from an optimization point of view the strategy is tested on two different datasets, obtaining an error of 3.5%(over 2702 pairs) and 1.7% (over 161 pairs) with respect to optimal values; thus leading to high accurate solutions in a simpler and less expensive way than exact algorithms; 2) in terms of protein structure classification, we conduct experiments on three datasets and show that is feasible to detect structural similarities at SCOP's family and CATH's architecture levels using normalized overlap values. Some limitations and the role of normalization are outlined for doing classification at SCOP's fold level. We designed, implemented and tested.a new tool for solving MAX-CMO, based on a well-known metaheuristic technique. The good balance between solution's quality and computational effort makes it a valuable tool. Moreover, to the best of our knowledge, this is the first time the MAX-CMO measure is tested at SCOP's fold and CATH's architecture levels with encouraging results.
Semiempirical prediction of protein folds
NASA Astrophysics Data System (ADS)
Fernández, Ariel; Colubri, Andrés; Appignanesi, Gustavo
2001-08-01
We introduce a semiempirical approach to predict ab initio expeditious pathways and native backbone geometries of proteins that fold under in vitro renaturation conditions. The algorithm is engineered to incorporate a discrete codification of local steric hindrances that constrain the movements of the peptide backbone throughout the folding process. Thus, the torsional state of the chain is assumed to be conditioned by the fact that hopping from one basin of attraction to another in the Ramachandran map (local potential energy surface) of each residue is energetically more costly than the search for a specific (Φ, Ψ) torsional state within a single basin. A combinatorial procedure is introduced to evaluate coarsely defined torsional states of the chain defined ``modulo basins'' and translate them into meaningful patterns of long range interactions. Thus, an algorithm for structure prediction is designed based on the fact that local contributions to the potential energy may be subsumed into time-evolving conformational constraints defining sets of restricted backbone geometries whereupon the patterns of nonbonded interactions are constructed. The predictive power of the algorithm is assessed by (a) computing ab initio folding pathways for mammalian ubiquitin that ultimately yield a stable structural pattern reproducing all of its native features, (b) determining the nucleating event that triggers the hydrophobic collapse of the chain, and (c) comparing coarse predictions of the stable folds of moderately large proteins (N~100) with structural information extracted from the protein data bank.
GPU-based cloud service for Smith-Waterman algorithm using frequency distance filtration scheme.
Lee, Sheng-Ta; Lin, Chun-Yuan; Hung, Che Lun
2013-01-01
As the conventional means of analyzing the similarity between a query sequence and database sequences, the Smith-Waterman algorithm is feasible for a database search owing to its high sensitivity. However, this algorithm is still quite time consuming. CUDA programming can improve computations efficiently by using the computational power of massive computing hardware as graphics processing units (GPUs). This work presents a novel Smith-Waterman algorithm with a frequency-based filtration method on GPUs rather than merely accelerating the comparisons yet expending computational resources to handle such unnecessary comparisons. A user friendly interface is also designed for potential cloud server applications with GPUs. Additionally, two data sets, H1N1 protein sequences (query sequence set) and human protein database (database set), are selected, followed by a comparison of CUDA-SW and CUDA-SW with the filtration method, referred to herein as CUDA-SWf. Experimental results indicate that reducing unnecessary sequence alignments can improve the computational time by up to 41%. Importantly, by using CUDA-SWf as a cloud service, this application can be accessed from any computing environment of a device with an Internet connection without time constraints.
Detection of protein complex from protein-protein interaction network using Markov clustering
NASA Astrophysics Data System (ADS)
Ochieng, P. J.; Kusuma, W. A.; Haryanto, T.
2017-05-01
Detection of complexes, or groups of functionally related proteins, is an important challenge while analysing biological networks. However, existing algorithms to identify protein complexes are insufficient when applied to dense networks of experimentally derived interaction data. Therefore, we introduced a graph clustering method based on Markov clustering algorithm to identify protein complex within highly interconnected protein-protein interaction networks. Protein-protein interaction network was first constructed to develop geometrical network, the network was then partitioned using Markov clustering to detect protein complexes. The interest of the proposed method was illustrated by its application to Human Proteins associated to type II diabetes mellitus. Flow simulation of MCL algorithm was initially performed and topological properties of the resultant network were analysed for detection of the protein complex. The results indicated the proposed method successfully detect an overall of 34 complexes with 11 complexes consisting of overlapping modules and 20 non-overlapping modules. The major complex consisted of 102 proteins and 521 interactions with cluster modularity and density of 0.745 and 0.101 respectively. The comparison analysis revealed MCL out perform AP, MCODE and SCPS algorithms with high clustering coefficient (0.751) network density and modularity index (0.630). This demonstrated MCL was the most reliable and efficient graph clustering algorithm for detection of protein complexes from PPI networks.
Abramyan, Tigran M; Snyder, James A; Thyparambil, Aby A; Stuart, Steven J; Latour, Robert A
2016-08-05
Clustering methods have been widely used to group together similar conformational states from molecular simulations of biomolecules in solution. For applications such as the interaction of a protein with a surface, the orientation of the protein relative to the surface is also an important clustering parameter because of its potential effect on adsorbed-state bioactivity. This study presents cluster analysis methods that are specifically designed for systems where both molecular orientation and conformation are important, and the methods are demonstrated using test cases of adsorbed proteins for validation. Additionally, because cluster analysis can be a very subjective process, an objective procedure for identifying both the optimal number of clusters and the best clustering algorithm to be applied to analyze a given dataset is presented. The method is demonstrated for several agglomerative hierarchical clustering algorithms used in conjunction with three cluster validation techniques. © 2016 Wiley Periodicals, Inc. © 2016 Wiley Periodicals, Inc.
A de novo redesign of the WW domain
Kraemer-Pecore, Christina M.; Lecomte, Juliette T.J.; Desjarlais, John R.
2003-01-01
We have used a sequence prediction algorithm and a novel sampling method to design protein sequences for the WW domain, a small β-sheet motif. The procedure, referred to as SPANS, designs sequences to be compatible with an ensemble of closely related polypeptide backbones, mimicking the inherent flexibility of proteins. Two designed sequences (termed SPANS-WW1 and SPANS-WW2), using only naturally occurring l-amino acids, were selected for study and the corresponding polypeptides were prepared in Escherichia coli. Circular dichroism data suggested that both purified polypeptides adopted secondary structure features related to those of the target without the aid of disulfide bridges or bound cofactors. The structure exhibited by SPANS-WW2 melted cooperatively by raising the temperature of the solution. Further analysis of this polypeptide by proton nuclear magnetic resonance spectroscopy demonstrated that at 5°C, it folds into a structure closely resembling a natural WW domain. This achievement constitutes one of a small number of successful de novo protein designs through fully automated computational methods and highlights the feasibility of including backbone flexibility in the design strategy. PMID:14500877
A de novo redesign of the WW domain.
Kraemer-Pecore, Christina M; Lecomte, Juliette T J; Desjarlais, John R
2003-10-01
We have used a sequence prediction algorithm and a novel sampling method to design protein sequences for the WW domain, a small beta-sheet motif. The procedure, referred to as SPANS, designs sequences to be compatible with an ensemble of closely related polypeptide backbones, mimicking the inherent flexibility of proteins. Two designed sequences (termed SPANS-WW1 and SPANS-WW2), using only naturally occurring L-amino acids, were selected for study and the corresponding polypeptides were prepared in Escherichia coli. Circular dichroism data suggested that both purified polypeptides adopted secondary structure features related to those of the target without the aid of disulfide bridges or bound cofactors. The structure exhibited by SPANS-WW2 melted cooperatively by raising the temperature of the solution. Further analysis of this polypeptide by proton nuclear magnetic resonance spectroscopy demonstrated that at 5 degrees C, it folds into a structure closely resembling a natural WW domain. This achievement constitutes one of a small number of successful de novo protein designs through fully automated computational methods and highlights the feasibility of including backbone flexibility in the design strategy.
BayesMotif: de novo protein sorting motif discovery from impure datasets.
Hu, Jianjun; Zhang, Fan
2010-01-18
Protein sorting is the process that newly synthesized proteins are transported to their target locations within or outside of the cell. This process is precisely regulated by protein sorting signals in different forms. A major category of sorting signals are amino acid sub-sequences usually located at the N-terminals or C-terminals of protein sequences. Genome-wide experimental identification of protein sorting signals is extremely time-consuming and costly. Effective computational algorithms for de novo discovery of protein sorting signals is needed to improve the understanding of protein sorting mechanisms. We formulated the protein sorting motif discovery problem as a classification problem and proposed a Bayesian classifier based algorithm (BayesMotif) for de novo identification of a common type of protein sorting motifs in which a highly conserved anchor is present along with a less conserved motif regions. A false positive removal procedure is developed to iteratively remove sequences that are unlikely to contain true motifs so that the algorithm can identify motifs from impure input sequences. Experiments on both implanted motif datasets and real-world datasets showed that the enhanced BayesMotif algorithm can identify anchored sorting motifs from pure or impure protein sequence dataset. It also shows that the false positive removal procedure can help to identify true motifs even when there is only 20% of the input sequences containing true motif instances. We proposed BayesMotif, a novel Bayesian classification based algorithm for de novo discovery of a special category of anchored protein sorting motifs from impure datasets. Compared to conventional motif discovery algorithms such as MEME, our algorithm can find less-conserved motifs with short highly conserved anchors. Our algorithm also has the advantage of easy incorporation of additional meta-sequence features such as hydrophobicity or charge of the motifs which may help to overcome the limitations of PWM (position weight matrix) motif model.
A high level interface to SCOP and ASTRAL implemented in python.
Casbon, James A; Crooks, Gavin E; Saqi, Mansoor A S
2006-01-10
Benchmarking algorithms in structural bioinformatics often involves the construction of datasets of proteins with given sequence and structural properties. The SCOP database is a manually curated structural classification which groups together proteins on the basis of structural similarity. The ASTRAL compendium provides non redundant subsets of SCOP domains on the basis of sequence similarity such that no two domains in a given subset share more than a defined degree of sequence similarity. Taken together these two resources provide a 'ground truth' for assessing structural bioinformatics algorithms. We present a small and easy to use API written in python to enable construction of datasets from these resources. We have designed a set of python modules to provide an abstraction of the SCOP and ASTRAL databases. The modules are designed to work as part of the Biopython distribution. Python users can now manipulate and use the SCOP hierarchy from within python programs, and use ASTRAL to return sequences of domains in SCOP, as well as clustered representations of SCOP from ASTRAL. The modules make the analysis and generation of datasets for use in structural genomics easier and more principled.
A Feature and Algorithm Selection Method for Improving the Prediction of Protein Structural Class.
Ni, Qianwu; Chen, Lei
2017-01-01
Correct prediction of protein structural class is beneficial to investigation on protein functions, regulations and interactions. In recent years, several computational methods have been proposed in this regard. However, based on various features, it is still a great challenge to select proper classification algorithm and extract essential features to participate in classification. In this study, a feature and algorithm selection method was presented for improving the accuracy of protein structural class prediction. The amino acid compositions and physiochemical features were adopted to represent features and thirty-eight machine learning algorithms collected in Weka were employed. All features were first analyzed by a feature selection method, minimum redundancy maximum relevance (mRMR), producing a feature list. Then, several feature sets were constructed by adding features in the list one by one. For each feature set, thirtyeight algorithms were executed on a dataset, in which proteins were represented by features in the set. The predicted classes yielded by these algorithms and true class of each protein were collected to construct a dataset, which were analyzed by mRMR method, yielding an algorithm list. From the algorithm list, the algorithm was taken one by one to build an ensemble prediction model. Finally, we selected the ensemble prediction model with the best performance as the optimal ensemble prediction model. Experimental results indicate that the constructed model is much superior to models using single algorithm and other models that only adopt feature selection procedure or algorithm selection procedure. The feature selection procedure or algorithm selection procedure are really helpful for building an ensemble prediction model that can yield a better performance. Copyright© Bentham Science Publishers; For any queries, please email at epub@benthamscience.org.
De novo design of the hydrophobic core of ubiquitin.
Lazar, G. A.; Desjarlais, J. R.; Handel, T. M.
1997-01-01
We have previously reported the development and evaluation of a computational program to assist in the design of hydrophobic cores of proteins. In an effort to investigate the role of core packing in protein structure, we have used this program, referred to as Repacking of Cores (ROC), to design several variants of the protein ubiquitin. Nine ubiquitin variants containing from three to eight hydrophobic core mutations were constructed, purified, and characterized in terms of their stability and their ability to adopt a uniquely folded native-like conformation. In general, designed ubiquitin variants are more stable than control variants in which the hydrophobic core was chosen randomly. However, in contrast to previous results with 434 cro, all designs are destabilized relative to the wild-type (WT) protein. This raises the possibility that beta-sheet structures have more stringent packing requirements than alpha-helical proteins. A more striking observation is that all variants, including random controls, adopt fairly well-defined conformations, regardless of their stability. This result supports conclusions from the cro studies that non-core residues contribute significantly to the conformational uniqueness of these proteins while core packing largely affects protein stability and has less impact on the nature or uniqueness of the fold. Concurrent with the above work, we used stability data on the nine ubiquitin variants to evaluate and improve the predictive ability of our core packing algorithm. Additional versions of the program were generated that differ in potential function parameters and sampling of side chain conformers. Reasonable correlations between experimental and predicted stabilities suggest the program will be useful in future studies to design variants with stabilities closer to that of the native protein. Taken together, the present study provides further clarification of the role of specific packing interactions in protein structure and stability, and demonstrates the benefit of using systematic computational methods to predict core packing arrangements for the design of proteins. PMID:9194177
Zhao, Xiaowei; Ning, Qiao; Chai, Haiting; Ma, Zhiqiang
2015-06-07
As a widespread type of protein post-translational modifications (PTMs), succinylation plays an important role in regulating protein conformation, function and physicochemical properties. Compared with the labor-intensive and time-consuming experimental approaches, computational predictions of succinylation sites are much desirable due to their convenient and fast speed. Currently, numerous computational models have been developed to identify PTMs sites through various types of two-class machine learning algorithms. These methods require both positive and negative samples for training. However, designation of the negative samples of PTMs was difficult and if it is not properly done can affect the performance of computational models dramatically. So that in this work, we implemented the first application of positive samples only learning (PSoL) algorithm to succinylation sites prediction problem, which was a special class of semi-supervised machine learning that used positive samples and unlabeled samples to train the model. Meanwhile, we proposed a novel succinylation sites computational predictor called SucPred (succinylation site predictor) by using multiple feature encoding schemes. Promising results were obtained by the SucPred predictor with an accuracy of 88.65% using 5-fold cross validation on the training dataset and an accuracy of 84.40% on the independent testing dataset, which demonstrated that the positive samples only learning algorithm presented here was particularly useful for identification of protein succinylation sites. Besides, the positive samples only learning algorithm can be applied to build predictors for other types of PTMs sites with ease. A web server for predicting succinylation sites was developed and was freely accessible at http://59.73.198.144:8088/SucPred/. Copyright © 2015 Elsevier Ltd. All rights reserved.
Le, Duc-Hau
2015-01-01
Protein complexes formed by non-covalent interaction among proteins play important roles in cellular functions. Computational and purification methods have been used to identify many protein complexes and their cellular functions. However, their roles in terms of causing disease have not been well discovered yet. There exist only a few studies for the identification of disease-associated protein complexes. However, they mostly utilize complicated heterogeneous networks which are constructed based on an out-of-date database of phenotype similarity network collected from literature. In addition, they only apply for diseases for which tissue-specific data exist. In this study, we propose a method to identify novel disease-protein complex associations. First, we introduce a framework to construct functional similarity protein complex networks where two protein complexes are functionally connected by either shared protein elements, shared annotating GO terms or based on protein interactions between elements in each protein complex. Second, we propose a simple but effective neighborhood-based algorithm, which yields a local similarity measure, to rank disease candidate protein complexes. Comparing the predictive performance of our proposed algorithm with that of two state-of-the-art network propagation algorithms including one we used in our previous study, we found that it performed statistically significantly better than that of these two algorithms for all the constructed functional similarity protein complex networks. In addition, it ran about 32 times faster than these two algorithms. Moreover, our proposed method always achieved high performance in terms of AUC values irrespective of the ways to construct the functional similarity protein complex networks and the used algorithms. The performance of our method was also higher than that reported in some existing methods which were based on complicated heterogeneous networks. Finally, we also tested our method with prostate cancer and selected the top 100 highly ranked candidate protein complexes. Interestingly, 69 of them were evidenced since at least one of their protein elements are known to be associated with prostate cancer. Our proposed method, including the framework to construct functional similarity protein complex networks and the neighborhood-based algorithm on these networks, could be used for identification of novel disease-protein complex associations.
Rationally reduced libraries for combinatorial pathway optimization minimizing experimental effort.
Jeschek, Markus; Gerngross, Daniel; Panke, Sven
2016-03-31
Rational flux design in metabolic engineering approaches remains difficult since important pathway information is frequently not available. Therefore empirical methods are applied that randomly change absolute and relative pathway enzyme levels and subsequently screen for variants with improved performance. However, screening is often limited on the analytical side, generating a strong incentive to construct small but smart libraries. Here we introduce RedLibs (Reduced Libraries), an algorithm that allows for the rational design of smart combinatorial libraries for pathway optimization thereby minimizing the use of experimental resources. We demonstrate the utility of RedLibs for the design of ribosome-binding site libraries by in silico and in vivo screening with fluorescent proteins and perform a simple two-step optimization of the product selectivity in the branched multistep pathway for violacein biosynthesis, indicating a general applicability for the algorithm and the proposed heuristics. We expect that RedLibs will substantially simplify the refactoring of synthetic metabolic pathways.
Audain, Enrique; Uszkoreit, Julian; Sachsenberg, Timo; Pfeuffer, Julianus; Liang, Xiao; Hermjakob, Henning; Sanchez, Aniel; Eisenacher, Martin; Reinert, Knut; Tabb, David L; Kohlbacher, Oliver; Perez-Riverol, Yasset
2017-01-06
In mass spectrometry-based shotgun proteomics, protein identifications are usually the desired result. However, most of the analytical methods are based on the identification of reliable peptides and not the direct identification of intact proteins. Thus, assembling peptides identified from tandem mass spectra into a list of proteins, referred to as protein inference, is a critical step in proteomics research. Currently, different protein inference algorithms and tools are available for the proteomics community. Here, we evaluated five software tools for protein inference (PIA, ProteinProphet, Fido, ProteinLP, MSBayesPro) using three popular database search engines: Mascot, X!Tandem, and MS-GF+. All the algorithms were evaluated using a highly customizable KNIME workflow using four different public datasets with varying complexities (different sample preparation, species and analytical instruments). We defined a set of quality control metrics to evaluate the performance of each combination of search engines, protein inference algorithm, and parameters on each dataset. We show that the results for complex samples vary not only regarding the actual numbers of reported protein groups but also concerning the actual composition of groups. Furthermore, the robustness of reported proteins when using databases of differing complexities is strongly dependant on the applied inference algorithm. Finally, merging the identifications of multiple search engines does not necessarily increase the number of reported proteins, but does increase the number of peptides per protein and thus can generally be recommended. Protein inference is one of the major challenges in MS-based proteomics nowadays. Currently, there are a vast number of protein inference algorithms and implementations available for the proteomics community. Protein assembly impacts in the final results of the research, the quantitation values and the final claims in the research manuscript. Even though protein inference is a crucial step in proteomics data analysis, a comprehensive evaluation of the many different inference methods has never been performed. Previously Journal of proteomics has published multiple studies about other benchmark of bioinformatics algorithms (PMID: 26585461; PMID: 22728601) in proteomics studies making clear the importance of those studies for the proteomics community and the journal audience. This manuscript presents a new bioinformatics solution based on the KNIME/OpenMS platform that aims at providing a fair comparison of protein inference algorithms (https://github.com/KNIME-OMICS). Six different algorithms - ProteinProphet, MSBayesPro, ProteinLP, Fido and PIA- were evaluated using the highly customizable workflow on four public datasets with varying complexities. Five popular database search engines Mascot, X!Tandem, MS-GF+ and combinations thereof were evaluated for every protein inference tool. In total >186 proteins lists were analyzed and carefully compare using three metrics for quality assessments of the protein inference results: 1) the numbers of reported proteins, 2) peptides per protein, and the 3) number of uniquely reported proteins per inference method, to address the quality of each inference method. We also examined how many proteins were reported by choosing each combination of search engines, protein inference algorithms and parameters on each dataset. The results show that using 1) PIA or Fido seems to be a good choice when studying the results of the analyzed workflow, regarding not only the reported proteins and the high-quality identifications, but also the required runtime. 2) Merging the identifications of multiple search engines gives almost always more confident results and increases the number of peptides per protein group. 3) The usage of databases containing not only the canonical, but also known isoforms of proteins has a small impact on the number of reported proteins. The detection of specific isoforms could, concerning the question behind the study, compensate for slightly shorter reports using the parsimonious reports. 4) The current workflow can be easily extended to support new algorithms and search engine combinations. Copyright © 2016. Published by Elsevier B.V.
Huang, Chien-Hung; Peng, Huai-Shun; Ng, Ka-Lok
2015-01-01
Many proteins are known to be associated with cancer diseases. It is quite often that their precise functional role in disease pathogenesis remains unclear. A strategy to gain a better understanding of the function of these proteins is to make use of a combination of different aspects of proteomics data types. In this study, we extended Aragues's method by employing the protein-protein interaction (PPI) data, domain-domain interaction (DDI) data, weighted domain frequency score (DFS), and cancer linker degree (CLD) data to predict cancer proteins. Performances were benchmarked based on three kinds of experiments as follows: (I) using individual algorithm, (II) combining algorithms, and (III) combining the same classification types of algorithms. When compared with Aragues's method, our proposed methods, that is, machine learning algorithm and voting with the majority, are significantly superior in all seven performance measures. We demonstrated the accuracy of the proposed method on two independent datasets. The best algorithm can achieve a hit ratio of 89.4% and 72.8% for lung cancer dataset and lung cancer microarray study, respectively. It is anticipated that the current research could help understand disease mechanisms and diagnosis.
2015-01-01
Many proteins are known to be associated with cancer diseases. It is quite often that their precise functional role in disease pathogenesis remains unclear. A strategy to gain a better understanding of the function of these proteins is to make use of a combination of different aspects of proteomics data types. In this study, we extended Aragues's method by employing the protein-protein interaction (PPI) data, domain-domain interaction (DDI) data, weighted domain frequency score (DFS), and cancer linker degree (CLD) data to predict cancer proteins. Performances were benchmarked based on three kinds of experiments as follows: (I) using individual algorithm, (II) combining algorithms, and (III) combining the same classification types of algorithms. When compared with Aragues's method, our proposed methods, that is, machine learning algorithm and voting with the majority, are significantly superior in all seven performance measures. We demonstrated the accuracy of the proposed method on two independent datasets. The best algorithm can achieve a hit ratio of 89.4% and 72.8% for lung cancer dataset and lung cancer microarray study, respectively. It is anticipated that the current research could help understand disease mechanisms and diagnosis. PMID:25866773
Exploration of the relationship between topology and designability of conformations
NASA Astrophysics Data System (ADS)
Leelananda, Sumudu P.; Towfic, Fadi; Jernigan, Robert L.; Kloczkowski, Andrzej
2011-06-01
Protein structures are evolutionarily more conserved than sequences, and sequences with very low sequence identity frequently share the same fold. This leads to the concept of protein designability. Some folds are more designable and lots of sequences can assume that fold. Elucidating the relationship between protein sequence and the three-dimensional (3D) structure that the sequence folds into is an important problem in computational structural biology. Lattice models have been utilized in numerous studies to model protein folds and predict the designability of certain folds. In this study, all possible compact conformations within a set of two-dimensional and 3D lattice spaces are explored. Complementary interaction graphs are then generated for each conformation and are described using a set of graph features. The full HP sequence space for each lattice model is generated and contact energies are calculated by threading each sequence onto all the possible conformations. Unique conformation giving minimum energy is identified for each sequence and the number of sequences folding to each conformation (designability) is obtained. Machine learning algorithms are used to predict the designability of each conformation. We find that the highly designable structures can be distinguished from other non-designable conformations based on certain graphical geometric features of the interactions. This finding confirms the fact that the topology of a conformation is an important determinant of the extent of its designability and suggests that the interactions themselves are important for determining the designability.
A collaborative filtering approach for protein-protein docking scoring functions.
Bourquard, Thomas; Bernauer, Julie; Azé, Jérôme; Poupon, Anne
2011-04-22
A protein-protein docking procedure traditionally consists in two successive tasks: a search algorithm generates a large number of candidate conformations mimicking the complex existing in vivo between two proteins, and a scoring function is used to rank them in order to extract a native-like one. We have already shown that using Voronoi constructions and a well chosen set of parameters, an accurate scoring function could be designed and optimized. However to be able to perform large-scale in silico exploration of the interactome, a near-native solution has to be found in the ten best-ranked solutions. This cannot yet be guaranteed by any of the existing scoring functions. In this work, we introduce a new procedure for conformation ranking. We previously developed a set of scoring functions where learning was performed using a genetic algorithm. These functions were used to assign a rank to each possible conformation. We now have a refined rank using different classifiers (decision trees, rules and support vector machines) in a collaborative filtering scheme. The scoring function newly obtained is evaluated using 10 fold cross-validation, and compared to the functions obtained using either genetic algorithms or collaborative filtering taken separately. This new approach was successfully applied to the CAPRI scoring ensembles. We show that for 10 targets out of 12, we are able to find a near-native conformation in the 10 best ranked solutions. Moreover, for 6 of them, the near-native conformation selected is of high accuracy. Finally, we show that this function dramatically enriches the 100 best-ranking conformations in near-native structures.
A Collaborative Filtering Approach for Protein-Protein Docking Scoring Functions
Bourquard, Thomas; Bernauer, Julie; Azé, Jérôme; Poupon, Anne
2011-01-01
A protein-protein docking procedure traditionally consists in two successive tasks: a search algorithm generates a large number of candidate conformations mimicking the complex existing in vivo between two proteins, and a scoring function is used to rank them in order to extract a native-like one. We have already shown that using Voronoi constructions and a well chosen set of parameters, an accurate scoring function could be designed and optimized. However to be able to perform large-scale in silico exploration of the interactome, a near-native solution has to be found in the ten best-ranked solutions. This cannot yet be guaranteed by any of the existing scoring functions. In this work, we introduce a new procedure for conformation ranking. We previously developed a set of scoring functions where learning was performed using a genetic algorithm. These functions were used to assign a rank to each possible conformation. We now have a refined rank using different classifiers (decision trees, rules and support vector machines) in a collaborative filtering scheme. The scoring function newly obtained is evaluated using 10 fold cross-validation, and compared to the functions obtained using either genetic algorithms or collaborative filtering taken separately. This new approach was successfully applied to the CAPRI scoring ensembles. We show that for 10 targets out of 12, we are able to find a near-native conformation in the 10 best ranked solutions. Moreover, for 6 of them, the near-native conformation selected is of high accuracy. Finally, we show that this function dramatically enriches the 100 best-ranking conformations in near-native structures. PMID:21526112
Brasil, Christiane Regina Soares; Delbem, Alexandre Claudio Botazzo; da Silva, Fernando Luís Barroso
2013-07-30
This article focuses on the development of an approach for ab initio protein structure prediction (PSP) without using any earlier knowledge from similar protein structures, as fragment-based statistics or inference of secondary structures. Such an approach is called purely ab initio prediction. The article shows that well-designed multiobjective evolutionary algorithms can predict relevant protein structures in a purely ab initio way. One challenge for purely ab initio PSP is the prediction of structures with β-sheets. To work with such proteins, this research has also developed procedures to efficiently estimate hydrogen bond and solvation contribution energies. Considering van der Waals, electrostatic, hydrogen bond, and solvation contribution energies, the PSP is a problem with four energetic terms to be minimized. Each interaction energy term can be considered an objective of an optimization method. Combinatorial problems with four objectives have been considered too complex for the available multiobjective optimization (MOO) methods. The proposed approach, called "Multiobjective evolutionary algorithms with many tables" (MEAMT), can efficiently deal with four objectives through the combination thereof, performing a more adequate sampling of the objective space. Therefore, this method can better map the promising regions in this space, predicting structures in a purely ab initio way. In other words, MEAMT is an efficient optimization method for MOO, which explores simultaneously the search space as well as the objective space. MEAMT can predict structures with one or two domains with RMSDs comparable to values obtained by recently developed ab initio methods (GAPFCG , I-PAES, and Quark) that use different levels of earlier knowledge. Copyright © 2013 Wiley Periodicals, Inc.
Redesigning the specificity of protein-DNA interactions with Rosetta.
Thyme, Summer; Baker, David
2014-01-01
Building protein tools that can selectively bind or cleave specific DNA sequences requires efficient technologies for modifying protein-DNA interactions. Computational design is one method for accomplishing this goal. In this chapter, we present the current state of protein-DNA interface design with the Rosetta macromolecular modeling program. The LAGLIDADG endonuclease family of DNA-cleaving enzymes, under study as potential gene therapy reagents, has been the main testing ground for these in silico protocols. At this time, the computational methods are most useful for designing endonuclease variants that can accommodate small numbers of target site substitutions. Attempts to engineer for more extensive interface changes will likely benefit from an approach that uses the computational design results in conjunction with a high-throughput directed evolution or screening procedure. The family of enzymes presents an engineering challenge because their interfaces are highly integrated and there is significant coordination between the binding and catalysis events. Future developments in the computational algorithms depend on experimental feedback to improve understanding and modeling of these complex enzymatic features. This chapter presents both the basic method of design that has been successfully used to modulate specificity and more advanced procedures that incorporate DNA flexibility and other properties that are likely necessary for reliable modeling of more extensive target site changes.
A simple and fast heuristic for protein structure comparison
Pelta, David A; González, Juan R; Moreno Vega, Marcos
2008-01-01
Background Protein structure comparison is a key problem in bioinformatics. There exist several methods for doing protein comparison, being the solution of the Maximum Contact Map Overlap problem (MAX-CMO) one of the alternatives available. Although this problem may be solved using exact algorithms, researchers require approximate algorithms that obtain good quality solutions using less computational resources than the formers. Results We propose a variable neighborhood search metaheuristic for solving MAX-CMO. We analyze this strategy in two aspects: 1) from an optimization point of view the strategy is tested on two different datasets, obtaining an error of 3.5%(over 2702 pairs) and 1.7% (over 161 pairs) with respect to optimal values; thus leading to high accurate solutions in a simpler and less expensive way than exact algorithms; 2) in terms of protein structure classification, we conduct experiments on three datasets and show that is feasible to detect structural similarities at SCOP's family and CATH's architecture levels using normalized overlap values. Some limitations and the role of normalization are outlined for doing classification at SCOP's fold level. Conclusion We designed, implemented and tested.a new tool for solving MAX-CMO, based on a well-known metaheuristic technique. The good balance between solution's quality and computational effort makes it a valuable tool. Moreover, to the best of our knowledge, this is the first time the MAX-CMO measure is tested at SCOP's fold and CATH's architecture levels with encouraging results. Software is available for download at . PMID:18366735
Parallel-SymD: A Parallel Approach to Detect Internal Symmetry in Protein Domains.
Jha, Ashwani; Flurchick, K M; Bikdash, Marwan; Kc, Dukka B
2016-01-01
Internally symmetric proteins are proteins that have a symmetrical structure in their monomeric single-chain form. Around 10-15% of the protein domains can be regarded as having some sort of internal symmetry. In this regard, we previously published SymD (symmetry detection), an algorithm that determines whether a given protein structure has internal symmetry by attempting to align the protein to its own copy after the copy is circularly permuted by all possible numbers of residues. SymD has proven to be a useful algorithm to detect symmetry. In this paper, we present a new parallelized algorithm called Parallel-SymD for detecting symmetry of proteins on clusters of computers. The achieved speedup of the new Parallel-SymD algorithm scales well with the number of computing processors. Scaling is better for proteins with a larger number of residues. For a protein of 509 residues, a speedup of 63 was achieved on a parallel system with 100 processors.
Parallel-SymD: A Parallel Approach to Detect Internal Symmetry in Protein Domains
Jha, Ashwani; Flurchick, K. M.; Bikdash, Marwan
2016-01-01
Internally symmetric proteins are proteins that have a symmetrical structure in their monomeric single-chain form. Around 10–15% of the protein domains can be regarded as having some sort of internal symmetry. In this regard, we previously published SymD (symmetry detection), an algorithm that determines whether a given protein structure has internal symmetry by attempting to align the protein to its own copy after the copy is circularly permuted by all possible numbers of residues. SymD has proven to be a useful algorithm to detect symmetry. In this paper, we present a new parallelized algorithm called Parallel-SymD for detecting symmetry of proteins on clusters of computers. The achieved speedup of the new Parallel-SymD algorithm scales well with the number of computing processors. Scaling is better for proteins with a larger number of residues. For a protein of 509 residues, a speedup of 63 was achieved on a parallel system with 100 processors. PMID:27747230
Smith, Colin A; Kortemme, Tanja
2011-01-01
Predicting the set of sequences that are tolerated by a protein or protein interface, while maintaining a desired function, is useful for characterizing protein interaction specificity and for computationally designing sequence libraries to engineer proteins with new functions. Here we provide a general method, a detailed set of protocols, and several benchmarks and analyses for estimating tolerated sequences using flexible backbone protein design implemented in the Rosetta molecular modeling software suite. The input to the method is at least one experimentally determined three-dimensional protein structure or high-quality model. The starting structure(s) are expanded or refined into a conformational ensemble using Monte Carlo simulations consisting of backrub backbone and side chain moves in Rosetta. The method then uses a combination of simulated annealing and genetic algorithm optimization methods to enrich for low-energy sequences for the individual members of the ensemble. To emphasize certain functional requirements (e.g. forming a binding interface), interactions between and within parts of the structure (e.g. domains) can be reweighted in the scoring function. Results from each backbone structure are merged together to create a single estimate for the tolerated sequence space. We provide an extensive description of the protocol and its parameters, all source code, example analysis scripts and three tests applying this method to finding sequences predicted to stabilize proteins or protein interfaces. The generality of this method makes many other applications possible, for example stabilizing interactions with small molecules, DNA, or RNA. Through the use of within-domain reweighting and/or multistate design, it may also be possible to use this method to find sequences that stabilize particular protein conformations or binding interactions over others.
Design of an allosterically modulated doxycycline and doxorubicin drug-binding protein.
Schmidt, Karin; Gardill, Bernd R; Kern, Alina; Kirchweger, Peter; Börsch, Michael; Muller, Yves A
2018-05-14
The allosteric interplay between distant functional sites present in a single protein provides for one of the most important regulatory mechanisms in biological systems. While the design of ligand-binding sites into proteins remains challenging, this holds even truer for the coupling of a newly engineered binding site to an allosteric mechanism that regulates the ligand affinity. Here it is shown how computational design algorithms enabled the introduction of doxycycline- and doxorubicin-binding sites into the serine proteinase inhibitor (serpin) family member α1-antichymotrypsin. Further engineering allowed exploitation of the proteinase-triggered serpin-typical S-to-R transition to modulate the ligand affinities. These design variants follow strategies observed in naturally occurring plasma globulins that allow for the targeted delivery of hormones in the blood. By analogy, we propose that the variants described in the present study could be further developed to allow for the delivery of the antibiotic doxycycline and the anticancer compound doxorubicin to tissues/locations that express specific proteinases, such as bacterial infection sites or tumor cells secreting matrix metalloproteinases.
Protein docking prediction using predicted protein-protein interface.
Li, Bin; Kihara, Daisuke
2012-01-10
Many important cellular processes are carried out by protein complexes. To provide physical pictures of interacting proteins, many computational protein-protein prediction methods have been developed in the past. However, it is still difficult to identify the correct docking complex structure within top ranks among alternative conformations. We present a novel protein docking algorithm that utilizes imperfect protein-protein binding interface prediction for guiding protein docking. Since the accuracy of protein binding site prediction varies depending on cases, the challenge is to develop a method which does not deteriorate but improves docking results by using a binding site prediction which may not be 100% accurate. The algorithm, named PI-LZerD (using Predicted Interface with Local 3D Zernike descriptor-based Docking algorithm), is based on a pair wise protein docking prediction algorithm, LZerD, which we have developed earlier. PI-LZerD starts from performing docking prediction using the provided protein-protein binding interface prediction as constraints, which is followed by the second round of docking with updated docking interface information to further improve docking conformation. Benchmark results on bound and unbound cases show that PI-LZerD consistently improves the docking prediction accuracy as compared with docking without using binding site prediction or using the binding site prediction as post-filtering. We have developed PI-LZerD, a pairwise docking algorithm, which uses imperfect protein-protein binding interface prediction to improve docking accuracy. PI-LZerD consistently showed better prediction accuracy over alternative methods in the series of benchmark experiments including docking using actual docking interface site predictions as well as unbound docking cases.
Accounting for receptor flexibility and enhanced sampling methods in computer-aided drug design.
Sinko, William; Lindert, Steffen; McCammon, J Andrew
2013-01-01
Protein flexibility plays a major role in biomolecular recognition. In many cases, it is not obvious how molecular structure will change upon association with other molecules. In proteins, these changes can be major, with large deviations in overall backbone structure, or they can be more subtle as in a side-chain rotation. Either way the algorithms that predict the favorability of biomolecular association require relatively accurate predictions of the bound structure to give an accurate assessment of the energy involved in association. Here, we review a number of techniques that have been proposed to accommodate receptor flexibility in the simulation of small molecules binding to protein receptors. We investigate modifications to standard rigid receptor docking algorithms and also explore enhanced sampling techniques, and the combination of free energy calculations and enhanced sampling techniques. The understanding and allowance for receptor flexibility are helping to make computer simulations of ligand protein binding more accurate. These developments may help improve the efficiency of drug discovery and development. Efficiency will be essential as we begin to see personalized medicine tailored to individual patients, which means specific drugs are needed for each patient's genetic makeup. © 2012 John Wiley & Sons A/S.
Biclustering Protein Complex Interactions with a Biclique FindingAlgorithm
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ding, Chris; Zhang, Anne Ya; Holbrook, Stephen
2006-12-01
Biclustering has many applications in text mining, web clickstream mining, and bioinformatics. When data entries are binary, the tightest biclusters become bicliques. We propose a flexible and highly efficient algorithm to compute bicliques. We first generalize the Motzkin-Straus formalism for computing the maximal clique from L{sub 1} constraint to L{sub p} constraint, which enables us to provide a generalized Motzkin-Straus formalism for computing maximal-edge bicliques. By adjusting parameters, the algorithm can favor biclusters with more rows less columns, or vice verse, thus increasing the flexibility of the targeted biclusters. We then propose an algorithm to solve the generalized Motzkin-Straus optimizationmore » problem. The algorithm is provably convergent and has a computational complexity of O(|E|) where |E| is the number of edges. It relies on a matrix vector multiplication and runs efficiently on most current computer architectures. Using this algorithm, we bicluster the yeast protein complex interaction network. We find that biclustering protein complexes at the protein level does not clearly reflect the functional linkage among protein complexes in many cases, while biclustering at protein domain level can reveal many underlying linkages. We show several new biologically significant results.« less
Theoretical and computational studies in protein folding, design, and function
NASA Astrophysics Data System (ADS)
Morrissey, Michael Patrick
2000-10-01
In this work, simplified statistical models are used to understand an array of processes related to protein folding and design. In Part I, lattice models are utilized to test several theories about the statistical properties of protein-like systems. In Part II, sequence analysis and all-atom simulations are used to advance a novel theory for the behavior of a particular protein. Part I is divided into five chapters. In Chapter 2, a method of sequence design for model proteins, based on statistical mechanical first-principles, is developed. The cumulant design method uses a mean-field approximation to expand the free energy of a sequence in temperature. The method successfully designs sequences which fold to a target lattice structure at a specific temperature, a feat which was not possible using previous design methods. The next three chapters are computational studies of the double mutant cycle, which has been used experimentally to predict intra-protein interactions. Complete structure prediction is demonstrated for a model system using exhaustive, and also sub-exhaustive, double mutants. Nonadditivity of enthalpy, rather than of free energy, is proposed and demonstrated to be a superior marker for inter-residue contact. Next, a new double mutant protocol, called exchange mutation, is introduced. Although simple statistical arguments predict exchange mutation to be a more accurate contact predictor than standard mutant cycles, this hypothesis was not upheld in lattice simulations. Reasons for this inconsistency will be discussed. Finally, a multi-chain folding algorithm is introduced. Known as LINKS, this algorithm was developed to test a method of structure prediction which utilizes chain-break mutants. While structure prediction was not successful, LINKS should nevertheless be a useful tool for the study of protein-protein and protein-ligand interactions. The last chapter of Part I utilizes the lattice to explore the differences between standard folding, from the fully denatured state, and cotranslational folding, whereby one end of a protein is synthesized and released before the other. Cotranslational folding is shown to accelerate folding kinetics, particularly when the target backbone contains many local contacts. Additionally, cotranslation is shown capable of "guiding" a model protein into a metastable, local contact-rich state, despite the existence of a true native state of much lower energy. In Part II, a model is developed for the behavior of PrP, a unique mammalian protein which has been shown to possess two native states. The pathogenic "scrapie" state PrPSc, which has not been structurally characterized, is known to trigger conversion of the characterized endogenous conformation PrPC into additional PrPSc, Residues 144--153 are shown to form the most hydrophilic naturally occurring alpha-helix, out of a broad database with more than 10,000 candidates. The novel beta-nucleation model proposes that PrPSc, is not a distinct mono-molecular state, but is rather a beta-sheet-like aggregate centered around helix-1 components of multiple PrP molecules. The remainder of Part II uses molecular dynamics simulations to support the beta-nucleation hypothesis, and to propose a system of peptide ligands which may arrest the process of prion propagation.
Application of ant colony Algorithm and particle swarm optimization in architectural design
NASA Astrophysics Data System (ADS)
Song, Ziyi; Wu, Yunfa; Song, Jianhua
2018-02-01
By studying the development of ant colony algorithm and particle swarm algorithm, this paper expounds the core idea of the algorithm, explores the combination of algorithm and architectural design, sums up the application rules of intelligent algorithm in architectural design, and combines the characteristics of the two algorithms, obtains the research route and realization way of intelligent algorithm in architecture design. To establish algorithm rules to assist architectural design. Taking intelligent algorithm as the beginning of architectural design research, the authors provide the theory foundation of ant colony Algorithm and particle swarm algorithm in architectural design, popularize the application range of intelligent algorithm in architectural design, and provide a new idea for the architects.
Investigation of Carbohydrate Recognition via Computer Simulation
DOE Office of Scientific and Technical Information (OSTI.GOV)
Johnson, Quentin R.; Lindsay, Richard J.; Petridis, Loukas
Carbohydrate recognition by proteins, such as lectins and other (bio)molecules, can be essential for many biological functions. Interest has arisen due to potential protein and drug design and future bioengineering applications. A quantitative measurement of carbohydrate-protein interaction is thus important for the full characterization of sugar recognition. Here, we focus on the aspect of utilizing computer simulations and biophysical models to evaluate the strength and specificity of carbohydrate recognition in this review. With increasing computational resources, better algorithms and refined modeling parameters, using state-of-the-art supercomputers to calculate the strength of the interaction between molecules has become increasingly mainstream. We reviewmore » the current state of this technique and its successful applications for studying protein-sugar interactions in recent years.« less
Investigation of Carbohydrate Recognition via Computer Simulation
Johnson, Quentin R.; Lindsay, Richard J.; Petridis, Loukas; ...
2015-04-28
Carbohydrate recognition by proteins, such as lectins and other (bio)molecules, can be essential for many biological functions. Interest has arisen due to potential protein and drug design and future bioengineering applications. A quantitative measurement of carbohydrate-protein interaction is thus important for the full characterization of sugar recognition. Here, we focus on the aspect of utilizing computer simulations and biophysical models to evaluate the strength and specificity of carbohydrate recognition in this review. With increasing computational resources, better algorithms and refined modeling parameters, using state-of-the-art supercomputers to calculate the strength of the interaction between molecules has become increasingly mainstream. We reviewmore » the current state of this technique and its successful applications for studying protein-sugar interactions in recent years.« less
Integrated Bio-Entity Network: A System for Biological Knowledge Discovery
Bell, Lindsey; Chowdhary, Rajesh; Liu, Jun S.; Niu, Xufeng; Zhang, Jinfeng
2011-01-01
A significant part of our biological knowledge is centered on relationships between biological entities (bio-entities) such as proteins, genes, small molecules, pathways, gene ontology (GO) terms and diseases. Accumulated at an increasing speed, the information on bio-entity relationships is archived in different forms at scattered places. Most of such information is buried in scientific literature as unstructured text. Organizing heterogeneous information in a structured form not only facilitates study of biological systems using integrative approaches, but also allows discovery of new knowledge in an automatic and systematic way. In this study, we performed a large scale integration of bio-entity relationship information from both databases containing manually annotated, structured information and automatic information extraction of unstructured text in scientific literature. The relationship information we integrated in this study includes protein–protein interactions, protein/gene regulations, protein–small molecule interactions, protein–GO relationships, protein–pathway relationships, and pathway–disease relationships. The relationship information is organized in a graph data structure, named integrated bio-entity network (IBN), where the vertices are the bio-entities and edges represent their relationships. Under this framework, graph theoretic algorithms can be designed to perform various knowledge discovery tasks. We designed breadth-first search with pruning (BFSP) and most probable path (MPP) algorithms to automatically generate hypotheses—the indirect relationships with high probabilities in the network. We show that IBN can be used to generate plausible hypotheses, which not only help to better understand the complex interactions in biological systems, but also provide guidance for experimental designs. PMID:21738677
Folding and Stabilization of Native-Sequence-Reversed Proteins
Zhang, Yuanzhao; Weber, Jeffrey K; Zhou, Ruhong
2016-01-01
Though the problem of sequence-reversed protein folding is largely unexplored, one might speculate that reversed native protein sequences should be significantly more foldable than purely random heteropolymer sequences. In this article, we investigate how the reverse-sequences of native proteins might fold by examining a series of small proteins of increasing structural complexity (α-helix, β-hairpin, α-helix bundle, and α/β-protein). Employing a tandem protein structure prediction algorithmic and molecular dynamics simulation approach, we find that the ability of reverse sequences to adopt native-like folds is strongly influenced by protein size and the flexibility of the native hydrophobic core. For β-hairpins with reverse-sequences that fail to fold, we employ a simple mutational strategy for guiding stable hairpin formation that involves the insertion of amino acids into the β-turn region. This systematic look at reverse sequence duality sheds new light on the problem of protein sequence-structure mapping and may serve to inspire new protein design and protein structure prediction protocols. PMID:27113844
Folding and Stabilization of Native-Sequence-Reversed Proteins
NASA Astrophysics Data System (ADS)
Zhang, Yuanzhao; Weber, Jeffrey K.; Zhou, Ruhong
2016-04-01
Though the problem of sequence-reversed protein folding is largely unexplored, one might speculate that reversed native protein sequences should be significantly more foldable than purely random heteropolymer sequences. In this article, we investigate how the reverse-sequences of native proteins might fold by examining a series of small proteins of increasing structural complexity (α-helix, β-hairpin, α-helix bundle, and α/β-protein). Employing a tandem protein structure prediction algorithmic and molecular dynamics simulation approach, we find that the ability of reverse sequences to adopt native-like folds is strongly influenced by protein size and the flexibility of the native hydrophobic core. For β-hairpins with reverse-sequences that fail to fold, we employ a simple mutational strategy for guiding stable hairpin formation that involves the insertion of amino acids into the β-turn region. This systematic look at reverse sequence duality sheds new light on the problem of protein sequence-structure mapping and may serve to inspire new protein design and protein structure prediction protocols.
Jiang, Xiaoying; Wei, Rong; Zhao, Yanjun; Zhang, Tongliang
2008-05-01
The knowledge of subnuclear localization in eukaryotic cells is essential for understanding the life function of nucleus. Developing prediction methods and tools for proteins subnuclear localization become important research fields in protein science for special characteristics in cell nuclear. In this study, a novel approach has been proposed to predict protein subnuclear localization. Sample of protein is represented by Pseudo Amino Acid (PseAA) composition based on approximate entropy (ApEn) concept, which reflects the complexity of time series. A novel ensemble classifier is designed incorporating three AdaBoost classifiers. The base classifier algorithms in three AdaBoost are decision stumps, fuzzy K nearest neighbors classifier, and radial basis-support vector machines, respectively. Different PseAA compositions are used as input data of different AdaBoost classifier in ensemble. Genetic algorithm is used to optimize the dimension and weight factor of PseAA composition. Two datasets often used in published works are used to validate the performance of the proposed approach. The obtained results of Jackknife cross-validation test are higher and more balance than them of other methods on same datasets. The promising results indicate that the proposed approach is effective and practical. It might become a useful tool in protein subnuclear localization. The software in Matlab and supplementary materials are available freely by contacting the corresponding author.
A discrete search algorithm for finding the structure of protein backbones and side chains.
Sallaume, Silas; Martins, Simone de Lima; Ochi, Luiz Satoru; Da Silva, Warley Gramacho; Lavor, Carlile; Liberti, Leo
2013-01-01
Some information about protein structure can be obtained by using Nuclear Magnetic Resonance (NMR) techniques, but they provide only a sparse set of distances between atoms in a protein. The Molecular Distance Geometry Problem (MDGP) consists in determining the three-dimensional structure of a molecule using a set of known distances between some atoms. Recently, a Branch and Prune (BP) algorithm was proposed to calculate the backbone of a protein, based on a discrete formulation for the MDGP. We present an extension of the BP algorithm that can calculate not only the protein backbone, but the whole three-dimensional structure of proteins.
Kilambi, Krishna Praneeth; Pacella, Michael S; Xu, Jianqing; Labonte, Jason W; Porter, Justin R; Muthu, Pravin; Drew, Kevin; Kuroda, Daisuke; Schueler-Furman, Ora; Bonneau, Richard; Gray, Jeffrey J
2013-12-01
Rounds 20-27 of the Critical Assessment of PRotein Interactions (CAPRI) provided a testing platform for computational methods designed to address a wide range of challenges. The diverse targets drove the creation of and new combinations of computational tools. In this study, RosettaDock and other novel Rosetta protocols were used to successfully predict four of the 10 blind targets. For example, for DNase domain of Colicin E2-Im2 immunity protein, RosettaDock and RosettaLigand were used to predict the positions of water molecules at the interface, recovering 46% of the native water-mediated contacts. For α-repeat Rep4-Rep2 and g-type lysozyme-PliG inhibitor complexes, homology models were built and standard and pH-sensitive docking algorithms were used to generate structures with interface RMSD values of 3.3 Å and 2.0 Å, respectively. A novel flexible sugar-protein docking protocol was also developed and used for structure prediction of the BT4661-heparin-like saccharide complex, recovering 71% of the native contacts. Challenges remain in the generation of accurate homology models for protein mutants and sampling during global docking. On proteins designed to bind influenza hemagglutinin, only about half of the mutations were identified that affect binding (T55: 54%; T56: 48%). The prediction of the structure of the xylanase complex involving homology modeling and multidomain docking pushed the limits of global conformational sampling and did not result in any successful prediction. The diversity of problems at hand requires computational algorithms to be versatile; the recent additions to the Rosetta suite expand the capabilities to encompass more biologically realistic docking problems. Copyright © 2013 Wiley Periodicals, Inc.
Peptide Peak Detection for Low Resolution MALDI-TOF Mass Spectrometry.
Yao, Jingwen; Utsunomiya, Shin-Ichi; Kajihara, Shigeki; Tabata, Tsuyoshi; Aoshima, Ken; Oda, Yoshiya; Tanaka, Koichi
2014-01-01
A new peak detection method has been developed for rapid selection of peptide and its fragment ion peaks for protein identification using tandem mass spectrometry. The algorithm applies classification of peak intensities present in the defined mass range to determine the noise level. A threshold is then given to select ion peaks according to the determined noise level in each mass range. This algorithm was initially designed for the peak detection of low resolution peptide mass spectra, such as matrix-assisted laser desorption/ionization Time-of-Flight (MALDI-TOF) mass spectra. But it can also be applied to other type of mass spectra. This method has demonstrated obtaining a good rate of number of real ions to noises for even poorly fragmented peptide spectra. The effect of using peak lists generated from this method produces improved protein scores in database search results. The reliability of the protein identifications is increased by finding more peptide identifications. This software tool is freely available at the Mass++ home page (http://www.first-ms3d.jp/english/achievement/software/).
Peptide Peak Detection for Low Resolution MALDI-TOF Mass Spectrometry
Yao, Jingwen; Utsunomiya, Shin-ichi; Kajihara, Shigeki; Tabata, Tsuyoshi; Aoshima, Ken; Oda, Yoshiya; Tanaka, Koichi
2014-01-01
A new peak detection method has been developed for rapid selection of peptide and its fragment ion peaks for protein identification using tandem mass spectrometry. The algorithm applies classification of peak intensities present in the defined mass range to determine the noise level. A threshold is then given to select ion peaks according to the determined noise level in each mass range. This algorithm was initially designed for the peak detection of low resolution peptide mass spectra, such as matrix-assisted laser desorption/ionization Time-of-Flight (MALDI-TOF) mass spectra. But it can also be applied to other type of mass spectra. This method has demonstrated obtaining a good rate of number of real ions to noises for even poorly fragmented peptide spectra. The effect of using peak lists generated from this method produces improved protein scores in database search results. The reliability of the protein identifications is increased by finding more peptide identifications. This software tool is freely available at the Mass++ home page (http://www.first-ms3d.jp/english/achievement/software/). PMID:26819872
The tip of the iceberg: RNA-binding proteins with prion-like domains in neurodegenerative disease
King, Oliver D.; Gitler, Aaron D.; Shorter, James
2012-01-01
Prions are self-templating protein conformers that are naturally transmitted between individuals and promote phenotypic change. In yeast, prion-encoded phenotypes can be beneficial, neutral or deleterious depending upon genetic background and environmental conditions. A distinctive and portable ‘prion domain’ enriched in asparagine, glutamine, tyrosine and glycine residues unifies the majority of yeast prion proteins. Deletion of this domain precludes prionogenesis and appending this domain to reporter proteins can confer prionogenicity. An algorithm designed to detect prion domains has successfully identified 19 domains that can confer prion behavior. Scouring the human genome with this algorithm enriches a select group of RNA-binding proteins harboring a canonical RNA recognition motif (RRM) and a putative prion domain. Indeed, of 210 human RRM-bearing proteins, 29 have a putative prion domain, and 12 of these are in the top 60 prion candidates in the entire genome. Startlingly, these RNA-binding prion candidates are inexorably emerging, one by one, in the pathology and genetics of devastating neurodegenerative disorders, including: amyotrophic lateral sclerosis (ALS), frontotemporal lobar degeneration with ubiquitin-positive inclusions (FTLD-U), Alzheimer’s disease and Huntington’s disease. For example, FUS and TDP-43, which rank 1st and 10th among RRM-bearing prion candidates, form cytoplasmic inclusions in the degenerating motor neurons of ALS patients and mutations in TDP-43 and FUS cause familial ALS. Recently, perturbed RNA-binding proteostasis of TAF15, which is the 2nd ranked RRM-bearing prion candidate, has been connected with ALS and FTLD-U. We strongly suspect that we have now merely reached the tip of the iceberg. We predict that additional RNA-binding prion candidates identified by our algorithm will soon surface as genetic modifiers or causes of diverse neurodegenerative conditions. Indeed, simple prion-like transfer mechanisms involving the prion-like domains of RNA-binding proteins could underlie the classical non-cell-autonomous emanation of neurodegenerative pathology from originating epicenters to neighboring portions of the nervous system. PMID:22445064
A Computational Algorithm for Functional Clustering of Proteome Dynamics During Development
Wang, Yaqun; Wang, Ningtao; Hao, Han; Guo, Yunqian; Zhen, Yan; Shi, Jisen; Wu, Rongling
2014-01-01
Phenotypic traits, such as seed development, are a consequence of complex biochemical interactions among genes, proteins and metabolites, but the underlying mechanisms that operate in a coordinated and sequential manner remain elusive. Here, we address this issue by developing a computational algorithm to monitor proteome changes during the course of trait development. The algorithm is built within the mixture-model framework in which each mixture component is modeled by a specific group of proteins that display a similar temporal pattern of expression in trait development. A nonparametric approach based on Legendre orthogonal polynomials was used to fit dynamic changes of protein expression, increasing the power and flexibility of protein clustering. By analyzing a dataset of proteomic dynamics during early embryogenesis of the Chinese fir, the algorithm has successfully identified several distinct types of proteins that coordinate with each other to determine seed development in this forest tree commercially and environmentally important to China. The algorithm will find its immediate applications for the characterization of mechanistic underpinnings for any other biological processes in which protein abundance plays a key role. PMID:24955031
2011-01-01
Background Existing methods of predicting DNA-binding proteins used valuable features of physicochemical properties to design support vector machine (SVM) based classifiers. Generally, selection of physicochemical properties and determination of their corresponding feature vectors rely mainly on known properties of binding mechanism and experience of designers. However, there exists a troublesome problem for designers that some different physicochemical properties have similar vectors of representing 20 amino acids and some closely related physicochemical properties have dissimilar vectors. Results This study proposes a systematic approach (named Auto-IDPCPs) to automatically identify a set of physicochemical and biochemical properties in the AAindex database to design SVM-based classifiers for predicting and analyzing DNA-binding domains/proteins. Auto-IDPCPs consists of 1) clustering 531 amino acid indices in AAindex into 20 clusters using a fuzzy c-means algorithm, 2) utilizing an efficient genetic algorithm based optimization method IBCGA to select an informative feature set of size m to represent sequences, and 3) analyzing the selected features to identify related physicochemical properties which may affect the binding mechanism of DNA-binding domains/proteins. The proposed Auto-IDPCPs identified m=22 features of properties belonging to five clusters for predicting DNA-binding domains with a five-fold cross-validation accuracy of 87.12%, which is promising compared with the accuracy of 86.62% of the existing method PSSM-400. For predicting DNA-binding sequences, the accuracy of 75.50% was obtained using m=28 features, where PSSM-400 has an accuracy of 74.22%. Auto-IDPCPs and PSSM-400 have accuracies of 80.73% and 82.81%, respectively, applied to an independent test data set of DNA-binding domains. Some typical physicochemical properties discovered are hydrophobicity, secondary structure, charge, solvent accessibility, polarity, flexibility, normalized Van Der Waals volume, pK (pK-C, pK-N, pK-COOH and pK-a(RCOOH)), etc. Conclusions The proposed approach Auto-IDPCPs would help designers to investigate informative physicochemical and biochemical properties by considering both prediction accuracy and analysis of binding mechanism simultaneously. The approach Auto-IDPCPs can be also applicable to predict and analyze other protein functions from sequences. PMID:21342579
P-Finder: Reconstruction of Signaling Networks from Protein-Protein Interactions and GO Annotations.
Young-Rae Cho; Yanan Xin; Speegle, Greg
2015-01-01
Because most complex genetic diseases are caused by defects of cell signaling, illuminating a signaling cascade is essential for understanding their mechanisms. We present three novel computational algorithms to reconstruct signaling networks between a starting protein and an ending protein using genome-wide protein-protein interaction (PPI) networks and gene ontology (GO) annotation data. A signaling network is represented as a directed acyclic graph in a merged form of multiple linear pathways. An advanced semantic similarity metric is applied for weighting PPIs as the preprocessing of all three methods. The first algorithm repeatedly extends the list of nodes based on path frequency towards an ending protein. The second algorithm repeatedly appends edges based on the occurrence of network motifs which indicate the link patterns more frequently appearing in a PPI network than in a random graph. The last algorithm uses the information propagation technique which iteratively updates edge orientations based on the path strength and merges the selected directed edges. Our experimental results demonstrate that the proposed algorithms achieve higher accuracy than previous methods when they are tested on well-studied pathways of S. cerevisiae. Furthermore, we introduce an interactive web application tool, called P-Finder, to visualize reconstructed signaling networks.
Martinez, Carlos A.; Barr, Kenneth; Kim, Ah-Ram; Reinitz, John
2013-01-01
Synthetic biology offers novel opportunities for elucidating transcriptional regulatory mechanisms and enhancer logic. Complex cis-regulatory sequences—like the ones driving expression of the Drosophila even-skipped gene—have proven difficult to design from existing knowledge, presumably due to the large number of protein-protein interactions needed to drive the correct expression patterns of genes in multicellular organisms. This work discusses two novel computational methods for the custom design of enhancers that employ a sophisticated, empirically validated transcriptional model, optimization algorithms, and synthetic biology. These synthetic elements have both utilitarian and academic value, including improving existing regulatory models as well as evolutionary questions. The first method involves the use of simulated annealing to explore the sequence space for synthetic enhancers whose expression output fit a given search criterion. The second method uses a novel optimization algorithm to find functionally accessible pathways between two enhancer sequences. These paths describe a set of mutations wherein the predicted expression pattern does not significantly vary at any point along the path. Both methods rely on a predictive mathematical framework that maps the enhancer sequence space to functional output. PMID:23732772
Prediction of Drug-Plasma Protein Binding Using Artificial Intelligence Based Algorithms.
Kumar, Rajnish; Sharma, Anju; Siddiqui, Mohammed Haris; Tiwari, Rajesh Kumar
2018-01-01
Plasma protein binding (PPB) has vital importance in the characterization of drug distribution in the systemic circulation. Unfavorable PPB can pose a negative effect on clinical development of promising drug candidates. The drug distribution properties should be considered at the initial phases of the drug design and development. Therefore, PPB prediction models are receiving an increased attention. In the current study, we present a systematic approach using Support vector machine, Artificial neural network, k- nearest neighbor, Probabilistic neural network, Partial least square and Linear discriminant analysis to relate various in vitro and in silico molecular descriptors to a diverse dataset of 736 drugs/drug-like compounds. The overall accuracy of Support vector machine with Radial basis function kernel came out to be comparatively better than the rest of the applied algorithms. The training set accuracy, validation set accuracy, precision, sensitivity, specificity and F1 score for the Suprort vector machine was found to be 89.73%, 89.97%, 92.56%, 87.26%, 91.97% and 0.898, respectively. This model can potentially be useful in screening of relevant drug candidates at the preliminary stages of drug design and development. Copyright© Bentham Science Publishers; For any queries, please email at epub@benthamscience.org.
OSPREY Predicts Resistance Mutations Using Positive and Negative Computational Protein Design.
Ojewole, Adegoke; Lowegard, Anna; Gainza, Pablo; Reeve, Stephanie M; Georgiev, Ivelin; Anderson, Amy C; Donald, Bruce R
2017-01-01
Drug resistance in protein targets is an increasingly common phenomenon that reduces the efficacy of both existing and new antibiotics. However, knowledge of future resistance mutations during pre-clinical phases of drug development would enable the design of novel antibiotics that are robust against not only known resistant mutants, but also against those that have not yet been clinically observed. Computational structure-based protein design (CSPD) is a transformative field that enables the prediction of protein sequences with desired biochemical properties such as binding affinity and specificity to a target. The use of CSPD to predict previously unseen resistance mutations represents one of the frontiers of computational protein design. In a recent study (Reeve et al. Proc Natl Acad Sci U S A 112(3):749-754, 2015), we used our OSPREY (Open Source Protein REdesign for You) suite of CSPD algorithms to prospectively predict resistance mutations that arise in the active site of the dihydrofolate reductase enzyme from methicillin-resistant Staphylococcus aureus (SaDHFR) in response to selective pressure from an experimental competitive inhibitor. We demonstrated that our top predicted candidates are indeed viable resistant mutants. Since that study, we have significantly enhanced the capabilities of OSPREY with not only improved modeling of backbone flexibility, but also efficient multi-state design, fast sparse approximations, partitioned continuous rotamers for more accurate energy bounds, and a computationally efficient representation of molecular-mechanics and quantum-mechanical energy functions. Here, using SaDHFR as an example, we present a protocol for resistance prediction using the latest version of OSPREY. Specifically, we show how to use a combination of positive and negative design to predict active site escape mutations that maintain the enzyme's catalytic function but selectively ablate binding of an inhibitor.
Structure-guided Protein Transition Modeling with a Probabilistic Roadmap Algorithm.
Maximova, Tatiana; Plaku, Erion; Shehu, Amarda
2016-07-07
Proteins are macromolecules in perpetual motion, switching between structural states to modulate their function. A detailed characterization of the precise yet complex relationship between protein structure, dynamics, and function requires elucidating transitions between functionally-relevant states. Doing so challenges both wet and dry laboratories, as protein dynamics involves disparate temporal scales. In this paper we present a novel, sampling-based algorithm to compute transition paths. The algorithm exploits two main ideas. First, it leverages known structures to initialize its search and define a reduced conformation space for rapid sampling. This is key to address the insufficient sampling issue suffered by sampling-based algorithms. Second, the algorithm embeds samples in a nearest-neighbor graph where transition paths can be efficiently computed via queries. The algorithm adapts the probabilistic roadmap framework that is popular in robot motion planning. In addition to efficiently computing lowest-cost paths between any given structures, the algorithm allows investigating hypotheses regarding the order of experimentally-known structures in a transition event. This novel contribution is likely to open up new venues of research. Detailed analysis is presented on multiple-basin proteins of relevance to human disease. Multiscaling and the AMBER ff14SB force field are used to obtain energetically-credible paths at atomistic detail.
PROPER: global protein interaction network alignment through percolation matching.
Kazemi, Ehsan; Hassani, Hamed; Grossglauser, Matthias; Pezeshgi Modarres, Hassan
2016-12-12
The alignment of protein-protein interaction (PPI) networks enables us to uncover the relationships between different species, which leads to a deeper understanding of biological systems. Network alignment can be used to transfer biological knowledge between species. Although different PPI-network alignment algorithms were introduced during the last decade, developing an accurate and scalable algorithm that can find alignments with high biological and structural similarities among PPI networks is still challenging. In this paper, we introduce a new global network alignment algorithm for PPI networks called PROPER. Compared to other global network alignment methods, our algorithm shows higher accuracy and speed over real PPI datasets and synthetic networks. We show that the PROPER algorithm can detect large portions of conserved biological pathways between species. Also, using a simple parsimonious evolutionary model, we explain why PROPER performs well based on several different comparison criteria. We highlight that PROPER has high potential in further applications such as detecting biological pathways, finding protein complexes and PPI prediction. The PROPER algorithm is available at http://proper.epfl.ch .
Automated Lead Optimization of MMP-12 Inhibitors Using a Genetic Algorithm.
Pickett, Stephen D; Green, Darren V S; Hunt, David L; Pardoe, David A; Hughes, Ian
2011-01-13
Traditional lead optimization projects involve long synthesis and testing cycles, favoring extensive structure-activity relationship (SAR) analysis and molecular design steps, in an attempt to limit the number of cycles that a project must run to optimize a development candidate. Microfluidic-based chemistry and biology platforms, with cycle times of minutes rather than weeks, lend themselves to unattended autonomous operation. The bottleneck in the lead optimization process is therefore shifted from synthesis or test to SAR analysis and design. As such, the way is open to an algorithm-directed process, without the need for detailed user data analysis. Here, we present results of two synthesis and screening experiments, undertaken using traditional methodology, to validate a genetic algorithm optimization process for future application to a microfluidic system. The algorithm has several novel features that are important for the intended application. For example, it is robust to missing data and can suggest compounds for retest to ensure reliability of optimization. The algorithm is first validated on a retrospective analysis of an in-house library embedded in a larger virtual array of presumed inactive compounds. In a second, prospective experiment with MMP-12 as the target protein, 140 compounds are submitted for synthesis over 10 cycles of optimization. Comparison is made to the results from the full combinatorial library that was synthesized manually and tested independently. The results show that compounds selected by the algorithm are heavily biased toward the more active regions of the library, while the algorithm is robust to both missing data (compounds where synthesis failed) and inactive compounds. This publication places the full combinatorial library and biological data into the public domain with the intention of advancing research into algorithm-directed lead optimization methods.
Automated Lead Optimization of MMP-12 Inhibitors Using a Genetic Algorithm
2010-01-01
Traditional lead optimization projects involve long synthesis and testing cycles, favoring extensive structure−activity relationship (SAR) analysis and molecular design steps, in an attempt to limit the number of cycles that a project must run to optimize a development candidate. Microfluidic-based chemistry and biology platforms, with cycle times of minutes rather than weeks, lend themselves to unattended autonomous operation. The bottleneck in the lead optimization process is therefore shifted from synthesis or test to SAR analysis and design. As such, the way is open to an algorithm-directed process, without the need for detailed user data analysis. Here, we present results of two synthesis and screening experiments, undertaken using traditional methodology, to validate a genetic algorithm optimization process for future application to a microfluidic system. The algorithm has several novel features that are important for the intended application. For example, it is robust to missing data and can suggest compounds for retest to ensure reliability of optimization. The algorithm is first validated on a retrospective analysis of an in-house library embedded in a larger virtual array of presumed inactive compounds. In a second, prospective experiment with MMP-12 as the target protein, 140 compounds are submitted for synthesis over 10 cycles of optimization. Comparison is made to the results from the full combinatorial library that was synthesized manually and tested independently. The results show that compounds selected by the algorithm are heavily biased toward the more active regions of the library, while the algorithm is robust to both missing data (compounds where synthesis failed) and inactive compounds. This publication places the full combinatorial library and biological data into the public domain with the intention of advancing research into algorithm-directed lead optimization methods. PMID:24900251
Elaziz, Mohamed Abd; Hemdan, Ahmed Monem; Hassanien, AboulElla; Oliva, Diego; Xiong, Shengwu
2017-09-07
The current economics of the fish protein industry demand rapid, accurate and expressive prediction algorithms at every step of protein production especially with the challenge of global climate change. This help to predict and analyze functional and nutritional quality then consequently control food allergies in hyper allergic patients. As, it is quite expensive and time-consuming to know these concentrations by the lab experimental tests, especially to conduct large-scale projects. Therefore, this paper introduced a new intelligent algorithm using adaptive neuro-fuzzy inference system based on whale optimization algorithm. This algorithm is used to predict the concentration levels of bioactive amino acids in fish protein hydrolysates at different times during the year. The whale optimization algorithm is used to determine the optimal parameters in adaptive neuro-fuzzy inference system. The results of proposed algorithm are compared with others and it is indicated the higher performance of the proposed algorithm.
Computational Tools and Algorithms for Designing Customized Synthetic Genes
Gould, Nathan; Hendy, Oliver; Papamichail, Dimitris
2014-01-01
Advances in DNA synthesis have enabled the construction of artificial genes, gene circuits, and genomes of bacterial scale. Freedom in de novo design of synthetic constructs provides significant power in studying the impact of mutations in sequence features, and verifying hypotheses on the functional information that is encoded in nucleic and amino acids. To aid this goal, a large number of software tools of variable sophistication have been implemented, enabling the design of synthetic genes for sequence optimization based on rationally defined properties. The first generation of tools dealt predominantly with singular objectives such as codon usage optimization and unique restriction site incorporation. Recent years have seen the emergence of sequence design tools that aim to evolve sequences toward combinations of objectives. The design of optimal protein-coding sequences adhering to multiple objectives is computationally hard, and most tools rely on heuristics to sample the vast sequence design space. In this review, we study some of the algorithmic issues behind gene optimization and the approaches that different tools have adopted to redesign genes and optimize desired coding features. We utilize test cases to demonstrate the efficiency of each approach, as well as identify their strengths and limitations. PMID:25340050
2015-01-01
False negative docking outcomes for highly symmetric molecules are a barrier to the accurate evaluation of docking programs, scoring functions, and protocols. This work describes an implementation of a symmetry-corrected root-mean-square deviation (RMSD) method into the program DOCK based on the Hungarian algorithm for solving the minimum assignment problem, which dynamically assigns atom correspondence in molecules with symmetry. The algorithm adds only a trivial amount of computation time to the RMSD calculations and is shown to increase the reported overall docking success rate by approximately 5% when tested over 1043 receptor–ligand systems. For some families of protein systems the results are even more dramatic, with success rate increases up to 16.7%. Several additional applications of the method are also presented including as a pairwise similarity metric to compare molecules during de novo design, as a scoring function to rank-order virtual screening results, and for the analysis of trajectories from molecular dynamics simulation. The new method, including source code, is available to registered users of DOCK6 (http://dock.compbio.ucsf.edu). PMID:24410429
Maintaining and Enhancing Diversity of Sampled Protein Conformations in Robotics-Inspired Methods.
Abella, Jayvee R; Moll, Mark; Kavraki, Lydia E
2018-01-01
The ability to efficiently sample structurally diverse protein conformations allows one to gain a high-level view of a protein's energy landscape. Algorithms from robot motion planning have been used for conformational sampling, and several of these algorithms promote diversity by keeping track of "coverage" in conformational space based on the local sampling density. However, large proteins present special challenges. In particular, larger systems require running many concurrent instances of these algorithms, but these algorithms can quickly become memory intensive because they typically keep previously sampled conformations in memory to maintain coverage estimates. In addition, robotics-inspired algorithms depend on defining useful perturbation strategies for exploring the conformational space, which is a difficult task for large proteins because such systems are typically more constrained and exhibit complex motions. In this article, we introduce two methodologies for maintaining and enhancing diversity in robotics-inspired conformational sampling. The first method addresses algorithms based on coverage estimates and leverages the use of a low-dimensional projection to define a global coverage grid that maintains coverage across concurrent runs of sampling. The second method is an automatic definition of a perturbation strategy through readily available flexibility information derived from B-factors, secondary structure, and rigidity analysis. Our results show a significant increase in the diversity of the conformations sampled for proteins consisting of up to 500 residues when applied to a specific robotics-inspired algorithm for conformational sampling. The methodologies presented in this article may be vital components for the scalability of robotics-inspired approaches.
Identify High-Quality Protein Structural Models by Enhanced K-Means.
Wu, Hongjie; Li, Haiou; Jiang, Min; Chen, Cheng; Lv, Qiang; Wu, Chuang
2017-01-01
Background. One critical issue in protein three-dimensional structure prediction using either ab initio or comparative modeling involves identification of high-quality protein structural models from generated decoys. Currently, clustering algorithms are widely used to identify near-native models; however, their performance is dependent upon different conformational decoys, and, for some algorithms, the accuracy declines when the decoy population increases. Results. Here, we proposed two enhanced K -means clustering algorithms capable of robustly identifying high-quality protein structural models. The first one employs the clustering algorithm SPICKER to determine the initial centroids for basic K -means clustering ( SK -means), whereas the other employs squared distance to optimize the initial centroids ( K -means++). Our results showed that SK -means and K -means++ were more robust as compared with SPICKER alone, detecting 33 (59%) and 42 (75%) of 56 targets, respectively, with template modeling scores better than or equal to those of SPICKER. Conclusions. We observed that the classic K -means algorithm showed a similar performance to that of SPICKER, which is a widely used algorithm for protein-structure identification. Both SK -means and K -means++ demonstrated substantial improvements relative to results from SPICKER and classical K -means.
Identify High-Quality Protein Structural Models by Enhanced K-Means
Li, Haiou; Chen, Cheng; Lv, Qiang; Wu, Chuang
2017-01-01
Background. One critical issue in protein three-dimensional structure prediction using either ab initio or comparative modeling involves identification of high-quality protein structural models from generated decoys. Currently, clustering algorithms are widely used to identify near-native models; however, their performance is dependent upon different conformational decoys, and, for some algorithms, the accuracy declines when the decoy population increases. Results. Here, we proposed two enhanced K-means clustering algorithms capable of robustly identifying high-quality protein structural models. The first one employs the clustering algorithm SPICKER to determine the initial centroids for basic K-means clustering (SK-means), whereas the other employs squared distance to optimize the initial centroids (K-means++). Our results showed that SK-means and K-means++ were more robust as compared with SPICKER alone, detecting 33 (59%) and 42 (75%) of 56 targets, respectively, with template modeling scores better than or equal to those of SPICKER. Conclusions. We observed that the classic K-means algorithm showed a similar performance to that of SPICKER, which is a widely used algorithm for protein-structure identification. Both SK-means and K-means++ demonstrated substantial improvements relative to results from SPICKER and classical K-means. PMID:28421198
Padariya, Monikaben; Kalathiya, Umesh
2016-10-01
Fat mass and obesity-associated (FTO) protein contributes to non-syndromic human obesity which refers to excessive fat accumulation in human body and results in health risk. FTO protein has become a promising target for anti-obesity medicines as there is an immense need for the rational design of potent inhibitors to treat obesity. In our study, a new scaffold N-phenyl-1H-indol-2-amine was selected as a base for FTO protein inhibitors by applying scaffold hopping approach. Using this novel scaffold, different derivatives were designed by extending scaffold structure with potential functional groups. Molecular docking simulations were carried out by using two different docking algorithm implemented in CDOCKER (flexible docking) and AutoDock programs (rigid docking). Analyzing results of rigid and flexible docking, compound MU06 was selected based on different properties and predicted binding affinities for further analysis. Molecular dynamics simulation of FTO/MU06 complex was performed to characterize structure rationale and binding stability. Certainly, Arg96 and His231 residue of FTO protein showed stable interaction with inhibitor MU06 throughout the production dynamics phase. Three residues of FTO protein (Arg96, Asp233, and His231) were found common in making H-bond interactions with MU06 during molecular dynamics simulation and CDOCKER docking. Copyright © 2016 Elsevier Ltd. All rights reserved.
Boulila, Moncef; Ben Tiba, Sawssen; Jilani, Saoussen
2013-04-01
The sequence alignments of five Tunisian isolates of Prunus necrotic ringspot virus (PNRSV) were searched for evidence of recombination and diversifying selection. Since failing to account for recombination can elevate the false positive error rate in positive selection inference, a genetic algorithm (GARD) was used first and led to the detection of potential recombination events in the coat protein-encoding gene of that virus. The Recco algorithm confirmed these results by identifying, additionally, the potential recombinants. For neutrality testing and evaluation of nucleotide polymorphism in PNRSV CP gene, Tajima's D, and Fu and Li's D and F statistical tests were used. About selection inference, eight algorithms (SLAC, FEL, IFEL, REL, FUBAR, MEME, PARRIS, and GA branch) incorporated in HyPhy package were utilized to assess the selection pressure exerted on the expression of PNRSV capsid. Inferred phylogenies pointed out, in addition to the three classical groups (PE-5, PV-32, and PV-96), the delineation of a fourth cluster having the new proposed designation SW6, and a fifth clade comprising four Tunisian PNRSV isolates which underwent recombination and selective pressure and to which the name Tunisian outgroup was allocated.
CombiMotif: A new algorithm for network motifs discovery in protein-protein interaction networks
NASA Astrophysics Data System (ADS)
Luo, Jiawei; Li, Guanghui; Song, Dan; Liang, Cheng
2014-12-01
Discovering motifs in protein-protein interaction networks is becoming a current major challenge in computational biology, since the distribution of the number of network motifs can reveal significant systemic differences among species. However, this task can be computationally expensive because of the involvement of graph isomorphic detection. In this paper, we present a new algorithm (CombiMotif) that incorporates combinatorial techniques to count non-induced occurrences of subgraph topologies in the form of trees. The efficiency of our algorithm is demonstrated by comparing the obtained results with the current state-of-the art subgraph counting algorithms. We also show major differences between unicellular and multicellular organisms. The datasets and source code of CombiMotif are freely available upon request.
Surface energetics and protein-protein interactions: analysis and mechanistic implications
Peri, Claudio; Morra, Giulia; Colombo, Giorgio
2016-01-01
Understanding protein-protein interactions (PPI) at the molecular level is a fundamental task in the design of new drugs, the prediction of protein function and the clarification of the mechanisms of (dis)regulation of biochemical pathways. In this study, we use a novel computational approach to investigate the energetics of aminoacid networks located on the surface of proteins, isolated and in complex with their respective partners. Interestingly, the analysis of individual proteins identifies patches of surface residues that, when mapped on the structure of their respective complexes, reveal regions of residue-pair couplings that extend across the binding interfaces, forming continuous motifs. An enhanced effect is visible across the proteins of the dataset forming larger quaternary assemblies. The method indicates the presence of energetic signatures in the isolated proteins that are retained in the bound form, which we hypothesize to determine binding orientation upon complex formation. We propose our method, BLUEPRINT, as a complement to different approaches ranging from the ab-initio characterization of PPIs, to protein-protein docking algorithms, for the physico-chemical and functional investigation of protein-protein interactions. PMID:27050828
Protein Sequence Classification with Improved Extreme Learning Machine Algorithms
2014-01-01
Precisely classifying a protein sequence from a large biological protein sequences database plays an important role for developing competitive pharmacological products. Comparing the unseen sequence with all the identified protein sequences and returning the category index with the highest similarity scored protein, conventional methods are usually time-consuming. Therefore, it is urgent and necessary to build an efficient protein sequence classification system. In this paper, we study the performance of protein sequence classification using SLFNs. The recent efficient extreme learning machine (ELM) and its invariants are utilized as the training algorithms. The optimal pruned ELM is first employed for protein sequence classification in this paper. To further enhance the performance, the ensemble based SLFNs structure is constructed where multiple SLFNs with the same number of hidden nodes and the same activation function are used as ensembles. For each ensemble, the same training algorithm is adopted. The final category index is derived using the majority voting method. Two approaches, namely, the basic ELM and the OP-ELM, are adopted for the ensemble based SLFNs. The performance is analyzed and compared with several existing methods using datasets obtained from the Protein Information Resource center. The experimental results show the priority of the proposed algorithms. PMID:24795876
Liang, Shide; Li, Liwei; Hsu, Wei-Lun; Pilcher, Meaghan N.; Uversky, Vladimir; Zhou, Yaoqi; Dunker, A. Keith; Meroueh, Samy O.
2009-01-01
The significant work that has been invested toward understanding protein–protein interaction has not translated into significant advances in structure-based predictions. In particular redesigning protein surfaces to bind to unrelated receptors remains a challenge, partly due to receptor flexibility, which is often neglected in these efforts. In this work, we computationally graft the binding epitope of various small proteins obtained from the RCSB database to bind to barnase, lysozyme, and trypsin using a previously derived and validated algorithm. In an effort to probe the protein complexes in a realistic environment, all native and designer complexes were subjected to a total of nearly 400 ns of explicit-solvent molecular dynamics (MD) simulation. The MD data led to an unexpected observation: some of the designer complexes were highly unstable and decomposed during the trajectories. In contrast, the native and a number of designer complexes remained consistently stable. The unstable conformers provided us with a unique opportunity to define the structural and energetic factors that lead to unproductive protein–protein complexes. To that end we used free energy calculations following the MM-PBSA approach to determine the role of nonpolar effects, electrostatics and entropy in binding. Remarkably, we found that a majority of unstable complexes exhibited more favorable electrostatics than native or stable designer complexes, suggesting that favorable electrostatic interactions are not prerequisite for complex formation between proteins. However, nonpolar effects remained consistently more favorable in native and stable designer complexes reinforcing the importance of hydrophobic effects in protein–protein binding. While entropy systematically opposed binding in all cases, there was no observed trend in the entropy difference between native and designer complexes. A series of alanine scanning mutations of hot-spot residues at the interface of native and designer complexes showed less than optimal contacts of hot-spot residues with their surroundings in the unstable conformers, resulting in more favorable entropy for these complexes. Finally, disorder predictions revealed that secondary structures at the interface of unstable complexes exhibited greater disorder than the stable complexes. PMID:19113835
PDB_TM: selection and membrane localization of transmembrane proteins in the protein data bank.
Tusnády, Gábor E; Dosztányi, Zsuzsanna; Simon, István
2005-01-01
PDB_TM is a database for transmembrane proteins with known structures. It aims to collect all transmembrane proteins that are deposited in the protein structure database (PDB) and to determine their membrane-spanning regions. These assignments are based on the TMDET algorithm, which uses only structural information to locate the most likely position of the lipid bilayer and to distinguish between transmembrane and globular proteins. This algorithm was applied to all PDB entries and the results were collected in the PDB_TM database. By using TMDET algorithm, the PDB_TM database can be automatically updated every week, keeping it synchronized with the latest PDB updates. The PDB_TM database is available at http://www.enzim.hu/PDB_TM.
Predicting Protein-Protein Interaction Sites with a Novel Membership Based Fuzzy SVM Classifier.
Sriwastava, Brijesh K; Basu, Subhadip; Maulik, Ujjwal
2015-01-01
Predicting residues that participate in protein-protein interactions (PPI) helps to identify, which amino acids are located at the interface. In this paper, we show that the performance of the classical support vector machine (SVM) algorithm can further be improved with the use of a custom-designed fuzzy membership function, for the partner-specific PPI interface prediction problem. We evaluated the performances of both classical SVM and fuzzy SVM (F-SVM) on the PPI databases of three different model proteomes of Homo sapiens, Escherichia coli and Saccharomyces Cerevisiae and calculated the statistical significance of the developed F-SVM over classical SVM algorithm. We also compared our performance with the available state-of-the-art fuzzy methods in this domain and observed significant performance improvements. To predict interaction sites in protein complexes, local composition of amino acids together with their physico-chemical characteristics are used, where the F-SVM based prediction method exploits the membership function for each pair of sequence fragments. The average F-SVM performance (area under ROC curve) on the test samples in 10-fold cross validation experiment are measured as 77.07, 78.39, and 74.91 percent for the aforementioned organisms respectively. Performances on independent test sets are obtained as 72.09, 73.24 and 82.74 percent respectively. The software is available for free download from http://code.google.com/p/cmater-bioinfo.
Pilla, Kala Bharath; Otting, Gottfried; Huber, Thomas
2017-03-07
Computational and nuclear magnetic resonance hybrid approaches provide efficient tools for 3D structure determination of small proteins, but currently available algorithms struggle to perform with larger proteins. Here we demonstrate a new computational algorithm that assembles the 3D structure of a protein from its constituent super-secondary structural motifs (Smotifs) with the help of pseudocontact shift (PCS) restraints for backbone amide protons, where the PCSs are produced from different metal centers. The algorithm, DINGO-PCS (3D assembly of Individual Smotifs to Near-native Geometry as Orchestrated by PCSs), employs the PCSs to recognize, orient, and assemble the constituent Smotifs of the target protein without any other experimental data or computational force fields. Using a universal Smotif database, the DINGO-PCS algorithm exhaustively enumerates any given Smotif. We benchmarked the program against ten different protein targets ranging from 100 to 220 residues with different topologies. For nine of these targets, the method was able to identify near-native Smotifs. Copyright © 2017 Elsevier Ltd. All rights reserved.
Theofilatos, Konstantinos; Pavlopoulou, Niki; Papasavvas, Christoforos; Likothanassis, Spiros; Dimitrakopoulos, Christos; Georgopoulos, Efstratios; Moschopoulos, Charalampos; Mavroudi, Seferina
2015-03-01
Proteins are considered to be the most important individual components of biological systems and they combine to form physical protein complexes which are responsible for certain molecular functions. Despite the large availability of protein-protein interaction (PPI) information, not much information is available about protein complexes. Experimental methods are limited in terms of time, efficiency, cost and performance constraints. Existing computational methods have provided encouraging preliminary results, but they phase certain disadvantages as they require parameter tuning, some of them cannot handle weighted PPI data and others do not allow a protein to participate in more than one protein complex. In the present paper, we propose a new fully unsupervised methodology for predicting protein complexes from weighted PPI graphs. The proposed methodology is called evolutionary enhanced Markov clustering (EE-MC) and it is a hybrid combination of an adaptive evolutionary algorithm and a state-of-the-art clustering algorithm named enhanced Markov clustering. EE-MC was compared with state-of-the-art methodologies when applied to datasets from the human and the yeast Saccharomyces cerevisiae organisms. Using public available datasets, EE-MC outperformed existing methodologies (in some datasets the separation metric was increased by 10-20%). Moreover, when applied to new human datasets its performance was encouraging in the prediction of protein complexes which consist of proteins with high functional similarity. In specific, 5737 protein complexes were predicted and 72.58% of them are enriched for at least one gene ontology (GO) function term. EE-MC is by design able to overcome intrinsic limitations of existing methodologies such as their inability to handle weighted PPI networks, their constraint to assign every protein in exactly one cluster and the difficulties they face concerning the parameter tuning. This fact was experimentally validated and moreover, new potentially true human protein complexes were suggested as candidates for further validation using experimental techniques. Copyright © 2015 Elsevier B.V. All rights reserved.
DAnTE: a statistical tool for quantitative analysis of –omics data
DOE Office of Scientific and Technical Information (OSTI.GOV)
Polpitiya, Ashoka D.; Qian, Weijun; Jaitly, Navdeep
2008-05-03
DAnTE (Data Analysis Tool Extension) is a statistical tool designed to address challenges unique to quantitative bottom-up, shotgun proteomics data. This tool has also been demonstrated for microarray data and can easily be extended to other high-throughput data types. DAnTE features selected normalization methods, missing value imputation algorithms, peptide to protein rollup methods, an extensive array of plotting functions, and a comprehensive ANOVA scheme that can handle unbalanced data and random effects. The Graphical User Interface (GUI) is designed to be very intuitive and user friendly.
Principles for Predicting RNA Secondary Structure Design Difficulty.
Anderson-Lee, Jeff; Fisker, Eli; Kosaraju, Vineet; Wu, Michelle; Kong, Justin; Lee, Jeehyung; Lee, Minjae; Zada, Mathew; Treuille, Adrien; Das, Rhiju
2016-02-27
Designing RNAs that form specific secondary structures is enabling better understanding and control of living systems through RNA-guided silencing, genome editing and protein organization. Little is known, however, about which RNA secondary structures might be tractable for downstream sequence design, increasing the time and expense of design efforts due to inefficient secondary structure choices. Here, we present insights into specific structural features that increase the difficulty of finding sequences that fold into a target RNA secondary structure, summarizing the design efforts of tens of thousands of human participants and three automated algorithms (RNAInverse, INFO-RNA and RNA-SSD) in the Eterna massive open laboratory. Subsequent tests through three independent RNA design algorithms (NUPACK, DSS-Opt and MODENA) confirmed the hypothesized importance of several features in determining design difficulty, including sequence length, mean stem length, symmetry and specific difficult-to-design motifs such as zigzags. Based on these results, we have compiled an Eterna100 benchmark of 100 secondary structure design challenges that span a large range in design difficulty to help test future efforts. Our in silico results suggest new routes for improving computational RNA design methods and for extending these insights to assess "designability" of single RNA structures, as well as of switches for in vitro and in vivo applications. Copyright © 2015 The Authors. Published by Elsevier Ltd.. All rights reserved.
CytoCluster: A Cytoscape Plugin for Cluster Analysis and Visualization of Biological Networks.
Li, Min; Li, Dongyan; Tang, Yu; Wu, Fangxiang; Wang, Jianxin
2017-08-31
Nowadays, cluster analysis of biological networks has become one of the most important approaches to identifying functional modules as well as predicting protein complexes and network biomarkers. Furthermore, the visualization of clustering results is crucial to display the structure of biological networks. Here we present CytoCluster, a cytoscape plugin integrating six clustering algorithms, HC-PIN (Hierarchical Clustering algorithm in Protein Interaction Networks), OH-PIN (identifying Overlapping and Hierarchical modules in Protein Interaction Networks), IPCA (Identifying Protein Complex Algorithm), ClusterONE (Clustering with Overlapping Neighborhood Expansion), DCU (Detecting Complexes based on Uncertain graph model), IPC-MCE (Identifying Protein Complexes based on Maximal Complex Extension), and BinGO (the Biological networks Gene Ontology) function. Users can select different clustering algorithms according to their requirements. The main function of these six clustering algorithms is to detect protein complexes or functional modules. In addition, BinGO is used to determine which Gene Ontology (GO) categories are statistically overrepresented in a set of genes or a subgraph of a biological network. CytoCluster can be easily expanded, so that more clustering algorithms and functions can be added to this plugin. Since it was created in July 2013, CytoCluster has been downloaded more than 9700 times in the Cytoscape App store and has already been applied to the analysis of different biological networks. CytoCluster is available from http://apps.cytoscape.org/apps/cytocluster.
CytoCluster: A Cytoscape Plugin for Cluster Analysis and Visualization of Biological Networks
Li, Min; Li, Dongyan; Tang, Yu; Wang, Jianxin
2017-01-01
Nowadays, cluster analysis of biological networks has become one of the most important approaches to identifying functional modules as well as predicting protein complexes and network biomarkers. Furthermore, the visualization of clustering results is crucial to display the structure of biological networks. Here we present CytoCluster, a cytoscape plugin integrating six clustering algorithms, HC-PIN (Hierarchical Clustering algorithm in Protein Interaction Networks), OH-PIN (identifying Overlapping and Hierarchical modules in Protein Interaction Networks), IPCA (Identifying Protein Complex Algorithm), ClusterONE (Clustering with Overlapping Neighborhood Expansion), DCU (Detecting Complexes based on Uncertain graph model), IPC-MCE (Identifying Protein Complexes based on Maximal Complex Extension), and BinGO (the Biological networks Gene Ontology) function. Users can select different clustering algorithms according to their requirements. The main function of these six clustering algorithms is to detect protein complexes or functional modules. In addition, BinGO is used to determine which Gene Ontology (GO) categories are statistically overrepresented in a set of genes or a subgraph of a biological network. CytoCluster can be easily expanded, so that more clustering algorithms and functions can be added to this plugin. Since it was created in July 2013, CytoCluster has been downloaded more than 9700 times in the Cytoscape App store and has already been applied to the analysis of different biological networks. CytoCluster is available from http://apps.cytoscape.org/apps/cytocluster. PMID:28858211
CHARMM-GUI Membrane Builder toward realistic biological membrane simulations.
Wu, Emilia L; Cheng, Xi; Jo, Sunhwan; Rui, Huan; Song, Kevin C; Dávila-Contreras, Eder M; Qi, Yifei; Lee, Jumin; Monje-Galvan, Viviana; Venable, Richard M; Klauda, Jeffery B; Im, Wonpil
2014-10-15
CHARMM-GUI Membrane Builder, http://www.charmm-gui.org/input/membrane, is a web-based user interface designed to interactively build all-atom protein/membrane or membrane-only systems for molecular dynamics simulations through an automated optimized process. In this work, we describe the new features and major improvements in Membrane Builder that allow users to robustly build realistic biological membrane systems, including (1) addition of new lipid types, such as phosphoinositides, cardiolipin (CL), sphingolipids, bacterial lipids, and ergosterol, yielding more than 180 lipid types, (2) enhanced building procedure for lipid packing around protein, (3) reliable algorithm to detect lipid tail penetration to ring structures and protein surface, (4) distance-based algorithm for faster initial ion displacement, (5) CHARMM inputs for P21 image transformation, and (6) NAMD equilibration and production inputs. The robustness of these new features is illustrated by building and simulating a membrane model of the polar and septal regions of E. coli membrane, which contains five lipid types: CL lipids with two types of acyl chains and phosphatidylethanolamine lipids with three types of acyl chains. It is our hope that CHARMM-GUI Membrane Builder becomes a useful tool for simulation studies to better understand the structure and dynamics of proteins and lipids in realistic biological membrane environments. Copyright © 2014 Wiley Periodicals, Inc.
Parallel protein secondary structure prediction based on neural networks.
Zhong, Wei; Altun, Gulsah; Tian, Xinmin; Harrison, Robert; Tai, Phang C; Pan, Yi
2004-01-01
Protein secondary structure prediction has a fundamental influence on today's bioinformatics research. In this work, binary and tertiary classifiers of protein secondary structure prediction are implemented on Denoeux belief neural network (DBNN) architecture. Hydrophobicity matrix, orthogonal matrix, BLOSUM62 and PSSM (position specific scoring matrix) are experimented separately as the encoding schemes for DBNN. The experimental results contribute to the design of new encoding schemes. New binary classifier for Helix versus not Helix ( approximately H) for DBNN produces prediction accuracy of 87% when PSSM is used for the input profile. The performance of DBNN binary classifier is comparable to other best prediction methods. The good test results for binary classifiers open a new approach for protein structure prediction with neural networks. Due to the time consuming task of training the neural networks, Pthread and OpenMP are employed to parallelize DBNN in the hyperthreading enabled Intel architecture. Speedup for 16 Pthreads is 4.9 and speedup for 16 OpenMP threads is 4 in the 4 processors shared memory architecture. Both speedup performance of OpenMP and Pthread is superior to that of other research. With the new parallel training algorithm, thousands of amino acids can be processed in reasonable amount of time. Our research also shows that hyperthreading technology for Intel architecture is efficient for parallel biological algorithms.
He, Jianjun; Gu, Hong; Liu, Wenqi
2012-01-01
It is well known that an important step toward understanding the functions of a protein is to determine its subcellular location. Although numerous prediction algorithms have been developed, most of them typically focused on the proteins with only one location. In recent years, researchers have begun to pay attention to the subcellular localization prediction of the proteins with multiple sites. However, almost all the existing approaches have failed to take into account the correlations among the locations caused by the proteins with multiple sites, which may be the important information for improving the prediction accuracy of the proteins with multiple sites. In this paper, a new algorithm which can effectively exploit the correlations among the locations is proposed by using gaussian process model. Besides, the algorithm also can realize optimal linear combination of various feature extraction technologies and could be robust to the imbalanced data set. Experimental results on a human protein data set show that the proposed algorithm is valid and can achieve better performance than the existing approaches.
Evaluation of the novel algorithm of flexible ligand docking with moveable target-protein atoms.
Sulimov, Alexey V; Zheltkov, Dmitry A; Oferkin, Igor V; Kutov, Danil C; Katkova, Ekaterina V; Tyrtyshnikov, Eugene E; Sulimov, Vladimir B
2017-01-01
We present the novel docking algorithm based on the Tensor Train decomposition and the TT-Cross global optimization. The algorithm is applied to the docking problem with flexible ligand and moveable protein atoms. The energy of the protein-ligand complex is calculated in the frame of the MMFF94 force field in vacuum. The grid of precalculated energy potentials of probe ligand atoms in the field of the target protein atoms is not used. The energy of the protein-ligand complex for any given configuration is computed directly with the MMFF94 force field without any fitting parameters. The conformation space of the system coordinates is formed by translations and rotations of the ligand as a whole, by the ligand torsions and also by Cartesian coordinates of the selected target protein atoms. Mobility of protein and ligand atoms is taken into account in the docking process simultaneously and equally. The algorithm is realized in the novel parallel docking SOL-P program and results of its performance for a set of 30 protein-ligand complexes are presented. Dependence of the docking positioning accuracy is investigated as a function of parameters of the docking algorithm and the number of protein moveable atoms. It is shown that mobility of the protein atoms improves docking positioning accuracy. The SOL-P program is able to perform docking of a flexible ligand into the active site of the target protein with several dozens of protein moveable atoms: the native crystallized ligand pose is correctly found as the global energy minimum in the search space with 157 dimensions using 4700 CPU ∗ h at the Lomonosov supercomputer.
A generalized global alignment algorithm.
Huang, Xiaoqiu; Chao, Kun-Mao
2003-01-22
Homologous sequences are sometimes similar over some regions but different over other regions. Homologous sequences have a much lower global similarity if the different regions are much longer than the similar regions. We present a generalized global alignment algorithm for comparing sequences with intermittent similarities, an ordered list of similar regions separated by different regions. A generalized global alignment model is defined to handle sequences with intermittent similarities. A dynamic programming algorithm is designed to compute an optimal general alignment in time proportional to the product of sequence lengths and in space proportional to the sum of sequence lengths. The algorithm is implemented as a computer program named GAP3 (Global Alignment Program Version 3). The generalized global alignment model is validated by experimental results produced with GAP3 on both DNA and protein sequences. The GAP3 program extends the ability of standard global alignment programs to recognize homologous sequences of lower similarity. The GAP3 program is freely available for academic use at http://bioinformatics.iastate.edu/aat/align/align.html.
Normalized Cut Algorithm for Automated Assignment of Protein Domains
NASA Technical Reports Server (NTRS)
Samanta, M. P.; Liang, S.; Zha, H.; Biegel, Bryan A. (Technical Monitor)
2002-01-01
We present a novel computational method for automatic assignment of protein domains from structural data. At the core of our algorithm lies a recently proposed clustering technique that has been very successful for image-partitioning applications. This grap.,l-theory based clustering method uses the notion of a normalized cut to partition. an undirected graph into its strongly-connected components. Computer implementation of our method tested on the standard comparison set of proteins from the literature shows a high success rate (84%), better than most existing alternative In addition, several other features of our algorithm, such as reliance on few adjustable parameters, linear run-time with respect to the size of the protein and reduced complexity compared to other graph-theory based algorithms, would make it an attractive tool for structural biologists.
ClusterViz: A Cytoscape APP for Cluster Analysis of Biological Network.
Wang, Jianxin; Zhong, Jiancheng; Chen, Gang; Li, Min; Wu, Fang-xiang; Pan, Yi
2015-01-01
Cluster analysis of biological networks is one of the most important approaches for identifying functional modules and predicting protein functions. Furthermore, visualization of clustering results is crucial to uncover the structure of biological networks. In this paper, ClusterViz, an APP of Cytoscape 3 for cluster analysis and visualization, has been developed. In order to reduce complexity and enable extendibility for ClusterViz, we designed the architecture of ClusterViz based on the framework of Open Services Gateway Initiative. According to the architecture, the implementation of ClusterViz is partitioned into three modules including interface of ClusterViz, clustering algorithms and visualization and export. ClusterViz fascinates the comparison of the results of different algorithms to do further related analysis. Three commonly used clustering algorithms, FAG-EC, EAGLE and MCODE, are included in the current version. Due to adopting the abstract interface of algorithms in module of the clustering algorithms, more clustering algorithms can be included for the future use. To illustrate usability of ClusterViz, we provided three examples with detailed steps from the important scientific articles, which show that our tool has helped several research teams do their research work on the mechanism of the biological networks.
A global optimization algorithm for protein surface alignment
2010-01-01
Background A relevant problem in drug design is the comparison and recognition of protein binding sites. Binding sites recognition is generally based on geometry often combined with physico-chemical properties of the site since the conformation, size and chemical composition of the protein surface are all relevant for the interaction with a specific ligand. Several matching strategies have been designed for the recognition of protein-ligand binding sites and of protein-protein interfaces but the problem cannot be considered solved. Results In this paper we propose a new method for local structural alignment of protein surfaces based on continuous global optimization techniques. Given the three-dimensional structures of two proteins, the method finds the isometric transformation (rotation plus translation) that best superimposes active regions of two structures. We draw our inspiration from the well-known Iterative Closest Point (ICP) method for three-dimensional (3D) shapes registration. Our main contribution is in the adoption of a controlled random search as a more efficient global optimization approach along with a new dissimilarity measure. The reported computational experience and comparison show viability of the proposed approach. Conclusions Our method performs well to detect similarity in binding sites when this in fact exists. In the future we plan to do a more comprehensive evaluation of the method by considering large datasets of non-redundant proteins and applying a clustering technique to the results of all comparisons to classify binding sites. PMID:20920230
Du, Q S; Ma, Y; Xie, N Z; Huang, R B
2014-01-01
In the design of peptide inhibitors the huge possible variety of the peptide sequences is of high concern. In collaboration with the fast accumulation of the peptide experimental data and database, a statistical method is suggested for peptide inhibitor design. In the two-level peptide prediction network (2L-QSAR) one level is the physicochemical properties of amino acids and the other level is the peptide sequence position. The activity contributions of amino acids are the functions of physicochemical properties and the sequence positions. In the prediction equation two weight coefficient sets {ak} and {bl} are assigned to the physicochemical properties and to the sequence positions, respectively. After the two coefficient sets are optimized based on the experimental data of known peptide inhibitors using the iterative double least square (IDLS) procedure, the coefficients are used to evaluate the bioactivities of new designed peptide inhibitors. The two-level prediction network can be applied to the peptide inhibitor design that may aim for different target proteins, or different positions of a protein. A notable advantage of the two-level statistical algorithm is that there is no need for host protein structural information. It may also provide useful insight into the amino acid properties and the roles of sequence positions.
Parallel Clustering Algorithm for Large-Scale Biological Data Sets
Wang, Minchao; Zhang, Wu; Ding, Wang; Dai, Dongbo; Zhang, Huiran; Xie, Hao; Chen, Luonan; Guo, Yike; Xie, Jiang
2014-01-01
Backgrounds Recent explosion of biological data brings a great challenge for the traditional clustering algorithms. With increasing scale of data sets, much larger memory and longer runtime are required for the cluster identification problems. The affinity propagation algorithm outperforms many other classical clustering algorithms and is widely applied into the biological researches. However, the time and space complexity become a great bottleneck when handling the large-scale data sets. Moreover, the similarity matrix, whose constructing procedure takes long runtime, is required before running the affinity propagation algorithm, since the algorithm clusters data sets based on the similarities between data pairs. Methods Two types of parallel architectures are proposed in this paper to accelerate the similarity matrix constructing procedure and the affinity propagation algorithm. The memory-shared architecture is used to construct the similarity matrix, and the distributed system is taken for the affinity propagation algorithm, because of its large memory size and great computing capacity. An appropriate way of data partition and reduction is designed in our method, in order to minimize the global communication cost among processes. Result A speedup of 100 is gained with 128 cores. The runtime is reduced from serval hours to a few seconds, which indicates that parallel algorithm is capable of handling large-scale data sets effectively. The parallel affinity propagation also achieves a good performance when clustering large-scale gene data (microarray) and detecting families in large protein superfamilies. PMID:24705246
Fenimore, Paul W.; Foley, Brian T.; Bakken, Russell R.; Thurmond, James R.; Yusim, Karina; Yoon, Hyejin; Parker, Michael; Hart, Mary Kate; Dye, John M.; Korber, Bette; Kuiken, Carla
2012-01-01
We report the rational design and in vivo testing of mosaic proteins for a polyvalent pan-filoviral vaccine using a computational strategy designed for the Human Immunodeficiency Virus type 1 (HIV-1) but also appropriate for Hepatitis C virus (HCV) and potentially other diverse viruses. Mosaics are sets of artificial recombinant proteins that are based on natural proteins. The recombinants are computationally selected using a genetic algorithm to optimize the coverage of potential cytotoxic T lymphocyte (CTL) epitopes. Because evolutionary history differs markedly between HIV-1 and filoviruses, we devised an adapted computational technique that is effective for sparsely sampled taxa; our first significant result is that the mosaic technique is effective in creating high-quality mosaic filovirus proteins. The resulting coverage of potential epitopes across filovirus species is superior to coverage by any natural variants, including current vaccine strains with demonstrated cross-reactivity. The mosaic cocktails are also robust: mosaics substantially outperformed natural strains when computationally tested against poorly sampled species and more variable genes. Furthermore, in a computational comparison of cross-reactive potential a design constructed prior to the Bundibugyo outbreak performed nearly as well against all species as an updated design that included Bundibugyo. These points suggest that the mosaic designs would be more resilient than natural-variant vaccines against future Ebola outbreaks dominated by novel viral variants. We demonstrate in vivo immunogenicity and protection against a heterologous challenge in a mouse model. This design work delineates the likely requirements and limitations on broadly-protective filoviral CTL vaccines. PMID:23056184
Towards human-computer synergetic analysis of large-scale biological data.
Singh, Rahul; Yang, Hui; Dalziel, Ben; Asarnow, Daniel; Murad, William; Foote, David; Gormley, Matthew; Stillman, Jonathan; Fisher, Susan
2013-01-01
Advances in technology have led to the generation of massive amounts of complex and multifarious biological data in areas ranging from genomics to structural biology. The volume and complexity of such data leads to significant challenges in terms of its analysis, especially when one seeks to generate hypotheses or explore the underlying biological processes. At the state-of-the-art, the application of automated algorithms followed by perusal and analysis of the results by an expert continues to be the predominant paradigm for analyzing biological data. This paradigm works well in many problem domains. However, it also is limiting, since domain experts are forced to apply their instincts and expertise such as contextual reasoning, hypothesis formulation, and exploratory analysis after the algorithm has produced its results. In many areas where the organization and interaction of the biological processes is poorly understood and exploratory analysis is crucial, what is needed is to integrate domain expertise during the data analysis process and use it to drive the analysis itself. In context of the aforementioned background, the results presented in this paper describe advancements along two methodological directions. First, given the context of biological data, we utilize and extend a design approach called experiential computing from multimedia information system design. This paradigm combines information visualization and human-computer interaction with algorithms for exploratory analysis of large-scale and complex data. In the proposed approach, emphasis is laid on: (1) allowing users to directly visualize, interact, experience, and explore the data through interoperable visualization-based and algorithmic components, (2) supporting unified query and presentation spaces to facilitate experimentation and exploration, (3) providing external contextual information by assimilating relevant supplementary data, and (4) encouraging user-directed information visualization, data exploration, and hypotheses formulation. Second, to illustrate the proposed design paradigm and measure its efficacy, we describe two prototype web applications. The first, called XMAS (Experiential Microarray Analysis System) is designed for analysis of time-series transcriptional data. The second system, called PSPACE (Protein Space Explorer) is designed for holistic analysis of structural and structure-function relationships using interactive low-dimensional maps of the protein structure space. Both these systems promote and facilitate human-computer synergy, where cognitive elements such as domain knowledge, contextual reasoning, and purpose-driven exploration, are integrated with a host of powerful algorithmic operations that support large-scale data analysis, multifaceted data visualization, and multi-source information integration. The proposed design philosophy, combines visualization, algorithmic components and cognitive expertise into a seamless processing-analysis-exploration framework that facilitates sense-making, exploration, and discovery. Using XMAS, we present case studies that analyze transcriptional data from two highly complex domains: gene expression in the placenta during human pregnancy and reaction of marine organisms to heat stress. With PSPACE, we demonstrate how complex structure-function relationships can be explored. These results demonstrate the novelty, advantages, and distinctions of the proposed paradigm. Furthermore, the results also highlight how domain insights can be combined with algorithms to discover meaningful knowledge and formulate evidence-based hypotheses during the data analysis process. Finally, user studies against comparable systems indicate that both XMAS and PSPACE deliver results with better interpretability while placing lower cognitive loads on the users. XMAS is available at: http://tintin.sfsu.edu:8080/xmas. PSPACE is available at: http://pspace.info/.
Towards human-computer synergetic analysis of large-scale biological data
2013-01-01
Background Advances in technology have led to the generation of massive amounts of complex and multifarious biological data in areas ranging from genomics to structural biology. The volume and complexity of such data leads to significant challenges in terms of its analysis, especially when one seeks to generate hypotheses or explore the underlying biological processes. At the state-of-the-art, the application of automated algorithms followed by perusal and analysis of the results by an expert continues to be the predominant paradigm for analyzing biological data. This paradigm works well in many problem domains. However, it also is limiting, since domain experts are forced to apply their instincts and expertise such as contextual reasoning, hypothesis formulation, and exploratory analysis after the algorithm has produced its results. In many areas where the organization and interaction of the biological processes is poorly understood and exploratory analysis is crucial, what is needed is to integrate domain expertise during the data analysis process and use it to drive the analysis itself. Results In context of the aforementioned background, the results presented in this paper describe advancements along two methodological directions. First, given the context of biological data, we utilize and extend a design approach called experiential computing from multimedia information system design. This paradigm combines information visualization and human-computer interaction with algorithms for exploratory analysis of large-scale and complex data. In the proposed approach, emphasis is laid on: (1) allowing users to directly visualize, interact, experience, and explore the data through interoperable visualization-based and algorithmic components, (2) supporting unified query and presentation spaces to facilitate experimentation and exploration, (3) providing external contextual information by assimilating relevant supplementary data, and (4) encouraging user-directed information visualization, data exploration, and hypotheses formulation. Second, to illustrate the proposed design paradigm and measure its efficacy, we describe two prototype web applications. The first, called XMAS (Experiential Microarray Analysis System) is designed for analysis of time-series transcriptional data. The second system, called PSPACE (Protein Space Explorer) is designed for holistic analysis of structural and structure-function relationships using interactive low-dimensional maps of the protein structure space. Both these systems promote and facilitate human-computer synergy, where cognitive elements such as domain knowledge, contextual reasoning, and purpose-driven exploration, are integrated with a host of powerful algorithmic operations that support large-scale data analysis, multifaceted data visualization, and multi-source information integration. Conclusions The proposed design philosophy, combines visualization, algorithmic components and cognitive expertise into a seamless processing-analysis-exploration framework that facilitates sense-making, exploration, and discovery. Using XMAS, we present case studies that analyze transcriptional data from two highly complex domains: gene expression in the placenta during human pregnancy and reaction of marine organisms to heat stress. With PSPACE, we demonstrate how complex structure-function relationships can be explored. These results demonstrate the novelty, advantages, and distinctions of the proposed paradigm. Furthermore, the results also highlight how domain insights can be combined with algorithms to discover meaningful knowledge and formulate evidence-based hypotheses during the data analysis process. Finally, user studies against comparable systems indicate that both XMAS and PSPACE deliver results with better interpretability while placing lower cognitive loads on the users. XMAS is available at: http://tintin.sfsu.edu:8080/xmas. PSPACE is available at: http://pspace.info/. PMID:24267485
Algorithms for database-dependent search of MS/MS data.
Matthiesen, Rune
2013-01-01
The frequent used bottom-up strategy for identification of proteins and their associated modifications generate nowadays typically thousands of MS/MS spectra that normally are matched automatically against a protein sequence database. Search engines that take as input MS/MS spectra and a protein sequence database are referred as database-dependent search engines. Many programs both commercial and freely available exist for database-dependent search of MS/MS spectra and most of the programs have excellent user documentation. The aim here is therefore to outline the algorithm strategy behind different search engines rather than providing software user manuals. The process of database-dependent search can be divided into search strategy, peptide scoring, protein scoring, and finally protein inference. Most efforts in the literature have been put in to comparing results from different software rather than discussing the underlining algorithms. Such practical comparisons can be cluttered by suboptimal implementation and the observed differences are frequently caused by software parameters settings which have not been set proper to allow even comparison. In other words an algorithmic idea can still be worth considering even if the software implementation has been demonstrated to be suboptimal. The aim in this chapter is therefore to split the algorithms for database-dependent searching of MS/MS data into the above steps so that the different algorithmic ideas become more transparent and comparable. Most search engines provide good implementations of the first three data analysis steps mentioned above, whereas the final step of protein inference are much less developed for most search engines and is in many cases performed by an external software. The final part of this chapter illustrates how protein inference is built into the VEMS search engine and discusses a stand-alone program SIR for protein inference that can import a Mascot search result.
A Novel Algorithm for Detecting Protein Complexes with the Breadth First Search
Tang, Xiwei; Wang, Jianxin; Li, Min; He, Yiming; Pan, Yi
2014-01-01
Most biological processes are carried out by protein complexes. A substantial number of false positives of the protein-protein interaction (PPI) data can compromise the utility of the datasets for complexes reconstruction. In order to reduce the impact of such discrepancies, a number of data integration and affinity scoring schemes have been devised. The methods encode the reliabilities (confidence) of physical interactions between pairs of proteins. The challenge now is to identify novel and meaningful protein complexes from the weighted PPI network. To address this problem, a novel protein complex mining algorithm ClusterBFS (Cluster with Breadth-First Search) is proposed. Based on the weighted density, ClusterBFS detects protein complexes of the weighted network by the breadth first search algorithm, which originates from a given seed protein used as starting-point. The experimental results show that ClusterBFS performs significantly better than the other computational approaches in terms of the identification of protein complexes. PMID:24818139
2014-01-01
Background Network-based learning algorithms for automated function prediction (AFP) are negatively affected by the limited coverage of experimental data and limited a priori known functional annotations. As a consequence their application to model organisms is often restricted to well characterized biological processes and pathways, and their effectiveness with poorly annotated species is relatively limited. A possible solution to this problem might consist in the construction of big networks including multiple species, but this in turn poses challenging computational problems, due to the scalability limitations of existing algorithms and the main memory requirements induced by the construction of big networks. Distributed computation or the usage of big computers could in principle respond to these issues, but raises further algorithmic problems and require resources not satisfiable with simple off-the-shelf computers. Results We propose a novel framework for scalable network-based learning of multi-species protein functions based on both a local implementation of existing algorithms and the adoption of innovative technologies: we solve “locally” the AFP problem, by designing “vertex-centric” implementations of network-based algorithms, but we do not give up thinking “globally” by exploiting the overall topology of the network. This is made possible by the adoption of secondary memory-based technologies that allow the efficient use of the large memory available on disks, thus overcoming the main memory limitations of modern off-the-shelf computers. This approach has been applied to the analysis of a large multi-species network including more than 300 species of bacteria and to a network with more than 200,000 proteins belonging to 13 Eukaryotic species. To our knowledge this is the first work where secondary-memory based network analysis has been applied to multi-species function prediction using biological networks with hundreds of thousands of proteins. Conclusions The combination of these algorithmic and technological approaches makes feasible the analysis of large multi-species networks using ordinary computers with limited speed and primary memory, and in perspective could enable the analysis of huge networks (e.g. the whole proteomes available in SwissProt), using well-equipped stand-alone machines. PMID:24843788
Schmidt Am Busch, Marcel; Sedano, Audrey; Simonson, Thomas
2010-05-05
Protein fold recognition usually relies on a statistical model of each fold; each model is constructed from an ensemble of natural sequences belonging to that fold. A complementary strategy may be to employ sequence ensembles produced by computational protein design. Designed sequences can be more diverse than natural sequences, possibly avoiding some limitations of experimental databases. WE EXPLORE THIS STRATEGY FOR FOUR SCOP FAMILIES: Small Kunitz-type inhibitors (SKIs), Interleukin-8 chemokines, PDZ domains, and large Caspase catalytic subunits, represented by 43 structures. An automated procedure is used to redesign the 43 proteins. We use the experimental backbones as fixed templates in the folded state and a molecular mechanics model to compute the interaction energies between sidechain and backbone groups. Calculations are done with the Proteins@Home volunteer computing platform. A heuristic algorithm is used to scan the sequence and conformational space, yielding 200,000-300,000 sequences per backbone template. The results confirm and generalize our earlier study of SH2 and SH3 domains. The designed sequences ressemble moderately-distant, natural homologues of the initial templates; e.g., the SUPERFAMILY, profile Hidden-Markov Model library recognizes 85% of the low-energy sequences as native-like. Conversely, Position Specific Scoring Matrices derived from the sequences can be used to detect natural homologues within the SwissProt database: 60% of known PDZ domains are detected and around 90% of known SKIs and chemokines. Energy components and inter-residue correlations are analyzed and ways to improve the method are discussed. For some families, designed sequences can be a useful complement to experimental ones for homologue searching. However, improved tools are needed to extract more information from the designed profiles before the method can be of general use.
muBLASTP: database-indexed protein sequence search on multicore CPUs.
Zhang, Jing; Misra, Sanchit; Wang, Hao; Feng, Wu-Chun
2016-11-04
The Basic Local Alignment Search Tool (BLAST) is a fundamental program in the life sciences that searches databases for sequences that are most similar to a query sequence. Currently, the BLAST algorithm utilizes a query-indexed approach. Although many approaches suggest that sequence search with a database index can achieve much higher throughput (e.g., BLAT, SSAHA, and CAFE), they cannot deliver the same level of sensitivity as the query-indexed BLAST, i.e., NCBI BLAST, or they can only support nucleotide sequence search, e.g., MegaBLAST. Due to different challenges and characteristics between query indexing and database indexing, the existing techniques for query-indexed search cannot be used into database indexed search. muBLASTP, a novel database-indexed BLAST for protein sequence search, delivers identical hits returned to NCBI BLAST. On Intel Haswell multicore CPUs, for a single query, the single-threaded muBLASTP achieves up to a 4.41-fold speedup for alignment stages, and up to a 1.75-fold end-to-end speedup over single-threaded NCBI BLAST. For a batch of queries, the multithreaded muBLASTP achieves up to a 5.7-fold speedups for alignment stages, and up to a 4.56-fold end-to-end speedup over multithreaded NCBI BLAST. With a newly designed index structure for protein database and associated optimizations in BLASTP algorithm, we re-factored BLASTP algorithm for modern multicore processors that achieves much higher throughput with acceptable memory footprint for the database index.
Automated sequence-specific protein NMR assignment using the memetic algorithm MATCH.
Volk, Jochen; Herrmann, Torsten; Wüthrich, Kurt
2008-07-01
MATCH (Memetic Algorithm and Combinatorial Optimization Heuristics) is a new memetic algorithm for automated sequence-specific polypeptide backbone NMR assignment of proteins. MATCH employs local optimization for tracing partial sequence-specific assignments within a global, population-based search environment, where the simultaneous application of local and global optimization heuristics guarantees high efficiency and robustness. MATCH thus makes combined use of the two predominant concepts in use for automated NMR assignment of proteins. Dynamic transition and inherent mutation are new techniques that enable automatic adaptation to variable quality of the experimental input data. The concept of dynamic transition is incorporated in all major building blocks of the algorithm, where it enables switching between local and global optimization heuristics at any time during the assignment process. Inherent mutation restricts the intrinsically required randomness of the evolutionary algorithm to those regions of the conformation space that are compatible with the experimental input data. Using intact and artificially deteriorated APSY-NMR input data of proteins, MATCH performed sequence-specific resonance assignment with high efficiency and robustness.
Santi, Melissa; Maccari, Giuseppe; Mereghetti, Paolo; Voliani, Valerio; Rocchiccioli, Silvia; Ucciferri, Nadia; Luin, Stefano; Signore, Giovanni
2017-02-15
The transferrin receptor (TfR) is a promising target in cancer therapy owing to its overexpression in most solid tumors and on the blood-brain barrier. Nanostructures chemically derivatized with transferrin are employed in TfR targeting but often lose their functionality upon injection in the bloodstream. As an alternative strategy, we rationally designed a peptide coating able to bind transferrin on suitable pockets not involved in binding to TfR or iron by using an iterative multiscale-modeling approach coupled with quantitative structure-activity and relationship (QSAR) analysis and evolutionary algorithms. We tested that selected sequences have low aspecific protein adsorption and high binding energy toward transferrin, and one of them is efficiently internalized in cells with a transferrin-dependent pathway. Furthermore, it promotes transferrin-mediated endocytosis of gold nanoparticles by modifying their protein corona and promoting oriented adsorption of transferrin. This strategy leads to highly effective nanostructures, potentially useful in diagnostic and therapeutic applications, which exploit (and do not suffer) the protein solvation for achieving a better targeting.
Bio-inspired algorithms applied to molecular docking simulations.
Heberlé, G; de Azevedo, W F
2011-01-01
Nature as a source of inspiration has been shown to have a great beneficial impact on the development of new computational methodologies. In this scenario, analyses of the interactions between a protein target and a ligand can be simulated by biologically inspired algorithms (BIAs). These algorithms mimic biological systems to create new paradigms for computation, such as neural networks, evolutionary computing, and swarm intelligence. This review provides a description of the main concepts behind BIAs applied to molecular docking simulations. Special attention is devoted to evolutionary algorithms, guided-directed evolutionary algorithms, and Lamarckian genetic algorithms. Recent applications of these methodologies to protein targets identified in the Mycobacterium tuberculosis genome are described.
Identifying Haemophilus haemolyticus and Haemophilus influenzae by SYBR Green real-time PCR.
Latham, Roger; Zhang, Bowen; Tristram, Stephen
2015-05-01
SYBR Green real time PCR assays for protein D (hpd), fuculose kinase (fucK) and [Cu, Zn]-superoxide dismutase (sodC) were designed for use in an algorithm for the identification of Haemophilus influenzae and H. haemolyticus. When tested on 127 H. influenzae and 60 H. haemolyticus all isolates were identified correctly. Crown Copyright © 2015. Published by Elsevier B.V. All rights reserved.
Sasse, Alexander; de Vries, Sjoerd J; Schindler, Christina E M; de Beauchêne, Isaure Chauvot; Zacharias, Martin
2017-01-01
Protein-protein docking protocols aim to predict the structures of protein-protein complexes based on the structure of individual partners. Docking protocols usually include several steps of sampling, clustering, refinement and re-scoring. The scoring step is one of the bottlenecks in the performance of many state-of-the-art protocols. The performance of scoring functions depends on the quality of the generated structures and its coupling to the sampling algorithm. A tool kit, GRADSCOPT (GRid Accelerated Directly SCoring OPTimizing), was designed to allow rapid development and optimization of different knowledge-based scoring potentials for specific objectives in protein-protein docking. Different atomistic and coarse-grained potentials can be created by a grid-accelerated directly scoring dependent Monte-Carlo annealing or by a linear regression optimization. We demonstrate that the scoring functions generated by our approach are similar to or even outperform state-of-the-art scoring functions for predicting near-native solutions. Of additional importance, we find that potentials specifically trained to identify the native bound complex perform rather poorly on identifying acceptable or medium quality (near-native) solutions. In contrast, atomistic long-range contact potentials can increase the average fraction of near-native poses by up to a factor 2.5 in the best scored 1% decoys (compared to existing scoring), emphasizing the need of specific docking potentials for different steps in the docking protocol.
Salt bridges: geometrically specific, designable interactions.
Donald, Jason E; Kulp, Daniel W; DeGrado, William F
2011-03-01
Salt bridges occur frequently in proteins, providing conformational specificity and contributing to molecular recognition and catalysis. We present a comprehensive analysis of these interactions in protein structures by surveying a large database of protein structures. Salt bridges between Asp or Glu and His, Arg, or Lys display extremely well-defined geometric preferences. Several previously observed preferences are confirmed, and others that were previously unrecognized are discovered. Salt bridges are explored for their preferences for different separations in sequence and in space, geometric preferences within proteins and at protein-protein interfaces, co-operativity in networked salt bridges, inclusion within metal-binding sites, preference for acidic electrons, apparent conformational side chain entropy reduction on formation, and degree of burial. Salt bridges occur far more frequently between residues at close than distant sequence separations, but, at close distances, there remain strong preferences for salt bridges at specific separations. Specific types of complex salt bridges, involving three or more members, are also discovered. As we observe a strong relationship between the propensity to form a salt bridge and the placement of salt-bridging residues in protein sequences, we discuss the role that salt bridges might play in kinetically influencing protein folding and thermodynamically stabilizing the native conformation. We also develop a quantitative method to select appropriate crystal structure resolution and B-factor cutoffs. Detailed knowledge of these geometric and sequence dependences should aid de novo design and prediction algorithms. Copyright © 2010 Wiley-Liss, Inc.
Geometric Detection Algorithms for Cavities on Protein Surfaces in Molecular Graphics: A Survey
Simões, Tiago; Lopes, Daniel; Dias, Sérgio; Fernandes, Francisco; Pereira, João; Jorge, Joaquim; Bajaj, Chandrajit; Gomes, Abel
2017-01-01
Detecting and analyzing protein cavities provides significant information about active sites for biological processes (e.g., protein-protein or protein-ligand binding) in molecular graphics and modeling. Using the three-dimensional structure of a given protein (i.e., atom types and their locations in 3D) as retrieved from a PDB (Protein Data Bank) file, it is now computationally viable to determine a description of these cavities. Such cavities correspond to pockets, clefts, invaginations, voids, tunnels, channels, and grooves on the surface of a given protein. In this work, we survey the literature on protein cavity computation and classify algorithmic approaches into three categories: evolution-based, energy-based, and geometry-based. Our survey focuses on geometric algorithms, whose taxonomy is extended to include not only sphere-, grid-, and tessellation-based methods, but also surface-based, hybrid geometric, consensus, and time-varying methods. Finally, we detail those techniques that have been customized for GPU (Graphics Processing Unit) computing. PMID:29520122
Improved modeling of side-chain--base interactions and plasticity in protein--DNA interface design.
Thyme, Summer B; Baker, David; Bradley, Philip
2012-06-08
Combinatorial sequence optimization for protein design requires libraries of discrete side-chain conformations. The discreteness of these libraries is problematic, particularly for long, polar side chains, since favorable interactions can be missed. Previously, an approach to loop remodeling where protein backbone movement is directed by side-chain rotamers predicted to form interactions previously observed in native complexes (termed "motifs") was described. Here, we show how such motif libraries can be incorporated into combinatorial sequence optimization protocols and improve native complex recapitulation. Guided by the motif rotamer searches, we made improvements to the underlying energy function, increasing recapitulation of native interactions. To further test the methods, we carried out a comprehensive experimental scan of amino acid preferences in the I-AniI protein-DNA interface and found that many positions tolerated multiple amino acids. This sequence plasticity is not observed in the computational results because of the fixed-backbone approximation of the model. We improved modeling of this diversity by introducing DNA flexibility and reducing the convergence of the simulated annealing algorithm that drives the design process. In addition to serving as a benchmark, this extensive experimental data set provides insight into the types of interactions essential to maintain the function of this potential gene therapy reagent. Published by Elsevier Ltd.
Optical Detection of Degraded Therapeutic Proteins.
Herrington, William F; Singh, Gajendra P; Wu, Di; Barone, Paul W; Hancock, William; Ram, Rajeev J
2018-03-23
The quality of therapeutic proteins such as hormones, subunit and conjugate vaccines, and antibodies is critical to the safety and efficacy of modern medicine. Identifying malformed proteins at the point-of-care can prevent adverse immune reactions in patients; this is of special concern when there is an insecure supply chain resulting in the delivery of degraded, or even counterfeit, drug product. Identification of degraded protein, for example human growth hormone, is demonstrated by applying automated anomaly detection algorithms. Detection of the degraded protein differs from previous applications of machine-learning and classification to spectral analysis: only example spectra of genuine, high-quality drug products are used to construct the classifier. The algorithm is tested on Raman spectra acquired on protein dilutions typical of formulated drug product and at sample volumes of 25 µL, below the typical overfill (waste) volumes present in vials of injectable drug product. The algorithm is demonstrated to correctly classify anomalous recombinant human growth hormone (rhGH) with 92% sensitivity and 98% specificity even when the algorithm has only previously encountered high-quality drug product.
An in-silico method for identifying aggregation rate enhancer and mitigator mutations in proteins.
Rawat, Puneet; Kumar, Sandeep; Michael Gromiha, M
2018-06-24
Newly synthesized polypeptides must pass stringent quality controls in cells to ensure appropriate folding and function. However, mutations, environmental stresses and aging can reduce efficiencies of these controls, leading to accumulation of protein aggregates, amyloid fibrils and plaques. In-vitro experiments have shown that even single amino acid substitutions can drastically enhance or mitigate protein aggregation kinetics. In this work, we have collected a dataset of 220 unique mutations in 25 proteins and classified them as enhancers or mitigators on the basis of their effect on protein aggregation rate. The data were analyzed via machine learning to identify features capable of distinguishing between aggregation rate enhancers and mitigators. Our initial Support Vector Machine (SVM) model separated such mutations with an overall accuracy of 69%. When local secondary structures at the mutation sites were considered, the accuracies further improved by 13-15%. The machine-learnt features are distinct for each secondary structure class at mutation sites. Protein stability and flexibility changes are important features for mutations in α-helices. β-strand propensity, polarity and charge become important when mutations occur in β-strands and ability to form secondary structure, helical tendency and aggregation propensity are important for mutations lying in coils. These results have been incorporated into a sequence-based algorithm (available at http://www.iitm.ac.in/bioinfo/aggrerate-disc/) capable of predicting whether a mutation will enhance or mitigate a protein's aggregation rate. This algorithm will find several applications towards understanding protein aggregation in human diseases, enable in-silico optimization of biopharmaceuticals and enzymes for improved biophysical attributes and de novo design of bio-nanomaterials. Copyright © 2018. Published by Elsevier B.V.
CaMELS: In silico prediction of calmodulin binding proteins and their binding sites.
Abbasi, Wajid Arshad; Asif, Amina; Andleeb, Saiqa; Minhas, Fayyaz Ul Amir Afsar
2017-09-01
Due to Ca 2+ -dependent binding and the sequence diversity of Calmodulin (CaM) binding proteins, identifying CaM interactions and binding sites in the wet-lab is tedious and costly. Therefore, computational methods for this purpose are crucial to the design of such wet-lab experiments. We present an algorithm suite called CaMELS (CalModulin intEraction Learning System) for predicting proteins that interact with CaM as well as their binding sites using sequence information alone. CaMELS offers state of the art accuracy for both CaM interaction and binding site prediction and can aid biologists in studying CaM binding proteins. For CaM interaction prediction, CaMELS uses protein sequence features coupled with a large-margin classifier. CaMELS models the binding site prediction problem using multiple instance machine learning with a custom optimization algorithm which allows more effective learning over imprecisely annotated CaM-binding sites during training. CaMELS has been extensively benchmarked using a variety of data sets, mutagenic studies, proteome-wide Gene Ontology enrichment analyses and protein structures. Our experiments indicate that CaMELS outperforms simple motif-based search and other existing methods for interaction and binding site prediction. We have also found that the whole sequence of a protein, rather than just its binding site, is important for predicting its interaction with CaM. Using the machine learning model in CaMELS, we have identified important features of protein sequences for CaM interaction prediction as well as characteristic amino acid sub-sequences and their relative position for identifying CaM binding sites. Python code for training and evaluating CaMELS together with a webserver implementation is available at the URL: http://faculty.pieas.edu.pk/fayyaz/software.html#camels. © 2017 Wiley Periodicals, Inc.
Semantic integration to identify overlapping functional modules in protein interaction networks
Cho, Young-Rae; Hwang, Woochang; Ramanathan, Murali; Zhang, Aidong
2007-01-01
Background The systematic analysis of protein-protein interactions can enable a better understanding of cellular organization, processes and functions. Functional modules can be identified from the protein interaction networks derived from experimental data sets. However, these analyses are challenging because of the presence of unreliable interactions and the complex connectivity of the network. The integration of protein-protein interactions with the data from other sources can be leveraged for improving the effectiveness of functional module detection algorithms. Results We have developed novel metrics, called semantic similarity and semantic interactivity, which use Gene Ontology (GO) annotations to measure the reliability of protein-protein interactions. The protein interaction networks can be converted into a weighted graph representation by assigning the reliability values to each interaction as a weight. We presented a flow-based modularization algorithm to efficiently identify overlapping modules in the weighted interaction networks. The experimental results show that the semantic similarity and semantic interactivity of interacting pairs were positively correlated with functional co-occurrence. The effectiveness of the algorithm for identifying modules was evaluated using functional categories from the MIPS database. We demonstrated that our algorithm had higher accuracy compared to other competing approaches. Conclusion The integration of protein interaction networks with GO annotation data and the capability of detecting overlapping modules substantially improve the accuracy of module identification. PMID:17650343
An Algorithm for Protein Helix Assignment Using Helix Geometry
Cao, Chen; Xu, Shutan; Wang, Lincong
2015-01-01
Helices are one of the most common and were among the earliest recognized secondary structure elements in proteins. The assignment of helices in a protein underlies the analysis of its structure and function. Though the mathematical expression for a helical curve is simple, no previous assignment programs have used a genuine helical curve as a model for helix assignment. In this paper we present a two-step assignment algorithm. The first step searches for a series of bona fide helical curves each one best fits the coordinates of four successive backbone Cα atoms. The second step uses the best fit helical curves as input to make helix assignment. The application to the protein structures in the PDB (protein data bank) proves that the algorithm is able to assign accurately not only regular α-helix but also 310 and π helices as well as their left-handed versions. One salient feature of the algorithm is that the assigned helices are structurally more uniform than those by the previous programs. The structural uniformity should be useful for protein structure classification and prediction while the accurate assignment of a helix to a particular type underlies structure-function relationship in proteins. PMID:26132394
A linear programming model for protein inference problem in shotgun proteomics.
Huang, Ting; He, Zengyou
2012-11-15
Assembling peptides identified from tandem mass spectra into a list of proteins, referred to as protein inference, is an important issue in shotgun proteomics. The objective of protein inference is to find a subset of proteins that are truly present in the sample. Although many methods have been proposed for protein inference, several issues such as peptide degeneracy still remain unsolved. In this article, we present a linear programming model for protein inference. In this model, we use a transformation of the joint probability that each peptide/protein pair is present in the sample as the variable. Then, both the peptide probability and protein probability can be expressed as a formula in terms of the linear combination of these variables. Based on this simple fact, the protein inference problem is formulated as an optimization problem: minimize the number of proteins with non-zero probabilities under the constraint that the difference between the calculated peptide probability and the peptide probability generated from peptide identification algorithms should be less than some threshold. This model addresses the peptide degeneracy issue by forcing some joint probability variables involving degenerate peptides to be zero in a rigorous manner. The corresponding inference algorithm is named as ProteinLP. We test the performance of ProteinLP on six datasets. Experimental results show that our method is competitive with the state-of-the-art protein inference algorithms. The source code of our algorithm is available at: https://sourceforge.net/projects/prolp/. zyhe@dlut.edu.cn. Supplementary data are available at Bioinformatics Online.
Energy Fluctuations Shape Free Energy of Nonspecific Biomolecular Interactions
NASA Astrophysics Data System (ADS)
Elkin, Michael; Andre, Ingemar; Lukatsky, David B.
2012-01-01
Understanding design principles of biomolecular recognition is a key question of molecular biology. Yet the enormous complexity and diversity of biological molecules hamper the efforts to gain a predictive ability for the free energy of protein-protein, protein-DNA, and protein-RNA binding. Here, using a variant of the Derrida model, we predict that for a large class of biomolecular interactions, it is possible to accurately estimate the relative free energy of binding based on the fluctuation properties of their energy spectra, even if a finite number of the energy levels is known. We show that the free energy of the system possessing a wider binding energy spectrum is almost surely lower compared with the system possessing a narrower energy spectrum. Our predictions imply that low-affinity binding scores, usually wasted in protein-protein and protein-DNA docking algorithms, can be efficiently utilized to compute the free energy. Using the results of Rosetta docking simulations of protein-protein interactions from Andre et al. (Proc. Natl. Acad. Sci. USA 105:16148, 2008), we demonstrate the power of our predictions.
Zhang, Guangya; Ge, Huihua
2013-10-01
Understanding of proteins adaptive to hypersaline environment and identifying them is a challenging task and would help to design stable proteins. Here, we have systematically analyzed the normalized amino acid compositions of 2121 halophilic and 2400 non-halophilic proteins. The results showed that halophilic protein contained more Asp at the expense of Lys, Ile, Cys and Met, fewer small and hydrophobic residues, and showed a large excess of acidic over basic amino acids. Then, we introduce a support vector machine method to discriminate the halophilic and non-halophilic proteins, by using a novel Pearson VII universal function based kernel. In the three validation check methods, it achieved an overall accuracy of 97.7%, 91.7% and 86.9% and outperformed other machine learning algorithms. We also address the influence of protein size on prediction accuracy and found the worse performance for small size proteins might be some significant residues (Cys and Lys) were missing in the proteins. Copyright © 2013 The Authors. Published by Elsevier Ltd.. All rights reserved.
Automatic classification of protein structures relying on similarities between alignments
2012-01-01
Background Identification of protein structural cores requires isolation of sets of proteins all sharing a same subset of structural motifs. In the context of an ever growing number of available 3D protein structures, standard and automatic clustering algorithms require adaptations so as to allow for efficient identification of such sets of proteins. Results When considering a pair of 3D structures, they are stated as similar or not according to the local similarities of their matching substructures in a structural alignment. This binary relation can be represented in a graph of similarities where a node represents a 3D protein structure and an edge states that two 3D protein structures are similar. Therefore, classifying proteins into structural families can be viewed as a graph clustering task. Unfortunately, because such a graph encodes only pairwise similarity information, clustering algorithms may include in the same cluster a subset of 3D structures that do not share a common substructure. In order to overcome this drawback we first define a ternary similarity on a triple of 3D structures as a constraint to be satisfied by the graph of similarities. Such a ternary constraint takes into account similarities between pairwise alignments, so as to ensure that the three involved protein structures do have some common substructure. We propose hereunder a modification algorithm that eliminates edges from the original graph of similarities and gives a reduced graph in which no ternary constraints are violated. Our approach is then first to build a graph of similarities, then to reduce the graph according to the modification algorithm, and finally to apply to the reduced graph a standard graph clustering algorithm. Such method was used for classifying ASTRAL-40 non-redundant protein domains, identifying significant pairwise similarities with Yakusa, a program devised for rapid 3D structure alignments. Conclusions We show that filtering similarities prior to standard graph based clustering process by applying ternary similarity constraints i) improves the separation of proteins of different classes and consequently ii) improves the classification quality of standard graph based clustering algorithms according to the reference classification SCOP. PMID:22974051
A novel algorithm for validating peptide identification from a shotgun proteomics search engine.
Jian, Ling; Niu, Xinnan; Xia, Zhonghang; Samir, Parimal; Sumanasekera, Chiranthani; Mu, Zheng; Jennings, Jennifer L; Hoek, Kristen L; Allos, Tara; Howard, Leigh M; Edwards, Kathryn M; Weil, P Anthony; Link, Andrew J
2013-03-01
Liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS) has revolutionized the proteomics analysis of complexes, cells, and tissues. In a typical proteomic analysis, the tandem mass spectra from a LC-MS/MS experiment are assigned to a peptide by a search engine that compares the experimental MS/MS peptide data to theoretical peptide sequences in a protein database. The peptide spectra matches are then used to infer a list of identified proteins in the original sample. However, the search engines often fail to distinguish between correct and incorrect peptides assignments. In this study, we designed and implemented a novel algorithm called De-Noise to reduce the number of incorrect peptide matches and maximize the number of correct peptides at a fixed false discovery rate using a minimal number of scoring outputs from the SEQUEST search engine. The novel algorithm uses a three-step process: data cleaning, data refining through a SVM-based decision function, and a final data refining step based on proteolytic peptide patterns. Using proteomics data generated on different types of mass spectrometers, we optimized the De-Noise algorithm on the basis of the resolution and mass accuracy of the mass spectrometer employed in the LC-MS/MS experiment. Our results demonstrate De-Noise improves peptide identification compared to other methods used to process the peptide sequence matches assigned by SEQUEST. Because De-Noise uses a limited number of scoring attributes, it can be easily implemented with other search engines.
Protein complex prediction for large protein protein interaction networks with the Core&Peel method.
Pellegrini, Marco; Baglioni, Miriam; Geraci, Filippo
2016-11-08
Biological networks play an increasingly important role in the exploration of functional modularity and cellular organization at a systemic level. Quite often the first tools used to analyze these networks are clustering algorithms. We concentrate here on the specific task of predicting protein complexes (PC) in large protein-protein interaction networks (PPIN). Currently, many state-of-the-art algorithms work well for networks of small or moderate size. However, their performance on much larger networks, which are becoming increasingly common in modern proteome-wise studies, needs to be re-assessed. We present a new fast algorithm for clustering large sparse networks: Core&Peel, which runs essentially in time and storage O(a(G)m+n) for a network G of n nodes and m arcs, where a(G) is the arboricity of G (which is roughly proportional to the maximum average degree of any induced subgraph in G). We evaluated Core&Peel on five PPI networks of large size and one of medium size from both yeast and homo sapiens, comparing its performance against those of ten state-of-the-art methods. We demonstrate that Core&Peel consistently outperforms the ten competitors in its ability to identify known protein complexes and in the functional coherence of its predictions. Our method is remarkably robust, being quite insensible to the injection of random interactions. Core&Peel is also empirically efficient attaining the second best running time over large networks among the tested algorithms. Our algorithm Core&Peel pushes forward the state-of the-art in PPIN clustering providing an algorithmic solution with polynomial running time that attains experimentally demonstrable good output quality and speed on challenging large real networks.
Architecture of the 99 bp DNA-six-protein regulatory complex of the lambda att site.
Sun, Xingmin; Mierke, Dale F; Biswas, Tapan; Lee, Sang Yeol; Landy, Arthur; Radman-Livaja, Marta
2006-11-17
The highly directional and tightly regulated recombination reaction used to site-specifically excise the bacteriophage lambda chromosome out of its E. coli host chromosome requires the binding of six sequence-specific proteins to a 99 bp segment of the phage att site. To gain structural insights into this recombination pathway, we measured 27 FRET distances between eight points on the 99 bp regulatory DNA bound with all six proteins. Triangulation of these distances using a metric matrix distance-geometry algorithm provided coordinates for these eight points. The resulting path for the protein-bound regulatory DNA, which fits well with the genetics, biochemistry, and X-ray crystal structures describing the individual proteins and their interactions with DNA, provides a new structural perspective into the molecular mechanism and regulation of the recombination reaction and illustrates a design by which different families of higher-order complexes can be assembled from different numbers and combinations of the same few proteins.
Youngs, Noah; Penfold-Brown, Duncan; Drew, Kevin; Shasha, Dennis; Bonneau, Richard
2013-05-01
Computational biologists have demonstrated the utility of using machine learning methods to predict protein function from an integration of multiple genome-wide data types. Yet, even the best performing function prediction algorithms rely on heuristics for important components of the algorithm, such as choosing negative examples (proteins without a given function) or determining key parameters. The improper choice of negative examples, in particular, can hamper the accuracy of protein function prediction. We present a novel approach for choosing negative examples, using a parameterizable Bayesian prior computed from all observed annotation data, which also generates priors used during function prediction. We incorporate this new method into the GeneMANIA function prediction algorithm and demonstrate improved accuracy of our algorithm over current top-performing function prediction methods on the yeast and mouse proteomes across all metrics tested. Code and Data are available at: http://bonneaulab.bio.nyu.edu/funcprop.html
Tian, Xin; Xin, Mingyuan; Luo, Jian; Liu, Mingyao; Jiang, Zhenran
2017-02-01
The selection of relevant genes for breast cancer metastasis is critical for the treatment and prognosis of cancer patients. Although much effort has been devoted to the gene selection procedures by use of different statistical analysis methods or computational techniques, the interpretation of the variables in the resulting survival models has been limited so far. This article proposes a new Random Forest (RF)-based algorithm to identify important variables highly related with breast cancer metastasis, which is based on the important scores of two variable selection algorithms, including the mean decrease Gini (MDG) criteria of Random Forest and the GeneRank algorithm with protein-protein interaction (PPI) information. The new gene selection algorithm can be called PPIRF. The improved prediction accuracy fully illustrated the reliability and high interpretability of gene list selected by the PPIRF approach.
Computational approaches for the classification of seed storage proteins.
Radhika, V; Rao, V Sree Hari
2015-07-01
Seed storage proteins comprise a major part of the protein content of the seed and have an important role on the quality of the seed. These storage proteins are important because they determine the total protein content and have an effect on the nutritional quality and functional properties for food processing. Transgenic plants are being used to develop improved lines for incorporation into plant breeding programs and the nutrient composition of seeds is a major target of molecular breeding programs. Hence, classification of these proteins is crucial for the development of superior varieties with improved nutritional quality. In this study we have applied machine learning algorithms for classification of seed storage proteins. We have presented an algorithm based on nearest neighbor approach for classification of seed storage proteins and compared its performance with decision tree J48, multilayer perceptron neural (MLP) network and support vector machine (SVM) libSVM. The model based on our algorithm has been able to give higher classification accuracy in comparison to the other methods.
Hummer, Gerhard
2015-01-01
We present a new algorithm for simulating reaction-diffusion equations at single-particle resolution. Our algorithm is designed to be both accurate and simple to implement, and to be applicable to large and heterogeneous systems, including those arising in systems biology applications. We combine the use of the exact Green's function for a pair of reacting particles with the approximate free-diffusion propagator for position updates to particles. Trajectory reweighting in our free-propagator reweighting (FPR) method recovers the exact association rates for a pair of interacting particles at all times. FPR simulations of many-body systems accurately reproduce the theoretically known dynamic behavior for a variety of different reaction types. FPR does not suffer from the loss of efficiency common to other path-reweighting schemes, first, because corrections apply only in the immediate vicinity of reacting particles and, second, because by construction the average weight factor equals one upon leaving this reaction zone. FPR applications include the modeling of pathways and networks of protein-driven processes where reaction rates can vary widely and thousands of proteins may participate in the formation of large assemblies. With a limited amount of bookkeeping necessary to ensure proper association rates for each reactant pair, FPR can account for changes to reaction rates or diffusion constants as a result of reaction events. Importantly, FPR can also be extended to physical descriptions of protein interactions with long-range forces, as we demonstrate here for Coulombic interactions. PMID:26005592
Huang, Xiaoqiang; Han, Kehang; Zhu, Yushan
2013-01-01
A systematic optimization model for binding sequence selection in computational enzyme design was developed based on the transition state theory of enzyme catalysis and graph-theoretical modeling. The saddle point on the free energy surface of the reaction system was represented by catalytic geometrical constraints, and the binding energy between the active site and transition state was minimized to reduce the activation energy barrier. The resulting hyperscale combinatorial optimization problem was tackled using a novel heuristic global optimization algorithm, which was inspired and tested by the protein core sequence selection problem. The sequence recapitulation tests on native active sites for two enzyme catalyzed hydrolytic reactions were applied to evaluate the predictive power of the design methodology. The results of the calculation show that most of the native binding sites can be successfully identified if the catalytic geometrical constraints and the structural motifs of the substrate are taken into account. Reliably predicting active site sequences may have significant implications for the creation of novel enzymes that are capable of catalyzing targeted chemical reactions. PMID:23649589
Mehranfar, Adele; Ghadiri, Nasser; Kouhsar, Morteza; Golshani, Ashkan
2017-09-01
Detecting the protein complexes is an important task in analyzing the protein interaction networks. Although many algorithms predict protein complexes in different ways, surveys on the interaction networks indicate that about 50% of detected interactions are false positives. Consequently, the accuracy of existing methods needs to be improved. In this paper we propose a novel algorithm to detect the protein complexes in 'noisy' protein interaction data. First, we integrate several biological data sources to determine the reliability of each interaction and determine more accurate weights for the interactions. A data fusion component is used for this step, based on the interval type-2 fuzzy voter that provides an efficient combination of the information sources. This fusion component detects the errors and diminishes their effect on the detection protein complexes. So in the first step, the reliability scores have been assigned for every interaction in the network. In the second step, we have proposed a general protein complex detection algorithm by exploiting and adopting the strong points of other algorithms and existing hypotheses regarding real complexes. Finally, the proposed method has been applied for the yeast interaction datasets for predicting the interactions. The results show that our framework has a better performance regarding precision and F-measure than the existing approaches. Copyright © 2017 Elsevier Ltd. All rights reserved.
Geometric modeling of subcellular structures, organelles, and multiprotein complexes
Feng, Xin; Xia, Kelin; Tong, Yiying; Wei, Guo-Wei
2013-01-01
SUMMARY Recently, the structure, function, stability, and dynamics of subcellular structures, organelles, and multi-protein complexes have emerged as a leading interest in structural biology. Geometric modeling not only provides visualizations of shapes for large biomolecular complexes but also fills the gap between structural information and theoretical modeling, and enables the understanding of function, stability, and dynamics. This paper introduces a suite of computational tools for volumetric data processing, information extraction, surface mesh rendering, geometric measurement, and curvature estimation of biomolecular complexes. Particular emphasis is given to the modeling of cryo-electron microscopy data. Lagrangian-triangle meshes are employed for the surface presentation. On the basis of this representation, algorithms are developed for surface area and surface-enclosed volume calculation, and curvature estimation. Methods for volumetric meshing have also been presented. Because the technological development in computer science and mathematics has led to multiple choices at each stage of the geometric modeling, we discuss the rationales in the design and selection of various algorithms. Analytical models are designed to test the computational accuracy and convergence of proposed algorithms. Finally, we select a set of six cryo-electron microscopy data representing typical subcellular complexes to demonstrate the efficacy of the proposed algorithms in handling biomolecular surfaces and explore their capability of geometric characterization of binding targets. This paper offers a comprehensive protocol for the geometric modeling of subcellular structures, organelles, and multiprotein complexes. PMID:23212797
Heterodimer Binding Scaffolds Recognition via the Analysis of Kinetically Hot Residues.
Perišić, Ognjen
2018-03-16
Physical interactions between proteins are often difficult to decipher. The aim of this paper is to present an algorithm that is designed to recognize binding patches and supporting structural scaffolds of interacting heterodimer proteins using the Gaussian Network Model (GNM). The recognition is based on the (self) adjustable identification of kinetically hot residues and their connection to possible binding scaffolds. The kinetically hot residues are residues with the lowest entropy, i.e., the highest contribution to the weighted sum of the fastest modes per chain extracted via GNM. The algorithm adjusts the number of fast modes in the GNM's weighted sum calculation using the ratio of predicted and expected numbers of target residues (contact and the neighboring first-layer residues). This approach produces very good results when applied to dimers with high protein sequence length ratios. The protocol's ability to recognize near native decoys was compared to the ability of the residue-level statistical potential of Lu and Skolnick using the Sternberg and Vakser decoy dimers sets. The statistical potential produced better overall results, but in a number of cases its predicting ability was comparable, or even inferior, to the prediction ability of the adjustable GNM approach. The results presented in this paper suggest that in heterodimers at least one protein has interacting scaffold determined by the immovable, kinetically hot residues. In many cases, interacting proteins (especially if being of noticeably different sizes) either behave as a rigid lock and key or, presumably, exhibit the opposite dynamic behavior. While the binding surface of one protein is rigid and stable, its partner's interacting scaffold is more flexible and adaptable.
Loewenstein, Yaniv; Portugaly, Elon; Fromer, Menachem; Linial, Michal
2008-07-01
UPGMA (average linking) is probably the most popular algorithm for hierarchical data clustering, especially in computational biology. However, UPGMA requires the entire dissimilarity matrix in memory. Due to this prohibitive requirement, UPGMA is not scalable to very large datasets. We present a novel class of memory-constrained UPGMA (MC-UPGMA) algorithms. Given any practical memory size constraint, this framework guarantees the correct clustering solution without explicitly requiring all dissimilarities in memory. The algorithms are general and are applicable to any dataset. We present a data-dependent characterization of hardness and clustering efficiency. The presented concepts are applicable to any agglomerative clustering formulation. We apply our algorithm to the entire collection of protein sequences, to automatically build a comprehensive evolutionary-driven hierarchy of proteins from sequence alone. The newly created tree captures protein families better than state-of-the-art large-scale methods such as CluSTr, ProtoNet4 or single-linkage clustering. We demonstrate that leveraging the entire mass embodied in all sequence similarities allows to significantly improve on current protein family clusterings which are unable to directly tackle the sheer mass of this data. Furthermore, we argue that non-metric constraints are an inherent complexity of the sequence space and should not be overlooked. The robustness of UPGMA allows significant improvement, especially for multidomain proteins, and for large or divergent families. A comprehensive tree built from all UniProt sequence similarities, together with navigation and classification tools will be made available as part of the ProtoNet service. A C++ implementation of the algorithm is available on request.
An extensive library of surrogate peptides for all human proteins.
Mohammed, Yassene; Borchers, Christoph H
2015-11-03
Selecting the most appropriate surrogate peptides to represent a target protein is a major component of experimental design in Multiple Reaction Monitoring (MRM). Our software PeptidePicker with its v-score remains distinctive in its approach of integrating information about the proteins, their tryptic peptides, and the suitability of these peptides for MRM that is available online in UniProtKB, NCBI's dbSNP, ExPASy, PeptideAtlas, PRIDE, and GPMDB. The scoring algorithm reflects our "best knowledge" for selecting candidate peptides for MRM, based on the uniqueness of the peptide in the targeted proteome, its physiochemical properties, and whether it has previously been observed. Here we present an updated approach where we have already compiled a list of all possible surrogate peptides of the human proteome. Using our stringent selection criteria, the list includes 165k suitable MRM peptides covering 17k proteins of the human reviewed proteins in UniProtKB. Compared to average of 2-4min per protein for retrieving and integrating the information, the precompiled list includes all peptides available instantly. This allows a more cohesive and faster design of a multiplexed MRM experiment and provides insights into evidence for a protein's existence. We will keep this list up-to-date as proteomics data repositories continue to grow. This article is part of a Special Issue entitled: Computational Proteomics. Copyright © 2015 Elsevier B.V. All rights reserved.
De novo design and engineering of functional metal and porphyrin-binding protein domains
NASA Astrophysics Data System (ADS)
Everson, Bernard H.
In this work, I describe an approach to the rational, iterative design and characterization of two functional cofactor-binding protein domains. First, a hybrid computational/experimental method was developed with the aim of algorithmically generating a suite of porphyrin-binding protein sequences with minimal mutual sequence information. This method was explored by generating libraries of sequences, which were then expressed and evaluated for function. One successful sequence is shown to bind a variety of porphyrin-like cofactors, and exhibits light- activated electron transfer in mixed hemin:chlorin e6 and hemin:Zn(II)-protoporphyrin IX complexes. These results imply that many sophisticated functions such as cofactor binding and electron transfer require only a very small number of residue positions in a protein sequence to be fixed. Net charge and hydrophobic content are important in determining protein solubility and stability. Accordingly, rational modifications were made to the aforementioned design procedure in order to improve its overall success rate. The effects of these modifications are explored using two `next-generation' sequence libraries, which were separately expressed and evaluated. Particular modifications to these design parameters are demonstrated to effectively double the purification success rate of the procedure. Finally, I describe the redesign of the artificial di-iron protein DF2 into CDM13, a single chain di-Manganese four-helix bundle. CDM13 acts as a functional model of natural manganese catalase, exhibiting a kcat of 0.08s-1 under steady-state conditions. The bound manganese cofactors have a reduction potential of +805 mV vs NHE, which is too high for efficient dismutation of hydrogen peroxide. These results indicate that as a high-potential manganese complex, CDM13 may represent a promising first step toward a polypeptide model of the Oxygen Evolving Complex of the photosynthetic enzyme Photosystem II.
An improved stochastic fractal search algorithm for 3D protein structure prediction.
Zhou, Changjun; Sun, Chuan; Wang, Bin; Wang, Xiaojun
2018-05-03
Protein structure prediction (PSP) is a significant area for biological information research, disease treatment, and drug development and so on. In this paper, three-dimensional structures of proteins are predicted based on the known amino acid sequences, and the structure prediction problem is transformed into a typical NP problem by an AB off-lattice model. This work applies a novel improved Stochastic Fractal Search algorithm (ISFS) to solve the problem. The Stochastic Fractal Search algorithm (SFS) is an effective evolutionary algorithm that performs well in exploring the search space but falls into local minimums sometimes. In order to avoid the weakness, Lvy flight and internal feedback information are introduced in ISFS. In the experimental process, simulations are conducted by ISFS algorithm on Fibonacci sequences and real peptide sequences. Experimental results prove that the ISFS performs more efficiently and robust in terms of finding the global minimum and avoiding getting stuck in local minimums.
PROTERAN: animated terrain evolution for visual analysis of patterns in protein folding trajectory.
Zhou, Ruhong; Parida, Laxmi; Kapila, Kush; Mudur, Sudhir
2007-01-01
The mechanism of protein folding remains largely a mystery in molecular biology, despite the enormous effort from many groups in the past decades. Currently, the protein folding mechanism is often characterized by calculating the free energy landscape versus various reaction coordinates such as the fraction of native contacts, the radius of gyration and so on. In this paper, we present an integrated approach towards understanding the folding process via visual analysis of patterns of these reaction coordinates. The three disparate processes (1) protein folding simulation, (2) pattern elicitation and (3) visualization of patterns, work in tandem. Thus as the protein folds, the changing landscape in the pattern space can be viewed via the visualization tool, PROTERAN, a program we developed for this purpose. We first present an incremental (on-line) trie-based pattern discovery algorithm to elicit the patterns and then describe the terrain metaphor based visualization tool. Using two example small proteins, a beta-hairpin and a designed protein Trp-cage, we next demonstrate that this combined pattern discovery and visualization approach extracts crucial information about protein folding intermediates and mechanism.
Campagne, F; Weinstein, H
1999-01-01
An algorithmic method for drawing residue-based schematic diagrams of proteins on a 2D page is presented and illustrated. The method allows the creation of rendering engines dedicated to a given family of sequences, or fold. The initial implementation provides an engine that can produce a 2D diagram representing secondary structure for any transmembrane protein sequence. We present the details of the strategy for automating the drawing of these diagrams. The most important part of this strategy is the development of an algorithm for laying out residues of a loop that connects to arbitrary points of a 2D plane. As implemented, this algorithm is suitable for real-time modification of the loop layout. This work is of interest for the representation and analysis of data from (1) protein databases, (2) mutagenesis results, or (3) various kinds of protein context-dependent annotations or data.
Accelerating Molecular Dynamic Simulation on Graphics Processing Units
Friedrichs, Mark S.; Eastman, Peter; Vaidyanathan, Vishal; Houston, Mike; Legrand, Scott; Beberg, Adam L.; Ensign, Daniel L.; Bruns, Christopher M.; Pande, Vijay S.
2009-01-01
We describe a complete implementation of all-atom protein molecular dynamics running entirely on a graphics processing unit (GPU), including all standard force field terms, integration, constraints, and implicit solvent. We discuss the design of our algorithms and important optimizations needed to fully take advantage of a GPU. We evaluate its performance, and show that it can be more than 700 times faster than a conventional implementation running on a single CPU core. PMID:19191337
Ball, David A; Lux, Matthew W; Graef, Russell R; Peterson, Matthew W; Valenti, Jane D; Dileo, John; Peccoud, Jean
2010-01-01
The concept of co-design is common in engineering, where it is necessary, for example, to determine the optimal partitioning between hardware and software of the implementation of a system features. Here we propose to adapt co-design methodologies for synthetic biology. As a test case, we have designed an environmental sensing device that detects the presence of three chemicals, and returns an output only if at least two of the three chemicals are present. We show that the logical operations can be implemented in three different design domains: (1) the transcriptional domain using synthetically designed hybrid promoters, (2) the protein domain using bi-molecular fluorescence complementation, and (3) the fluorescence domain using spectral unmixing and relying on electronic processing. We discuss how these heterogeneous design strategies could be formalized to develop co-design algorithms capable of identifying optimal designs meeting user specifications.
Cankorur-Cetinkaya, Ayca; Dias, Joao M L; Kludas, Jana; Slater, Nigel K H; Rousu, Juho; Oliver, Stephen G; Dikicioglu, Duygu
2017-06-01
Multiple interacting factors affect the performance of engineered biological systems in synthetic biology projects. The complexity of these biological systems means that experimental design should often be treated as a multiparametric optimization problem. However, the available methodologies are either impractical, due to a combinatorial explosion in the number of experiments to be performed, or are inaccessible to most experimentalists due to the lack of publicly available, user-friendly software. Although evolutionary algorithms may be employed as alternative approaches to optimize experimental design, the lack of simple-to-use software again restricts their use to specialist practitioners. In addition, the lack of subsidiary approaches to further investigate critical factors and their interactions prevents the full analysis and exploitation of the biotechnological system. We have addressed these problems and, here, provide a simple-to-use and freely available graphical user interface to empower a broad range of experimental biologists to employ complex evolutionary algorithms to optimize their experimental designs. Our approach exploits a Genetic Algorithm to discover the subspace containing the optimal combination of parameters, and Symbolic Regression to construct a model to evaluate the sensitivity of the experiment to each parameter under investigation. We demonstrate the utility of this method using an example in which the culture conditions for the microbial production of a bioactive human protein are optimized. CamOptimus is available through: (https://doi.org/10.17863/CAM.10257).
Structural alignment of protein descriptors - a combinatorial model.
Antczak, Maciej; Kasprzak, Marta; Lukasiak, Piotr; Blazewicz, Jacek
2016-09-17
Structural alignment of proteins is one of the most challenging problems in molecular biology. The tertiary structure of a protein strictly correlates with its function and computationally predicted structures are nowadays a main premise for understanding the latter. However, computationally derived 3D models often exhibit deviations from the native structure. A way to confirm a model is a comparison with other structures. The structural alignment of a pair of proteins can be defined with the use of a concept of protein descriptors. The protein descriptors are local substructures of protein molecules, which allow us to divide the original problem into a set of subproblems and, consequently, to propose a more efficient algorithmic solution. In the literature, one can find many applications of the descriptors concept that prove its usefulness for insight into protein 3D structures, but the proposed approaches are presented rather from the biological perspective than from the computational or algorithmic point of view. Efficient algorithms for identification and structural comparison of descriptors can become crucial components of methods for structural quality assessment as well as tertiary structure prediction. In this paper, we propose a new combinatorial model and new polynomial-time algorithms for the structural alignment of descriptors. The model is based on the maximum-size assignment problem, which we define here and prove that it can be solved in polynomial time. We demonstrate suitability of this approach by comparison with an exact backtracking algorithm. Besides a simplification coming from the combinatorial modeling, both on the conceptual and complexity level, we gain with this approach high quality of obtained results, in terms of 3D alignment accuracy and processing efficiency. All the proposed algorithms were developed and integrated in a computationally efficient tool descs-standalone, which allows the user to identify and structurally compare descriptors of biological molecules, such as proteins and RNAs. Both PDB (Protein Data Bank) and mmCIF (macromolecular Crystallographic Information File) formats are supported. The proposed tool is available as an open source project stored on GitHub ( https://github.com/mantczak/descs-standalone ).
Identifying Dynamic Protein Complexes Based on Gene Expression Profiles and PPI Networks
Li, Min; Chen, Weijie; Wang, Jianxin; Pan, Yi
2014-01-01
Identification of protein complexes from protein-protein interaction networks has become a key problem for understanding cellular life in postgenomic era. Many computational methods have been proposed for identifying protein complexes. Up to now, the existing computational methods are mostly applied on static PPI networks. However, proteins and their interactions are dynamic in reality. Identifying dynamic protein complexes is more meaningful and challenging. In this paper, a novel algorithm, named DPC, is proposed to identify dynamic protein complexes by integrating PPI data and gene expression profiles. According to Core-Attachment assumption, these proteins which are always active in the molecular cycle are regarded as core proteins. The protein-complex cores are identified from these always active proteins by detecting dense subgraphs. Final protein complexes are extended from the protein-complex cores by adding attachments based on a topological character of “closeness” and dynamic meaning. The protein complexes produced by our algorithm DPC contain two parts: static core expressed in all the molecular cycle and dynamic attachments short-lived. The proposed algorithm DPC was applied on the data of Saccharomyces cerevisiae and the experimental results show that DPC outperforms CMC, MCL, SPICi, HC-PIN, COACH, and Core-Attachment based on the validation of matching with known complexes and hF-measures. PMID:24963481
Discovering protein complexes in protein interaction networks via exploring the weak ties effect
2012-01-01
Background Studying protein complexes is very important in biological processes since it helps reveal the structure-functionality relationships in biological networks and much attention has been paid to accurately predict protein complexes from the increasing amount of protein-protein interaction (PPI) data. Most of the available algorithms are based on the assumption that dense subgraphs correspond to complexes, failing to take into account the inherence organization within protein complex and the roles of edges. Thus, there is a critical need to investigate the possibility of discovering protein complexes using the topological information hidden in edges. Results To provide an investigation of the roles of edges in PPI networks, we show that the edges connecting less similar vertices in topology are more significant in maintaining the global connectivity, indicating the weak ties phenomenon in PPI networks. We further demonstrate that there is a negative relation between the weak tie strength and the topological similarity. By using the bridges, a reliable virtual network is constructed, in which each maximal clique corresponds to the core of a complex. By this notion, the detection of the protein complexes is transformed into a classic all-clique problem. A novel core-attachment based method is developed, which detects the cores and attachments, respectively. A comprehensive comparison among the existing algorithms and our algorithm has been made by comparing the predicted complexes against benchmark complexes. Conclusions We proved that the weak tie effect exists in the PPI network and demonstrated that the density is insufficient to characterize the topological structure of protein complexes. Furthermore, the experimental results on the yeast PPI network show that the proposed method outperforms the state-of-the-art algorithms. The analysis of detected modules by the present algorithm suggests that most of these modules have well biological significance in context of complexes, suggesting that the roles of edges are critical in discovering protein complexes. PMID:23046740
Aligning Biomolecular Networks Using Modular Graph Kernels
NASA Astrophysics Data System (ADS)
Towfic, Fadi; Greenlee, M. Heather West; Honavar, Vasant
Comparative analysis of biomolecular networks constructed using measurements from different conditions, tissues, and organisms offer a powerful approach to understanding the structure, function, dynamics, and evolution of complex biological systems. We explore a class of algorithms for aligning large biomolecular networks by breaking down such networks into subgraphs and computing the alignment of the networks based on the alignment of their subgraphs. The resulting subnetworks are compared using graph kernels as scoring functions. We provide implementations of the resulting algorithms as part of BiNA, an open source biomolecular network alignment toolkit. Our experiments using Drosophila melanogaster, Saccharomyces cerevisiae, Mus musculus and Homo sapiens protein-protein interaction networks extracted from the DIP repository of protein-protein interaction data demonstrate that the performance of the proposed algorithms (as measured by % GO term enrichment of subnetworks identified by the alignment) is competitive with some of the state-of-the-art algorithms for pair-wise alignment of large protein-protein interaction networks. Our results also show that the inter-species similarity scores computed based on graph kernels can be used to cluster the species into a species tree that is consistent with the known phylogenetic relationships among the species.
Comparison of designed and randomly generated catalysts for simple chemical reactions.
Kipnis, Yakov; Baker, David
2012-09-01
There has been recent success in designing enzymes for simple chemical reactions using a two-step protocol. In the first step, a geometric matching algorithm is used to identify naturally occurring protein scaffolds at which predefined idealized active sites can be realized. In the second step, the residues surrounding the transition state model are optimized to increase transition state binding affinity and to bolster the primary catalytic side chains. To improve the design methodology, we investigated how the set of solutions identified by the design calculations relate to the overall set of solutions for two different chemical reactions. Using a TIM barrel scaffold in which catalytically active Kemp eliminase and retroaldolase designs were obtained previously, we carried out activity screens of random libraries made to be compositionally similar to active designs. A small number of active catalysts were found in screens of 10³ variants for each of the two reactions, which differ from the computational designs in that they reuse charged residues already present in the native scaffold. The results suggest that computational design considerably increases the frequency of catalyst generation for active sites involving newly introduced catalytic residues, highlighting the importance of interaction cooperativity in enzyme active sites. Copyright © 2012 The Protein Society.
Problem Solving Techniques for the Design of Algorithms.
ERIC Educational Resources Information Center
Kant, Elaine; Newell, Allen
1984-01-01
Presents model of algorithm design (activity in software development) based on analysis of protocols of two subjects designing three convex hull algorithms. Automation methods, methods for studying algorithm design, role of discovery in problem solving, and comparison of different designs of case study according to model are highlighted.…
Navarro, Gonzalo; Raffinot, Mathieu
2003-01-01
The problem of fast exact and approximate searching for a pattern that contains classes of characters and bounded size gaps (CBG) in a text has a wide range of applications, among which a very important one is protein pattern matching (for instance, one PROSITE protein site is associated with the CBG [RK] - x(2,3) - [DE] - x(2,3) - Y, where the brackets match any of the letters inside, and x(2,3) a gap of length between 2 and 3). Currently, the only way to search for a CBG in a text is to convert it into a full regular expression (RE). However, a RE is more sophisticated than a CBG, and searching for it with a RE pattern matching algorithm complicates the search and makes it slow. This is the reason why we design in this article two new practical CBG matching algorithms that are much simpler and faster than all the RE search techniques. The first one looks exactly once at each text character. The second one does not need to consider all the text characters, and hence it is usually faster than the first one, but in bad cases may have to read the same text character more than once. We then propose a criterion based on the form of the CBG to choose a priori the fastest between both. We also show how to search permitting a few mistakes in the occurrences. We performed many practical experiments using the PROSITE database, and all of them show that our algorithms are the fastest in virtually all cases.
Automatic detection and measurement of viral replication compartments by ellipse adjustment
Garcés, Yasel; Guerrero, Adán; Hidalgo, Paloma; López, Raul Eduardo; Wood, Christopher D.; Gonzalez, Ramón A.; Rendón-Mancha, Juan Manuel
2016-01-01
Viruses employ a variety of strategies to hijack cellular activities through the orchestrated recruitment of macromolecules to specific virus-induced cellular micro-environments. Adenoviruses (Ad) and other DNA viruses induce extensive reorganization of the cell nucleus and formation of nuclear Replication Compartments (RCs), where the viral genome is replicated and expressed. In this work an automatic algorithm designed for detection and segmentation of RCs using ellipses is presented. Unlike algorithms available in the literature, this approach is deterministic, automatic, and can adjust multiple RCs using ellipses. The proposed algorithm is non iterative, computationally efficient and is invariant to affine transformations. The method was validated over both synthetic images and more than 400 real images of Ad-infected cells at various timepoints of the viral replication cycle obtaining relevant information about the biogenesis of adenoviral RCs. As proof of concept the algorithm was then used to quantitatively compare RCs in cells infected with the adenovirus wild type or an adenovirus mutant that is null for expression of a viral protein that is known to affect activities associated with RCs that result in deficient viral progeny production. PMID:27819325
Automatic detection and measurement of viral replication compartments by ellipse adjustment
NASA Astrophysics Data System (ADS)
Garcés, Yasel; Guerrero, Adán; Hidalgo, Paloma; López, Raul Eduardo; Wood, Christopher D.; Gonzalez, Ramón A.; Rendón-Mancha, Juan Manuel
2016-11-01
Viruses employ a variety of strategies to hijack cellular activities through the orchestrated recruitment of macromolecules to specific virus-induced cellular micro-environments. Adenoviruses (Ad) and other DNA viruses induce extensive reorganization of the cell nucleus and formation of nuclear Replication Compartments (RCs), where the viral genome is replicated and expressed. In this work an automatic algorithm designed for detection and segmentation of RCs using ellipses is presented. Unlike algorithms available in the literature, this approach is deterministic, automatic, and can adjust multiple RCs using ellipses. The proposed algorithm is non iterative, computationally efficient and is invariant to affine transformations. The method was validated over both synthetic images and more than 400 real images of Ad-infected cells at various timepoints of the viral replication cycle obtaining relevant information about the biogenesis of adenoviral RCs. As proof of concept the algorithm was then used to quantitatively compare RCs in cells infected with the adenovirus wild type or an adenovirus mutant that is null for expression of a viral protein that is known to affect activities associated with RCs that result in deficient viral progeny production.
Yoon, Young-Gyu; Dai, Peilun; Wohlwend, Jeremy; Chang, Jae-Byum; Marblestone, Adam H.; Boyden, Edward S.
2017-01-01
We here introduce and study the properties, via computer simulation, of a candidate automated approach to algorithmic reconstruction of dense neural morphology, based on simulated data of the kind that would be obtained via two emerging molecular technologies—expansion microscopy (ExM) and in-situ molecular barcoding. We utilize a convolutional neural network to detect neuronal boundaries from protein-tagged plasma membrane images obtained via ExM, as well as a subsequent supervoxel-merging pipeline guided by optical readout of information-rich, cell-specific nucleic acid barcodes. We attempt to use conservative imaging and labeling parameters, with the goal of establishing a baseline case that points to the potential feasibility of optical circuit reconstruction, leaving open the possibility of higher-performance labeling technologies and algorithms. We find that, even with these conservative assumptions, an all-optical approach to dense neural morphology reconstruction may be possible via the proposed algorithmic framework. Future work should explore both the design-space of chemical labels and barcodes, as well as algorithms, to ultimately enable routine, high-performance optical circuit reconstruction. PMID:29114215
Yoon, Young-Gyu; Dai, Peilun; Wohlwend, Jeremy; Chang, Jae-Byum; Marblestone, Adam H; Boyden, Edward S
2017-01-01
We here introduce and study the properties, via computer simulation, of a candidate automated approach to algorithmic reconstruction of dense neural morphology, based on simulated data of the kind that would be obtained via two emerging molecular technologies-expansion microscopy (ExM) and in-situ molecular barcoding. We utilize a convolutional neural network to detect neuronal boundaries from protein-tagged plasma membrane images obtained via ExM, as well as a subsequent supervoxel-merging pipeline guided by optical readout of information-rich, cell-specific nucleic acid barcodes. We attempt to use conservative imaging and labeling parameters, with the goal of establishing a baseline case that points to the potential feasibility of optical circuit reconstruction, leaving open the possibility of higher-performance labeling technologies and algorithms. We find that, even with these conservative assumptions, an all-optical approach to dense neural morphology reconstruction may be possible via the proposed algorithmic framework. Future work should explore both the design-space of chemical labels and barcodes, as well as algorithms, to ultimately enable routine, high-performance optical circuit reconstruction.
Ganapathiraju, Madhavi K; Orii, Naoki
2013-08-30
Advances in biotechnology have created "big-data" situations in molecular and cellular biology. Several sophisticated algorithms have been developed that process big data to generate hundreds of biomedical hypotheses (or predictions). The bottleneck to translating this large number of biological hypotheses is that each of them needs to be studied by experimentation for interpreting its functional significance. Even when the predictions are estimated to be very accurate, from a biologist's perspective, the choice of which of these predictions is to be studied further is made based on factors like availability of reagents and resources and the possibility of formulating some reasonable hypothesis about its biological relevance. When viewed from a global perspective, say from that of a federal funding agency, ideally the choice of which prediction should be studied would be made based on which of them can make the most translational impact. We propose that algorithms be developed to identify which of the computationally generated hypotheses have potential for high translational impact; this way, funding agencies and scientific community can invest resources and drive the research based on a global view of biomedical impact without being deterred by local view of feasibility. In short, data-analytic algorithms analyze big-data and generate hypotheses; in contrast, the proposed inference-analytic algorithms analyze these hypotheses and rank them by predicted biological impact. We demonstrate this through the development of an algorithm to predict biomedical impact of protein-protein interactions (PPIs) which is estimated by the number of future publications that cite the paper which originally reported the PPI. This position paper describes a new computational problem that is relevant in the era of big-data and discusses the challenges that exist in studying this problem, highlighting the need for the scientific community to engage in this line of research. The proposed class of algorithms, namely inference-analytic algorithms, is necessary to ensure that resources are invested in translating those computational outcomes that promise maximum biological impact. Application of this concept to predict biomedical impact of PPIs illustrates not only the concept, but also the challenges in designing these algorithms.
SnapDock—template-based docking by Geometric Hashing
Estrin, Michael; Wolfson, Haim J.
2017-01-01
Abstract Motivation: A highly efficient template-based protein–protein docking algorithm, nicknamed SnapDock, is presented. It employs a Geometric Hashing-based structural alignment scheme to align the target proteins to the interfaces of non-redundant protein–protein interface libraries. Docking of a pair of proteins utilizing the 22 600 interface PIFACE library is performed in < 2 min on the average. A flexible version of the algorithm allowing hinge motion in one of the proteins is presented as well. Results: To evaluate the performance of the algorithm a blind re-modelling of 3547 PDB complexes, which have been uploaded after the PIFACE publication has been performed with success ratio of about 35%. Interestingly, a similar experiment with the template free PatchDock docking algorithm yielded a success rate of about 23% with roughly 1/3 of the solutions different from those of SnapDock. Consequently, the combination of the two methods gave a 42% success ratio. Availability and implementation: A web server of the application is under development. Contact: michaelestrin@gmail.com or wolfson@tau.ac.il PMID:28881968
Designing nucleosomal force sensors
NASA Astrophysics Data System (ADS)
Tompitak, M.; de Bruin, L.; Eslami-Mossallam, B.; Schiessel, H.
2017-05-01
About three quarters of our DNA is wrapped into nucleosomes: DNA spools with a protein core. It is well known that the affinity of a given DNA stretch to be incorporated into a nucleosome depends on the geometry and elasticity of the basepair sequence involved, causing the positioning of nucleosomes. Here we show that DNA elasticity can have a much deeper effect on nucleosomes than just their positioning: it affects their "identities". Employing a recently developed computational algorithm, the mutation Monte Carlo method, we design nucleosomes with surprising physical characteristics. Unlike any other nucleosomes studied so far, these nucleosomes are short-lived when put under mechanical tension whereas other physical properties are largely unaffected. This suggests that the nucleosome, the most abundant DNA-protein complex in our cells, might more properly be considered a class of complexes with a wide array of physical properties, and raises the possibility that evolution has shaped various nucleosome species according to their genomic context.
NASA Astrophysics Data System (ADS)
Ratamero, Erick Martins; Bellini, Dom; Dowson, Christopher G.; Römer, Rudolf A.
2018-06-01
The ability to precisely visualize the atomic geometry of the interactions between a drug and its protein target in structural models is critical in predicting the correct modifications in previously identified inhibitors to create more effective next generation drugs. It is currently common practice among medicinal chemists while attempting the above to access the information contained in three-dimensional structures by using two-dimensional projections, which can preclude disclosure of useful features. A more accessible and intuitive visualization of the three-dimensional configuration of the atomic geometry in the models can be achieved through the implementation of immersive virtual reality (VR). While bespoke commercial VR suites are available, in this work, we present a freely available software pipeline for visualising protein structures through VR. New consumer hardware, such as the uc(HTC Vive) and the uc(Oculus Rift) utilized in this study, are available at reasonable prices. As an instructive example, we have combined VR visualization with fast algorithms for simulating intramolecular motions of protein flexibility, in an effort to further improve structure-led drug design by exposing molecular interactions that might be hidden in the less informative static models. This is a paradigmatic test case scenario for many similar applications in computer-aided molecular studies and design.
Ratamero, Erick Martins; Bellini, Dom; Dowson, Christopher G; Römer, Rudolf A
2018-06-07
The ability to precisely visualize the atomic geometry of the interactions between a drug and its protein target in structural models is critical in predicting the correct modifications in previously identified inhibitors to create more effective next generation drugs. It is currently common practice among medicinal chemists while attempting the above to access the information contained in three-dimensional structures by using two-dimensional projections, which can preclude disclosure of useful features. A more accessible and intuitive visualization of the three-dimensional configuration of the atomic geometry in the models can be achieved through the implementation of immersive virtual reality (VR). While bespoke commercial VR suites are available, in this work, we present a freely available software pipeline for visualising protein structures through VR. New consumer hardware, such as the HTC VIVE and the OCULUS RIFT utilized in this study, are available at reasonable prices. As an instructive example, we have combined VR visualization with fast algorithms for simulating intramolecular motions of protein flexibility, in an effort to further improve structure-led drug design by exposing molecular interactions that might be hidden in the less informative static models. This is a paradigmatic test case scenario for many similar applications in computer-aided molecular studies and design.
Humbeck, Lina; Weigang, Sebastian; Schäfer, Till; Mutzel, Petra; Koch, Oliver
2018-03-20
A common issue during drug design and development is the discovery of novel scaffolds for protein targets. On the one hand the chemical space of purchasable compounds is rather limited; on the other hand artificially generated molecules suffer from a grave lack of accessibility in practice. Therefore, we generated a novel virtual library of small molecules which are synthesizable from purchasable educts, called CHIPMUNK (CHemically feasible In silico Public Molecular UNiverse Knowledge base). Altogether, CHIPMUNK covers over 95 million compounds and encompasses regions of the chemical space that are not covered by existing databases. The coverage of CHIPMUNK exceeds the chemical space spanned by the Lipinski rule of five to foster the exploration of novel and difficult target classes. The analysis of the generated property space reveals that CHIPMUNK is well suited for the design of protein-protein interaction inhibitors (PPIIs). Furthermore, a recently developed structural clustering algorithm (StruClus) for big data was used to partition the sub-libraries into meaningful subsets and assist scientists to process the large amount of data. These clustered subsets also contain the target space based on ChEMBL data which was included during clustering. © 2018 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim.
Loewenstein, Yaniv; Portugaly, Elon; Fromer, Menachem; Linial, Michal
2008-01-01
Motivation: UPGMA (average linking) is probably the most popular algorithm for hierarchical data clustering, especially in computational biology. However, UPGMA requires the entire dissimilarity matrix in memory. Due to this prohibitive requirement, UPGMA is not scalable to very large datasets. Application: We present a novel class of memory-constrained UPGMA (MC-UPGMA) algorithms. Given any practical memory size constraint, this framework guarantees the correct clustering solution without explicitly requiring all dissimilarities in memory. The algorithms are general and are applicable to any dataset. We present a data-dependent characterization of hardness and clustering efficiency. The presented concepts are applicable to any agglomerative clustering formulation. Results: We apply our algorithm to the entire collection of protein sequences, to automatically build a comprehensive evolutionary-driven hierarchy of proteins from sequence alone. The newly created tree captures protein families better than state-of-the-art large-scale methods such as CluSTr, ProtoNet4 or single-linkage clustering. We demonstrate that leveraging the entire mass embodied in all sequence similarities allows to significantly improve on current protein family clusterings which are unable to directly tackle the sheer mass of this data. Furthermore, we argue that non-metric constraints are an inherent complexity of the sequence space and should not be overlooked. The robustness of UPGMA allows significant improvement, especially for multidomain proteins, and for large or divergent families. Availability: A comprehensive tree built from all UniProt sequence similarities, together with navigation and classification tools will be made available as part of the ProtoNet service. A C++ implementation of the algorithm is available on request. Contact: lonshy@cs.huji.ac.il PMID:18586742
A Consensus Method for the Prediction of ‘Aggregation-Prone’ Peptides in Globular Proteins
Tsolis, Antonios C.; Papandreou, Nikos C.; Iconomidou, Vassiliki A.; Hamodrakas, Stavros J.
2013-01-01
The purpose of this work was to construct a consensus prediction algorithm of ‘aggregation-prone’ peptides in globular proteins, combining existing tools. This allows comparison of the different algorithms and the production of more objective and accurate results. Eleven (11) individual methods are combined and produce AMYLPRED2, a publicly, freely available web tool to academic users (http://biophysics.biol.uoa.gr/AMYLPRED2), for the consensus prediction of amyloidogenic determinants/‘aggregation-prone’ peptides in proteins, from sequence alone. The performance of AMYLPRED2 indicates that it functions better than individual aggregation-prediction algorithms, as perhaps expected. AMYLPRED2 is a useful tool for identifying amyloid-forming regions in proteins that are associated with several conformational diseases, called amyloidoses, such as Altzheimer's, Parkinson's, prion diseases and type II diabetes. It may also be useful for understanding the properties of protein folding and misfolding and for helping to the control of protein aggregation/solubility in biotechnology (recombinant proteins forming bacterial inclusion bodies) and biotherapeutics (monoclonal antibodies and biopharmaceutical proteins). PMID:23326595
Generate Optimized Genetic Rhythm for Enzyme Expression in Non-native systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
2016-11-03
Most amino acids are represented by more than one codon, resulting in redundancy in the genetic code. Silent codon substitutions that do not alter the amino acid sequence still have an effect on protein expression. We have developed an algorithm, GoGREEN, to enhance the expression of foreign proteins in a host organism. GoGREEN selects codons according to frequency patterns seen in the gene of interest using the codon usage table from the host organism. GoGREEN is also designed to accommodate gaps in the sequence.This software takes for input (1) the aligned protein sequences for genes the user wishes to express,more » (2) the codon usage table for the host organism, (3) and the DNA sequence for the target protein found in the host organism. The program will select codons based on codon usage patterns for the target DNA sequence. The program will also select codons for “gaps” found in the aligned protein sequences using the codon usage table from the host organism.« less
Mallika, V; Sivakumar, K C; Jaichand, S; Soniya, E V
2010-07-13
Type III Polyketide synthases (PKS) are family of proteins considered to have significant roles in the biosynthesis of various polyketides in plants, fungi and bacteria. As these proteins shows positive effects to human health, more researches are going on regarding this particular protein. Developing a tool to identify the probability of sequence being a type III polyketide synthase will minimize the time consumption and manpower efforts. In this approach, we have designed and implemented PKSIIIpred, a high performance prediction server for type III PKS where the classifier is Support Vector Machines (SVMs). Based on the limited training dataset, the tool efficiently predicts the type III PKS superfamily of proteins with high sensitivity and specificity. The PKSIIIpred is available at http://type3pks.in/prediction/. We expect that this tool may serve as a useful resource for type III PKS researchers. Currently work is being progressed for further betterment of prediction accuracy by including more sequence features in the training dataset.
Simakov, Nikolay A.
2010-01-01
A soft repulsion (SR) model of short range interactions between mobile ions and protein atoms is introduced in the framework of continuum representation of the protein and solvent. The Poisson-Nernst-Plank (PNP) theory of ion transport through biological channels is modified to incorporate this soft wall protein model. Two sets of SR parameters are introduced: the first is parameterized for all essential amino acid residues using all atom molecular dynamic simulations; the second is a truncated Lennard – Jones potential. We have further designed an energy based algorithm for the determination of the ion accessible volume, which is appropriate for a particular system discretization. The effects of these models of short-range interaction were tested by computing current-voltage characteristics of the α-hemolysin channel. The introduced SR potentials significantly improve prediction of channel selectivity. In addition, we studied the effect of choice of some space-dependent diffusion coefficient distributions on the predicted current-voltage properties. We conclude that the diffusion coefficient distributions largely affect total currents and have little effect on rectifications, selectivity or reversal potential. The PNP-SR algorithm is implemented in a new efficient parallel Poisson, Poisson-Boltzman and PNP equation solver, also incorporated in a graphical molecular modeling package HARLEM. PMID:21028776
Bryce, Richard A
2011-04-01
The ability to accurately predict the interaction of a ligand with its receptor is a key limitation in computer-aided drug design approaches such as virtual screening and de novo design. In this article, we examine current strategies for a physics-based approach to scoring of protein-ligand affinity, as well as outlining recent developments in force fields and quantum chemical techniques. We also consider advances in the development and application of simulation-based free energy methods to study protein-ligand interactions. Fuelled by recent advances in computational algorithms and hardware, there is the opportunity for increased integration of physics-based scoring approaches at earlier stages in computationally guided drug discovery. Specifically, we envisage increased use of implicit solvent models and simulation-based scoring methods as tools for computing the affinities of large virtual ligand libraries. Approaches based on end point simulations and reference potentials allow the application of more advanced potential energy functions to prediction of protein-ligand binding affinities. Comprehensive evaluation of polarizable force fields and quantum mechanical (QM)/molecular mechanical and QM methods in scoring of protein-ligand interactions is required, particularly in their ability to address challenging targets such as metalloproteins and other proteins that make highly polar interactions. Finally, we anticipate increasingly quantitative free energy perturbation and thermodynamic integration methods that are practical for optimization of hits obtained from screened ligand libraries.
GRID: a high-resolution protein structure refinement algorithm.
Chitsaz, Mohsen; Mayo, Stephen L
2013-03-05
The energy-based refinement of protein structures generated by fold prediction algorithms to atomic-level accuracy remains a major challenge in structural biology. Energy-based refinement is mainly dependent on two components: (1) sufficiently accurate force fields, and (2) efficient conformational space search algorithms. Focusing on the latter, we developed a high-resolution refinement algorithm called GRID. It takes a three-dimensional protein structure as input and, using an all-atom force field, attempts to improve the energy of the structure by systematically perturbing backbone dihedrals and side-chain rotamer conformations. We compare GRID to Backrub, a stochastic algorithm that has been shown to predict a significant fraction of the conformational changes that occur with point mutations. We applied GRID and Backrub to 10 high-resolution (≤ 2.8 Å) crystal structures from the Protein Data Bank and measured the energy improvements obtained and the computation times required to achieve them. GRID resulted in energy improvements that were significantly better than those attained by Backrub while expending about the same amount of computational resources. GRID resulted in relaxed structures that had slightly higher backbone RMSDs compared to Backrub relative to the starting crystal structures. The average RMSD was 0.25 ± 0.02 Å for GRID versus 0.14 ± 0.04 Å for Backrub. These relatively minor deviations indicate that both algorithms generate structures that retain their original topologies, as expected given the nature of the algorithms. Copyright © 2012 Wiley Periodicals, Inc.
Empirical scoring functions for advanced protein-ligand docking with PLANTS.
Korb, Oliver; Stützle, Thomas; Exner, Thomas E
2009-01-01
In this paper we present two empirical scoring functions, PLANTS(CHEMPLP) and PLANTS(PLP), designed for our docking algorithm PLANTS (Protein-Ligand ANT System), which is based on ant colony optimization (ACO). They are related, regarding their functional form, to parts of already published scoring functions and force fields. The parametrization procedure described here was able to identify several parameter settings showing an excellent performance for the task of pose prediction on two test sets comprising 298 complexes in total. Up to 87% of the complexes of the Astex diverse set and 77% of the CCDC/Astex clean listnc (noncovalently bound complexes of the clean list) could be reproduced with root-mean-square deviations of less than 2 A with respect to the experimentally determined structures. A comparison with the state-of-the-art docking tool GOLD clearly shows that this is, especially for the druglike Astex diverse set, an improvement in pose prediction performance. Additionally, optimized parameter settings for the search algorithm were identified, which can be used to balance pose prediction reliability and search speed.
Grafskaia, Ekaterina N; Polina, Nadezhda F; Babenko, Vladislav V; Kharlampieva, Daria D; Bobrovsky, Pavel A; Manuvera, Valentin A; Farafonova, Tatyana E; Anikanov, Nikolay A; Lazarev, Vassili N
2018-04-01
As essential conservative component of the innate immune systems of living organisms, antimicrobial peptides (AMPs) could complement pharmaceuticals that increasingly fail to combat various pathogens exhibiting increased resistance to microbial antibiotics. Among the properties of AMPs that suggest their potential as therapeutic agents, diverse peptides in the venoms of various predators demonstrate antimicrobial activity and kill a wide range of microorganisms. To identify potent AMPs, the study reported here involved a transcriptomic profiling of the tentacle secretion of the sea anemone Cnidopus japonicus. An in silico search algorithm designed to discover toxin-like proteins containing AMPs was developed based on the evaluation of the properties and structural peculiarities of amino acid sequences. The algorithm revealed new proteins of the anemone containing antimicrobial candidate sequences, and 10 AMPs verified using high-throughput proteomics were synthesized. The antimicrobial activity of the candidate molecules was experimentally estimated against Gram-positive and -negative bacteria. Ultimately, three peptides exhibited antimicrobial activity against bacterial strains, which suggests that the method can be applied to reveal new AMPs in the venoms of other predators as well.
Signature molecular descriptor : advanced applications.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Visco, Donald Patrick, Jr.
In this work we report on the development of the Signature Molecular Descriptor (or Signature) for use in the solution of inverse design problems as well as in highthroughput screening applications. The ultimate goal of using Signature is to identify novel and non-intuitive chemical structures with optimal predicted properties for a given application. We demonstrate this in three studies: green solvent design, glucocorticoid receptor ligand design and the design of inhibitors for Factor XIa. In many areas of engineering, compounds are designed and/or modified in incremental ways which rely upon heuristics or institutional knowledge. Often multiple experiments are performed andmore » the optimal compound is identified in this brute-force fashion. Perhaps a traditional chemical scaffold is identified and movement of a substituent group around a ring constitutes the whole of the design process. Also notably, a chemical being evaluated in one area might demonstrate properties very attractive in another area and serendipity was the mechanism for solution. In contrast to such approaches, computer-aided molecular design (CAMD) looks to encompass both experimental and heuristic-based knowledge into a strategy that will design a molecule on a computer to meet a given target. Depending on the algorithm employed, the molecule which is designed might be quite novel (re: no CAS registration number) and/or non-intuitive relative to what is known about the problem at hand. While CAMD is a fairly recent strategy (dating to the early 1980s), it contains a variety of bottlenecks and limitations which have prevented the technique from garnering more attention in the academic, governmental and industrial institutions. A main reason for this is how the molecules are described in the computer. This step can control how models are developed for the properties of interest on a given problem as well as how to go from an output of the algorithm to an actual chemical structure. This report provides details on a technique to describe molecules on a computer, called Signature, as well as the computer-aided molecule design algorithm built around Signature. Two applications are provided of the CAMD algorithm with Signature. The first describes the design of green solvents based on data in the GlaxoSmithKline (GSK) Solvent Selection Guide. The second provides novel non-steroidal glucocorticoid receptor ligands with some optimally predicted properties. In addition to using the CAMD algorithm with Signature, it is demonstrated how to employ Signature in a high-throughput screening study. Here, after classifying both active and inactive inhibitors for the protein Factor XIa using Signature, the model developed is used to screen a large, publicly-available database called PubChem for the most active compounds.« less
PGCA: An algorithm to link protein groups created from MS/MS data
Sasaki, Mayu; Hollander, Zsuzsanna; Smith, Derek; McManus, Bruce; McMaster, W. Robert; Ng, Raymond T.; Cohen Freue, Gabriela V.
2017-01-01
The quantitation of proteins using shotgun proteomics has gained popularity in the last decades, simplifying sample handling procedures, removing extensive protein separation steps and achieving a relatively high throughput readout. The process starts with the digestion of the protein mixture into peptides, which are then separated by liquid chromatography and sequenced by tandem mass spectrometry (MS/MS). At the end of the workflow, recovering the identity of the proteins originally present in the sample is often a difficult and ambiguous process, because more than one protein identifier may match a set of peptides identified from the MS/MS spectra. To address this identification problem, many MS/MS data processing software tools combine all plausible protein identifiers matching a common set of peptides into a protein group. However, this solution introduces new challenges in studies with multiple experimental runs, which can be characterized by three main factors: i) protein groups’ identifiers are local, i.e., they vary run to run, ii) the composition of each group may change across runs, and iii) the supporting evidence of proteins within each group may also change across runs. Since in general there is no conclusive evidence about the absence of proteins in the groups, protein groups need to be linked across different runs in subsequent statistical analyses. We propose an algorithm, called Protein Group Code Algorithm (PGCA), to link groups from multiple experimental runs by forming global protein groups from connected local groups. The algorithm is computationally inexpensive and enables the connection and analysis of lists of protein groups across runs needed in biomarkers studies. We illustrate the identification problem and the stability of the PGCA mapping using 65 iTRAQ experimental runs. Further, we use two biomarker studies to show how PGCA enables the discovery of relevant candidate protein group markers with similar but non-identical compositions in different runs. PMID:28562641
Functional equivalency inferred from "authoritative sources" in networks of homologous proteins.
Natarajan, Shreedhar; Jakobsson, Eric
2009-06-12
A one-on-one mapping of protein functionality across different species is a critical component of comparative analysis. This paper presents a heuristic algorithm for discovering the Most Likely Functional Counterparts (MoLFunCs) of a protein, based on simple concepts from network theory. A key feature of our algorithm is utilization of the user's knowledge to assign high confidence to selected functional identification. We show use of the algorithm to retrieve functional equivalents for 7 membrane proteins, from an exploration of almost 40 genomes form multiple online resources. We verify the functional equivalency of our dataset through a series of tests that include sequence, structure and function comparisons. Comparison is made to the OMA methodology, which also identifies one-on-one mapping between proteins from different species. Based on that comparison, we believe that incorporation of user's knowledge as a key aspect of the technique adds value to purely statistical formal methods.
Functional Equivalency Inferred from “Authoritative Sources” in Networks of Homologous Proteins
Natarajan, Shreedhar; Jakobsson, Eric
2009-01-01
A one-on-one mapping of protein functionality across different species is a critical component of comparative analysis. This paper presents a heuristic algorithm for discovering the Most Likely Functional Counterparts (MoLFunCs) of a protein, based on simple concepts from network theory. A key feature of our algorithm is utilization of the user's knowledge to assign high confidence to selected functional identification. We show use of the algorithm to retrieve functional equivalents for 7 membrane proteins, from an exploration of almost 40 genomes form multiple online resources. We verify the functional equivalency of our dataset through a series of tests that include sequence, structure and function comparisons. Comparison is made to the OMA methodology, which also identifies one-on-one mapping between proteins from different species. Based on that comparison, we believe that incorporation of user's knowledge as a key aspect of the technique adds value to purely statistical formal methods. PMID:19521530
Cardone, Antonio; Pant, Harish; Hassan, Sergio A.
2013-01-01
Weak and ultra-weak protein-protein association play a role in molecular recognition, and can drive spontaneous self-assembly and aggregation. Such interactions are difficult to detect experimentally, and are a challenge to the force field and sampling technique. A method is proposed to identify low-population protein-protein binding modes in aqueous solution. The method is designed to identify preferential first-encounter complexes from which the final complex(es) at equilibrium evolves. A continuum model is used to represent the effects of the solvent, which accounts for short- and long-range effects of water exclusion and for liquid-structure forces at protein/liquid interfaces. These effects control the behavior of proteins in close proximity and are optimized based on binding enthalpy data and simulations. An algorithm is described to construct a biasing function for self-adaptive configurational-bias Monte Carlo of a set of interacting proteins. The function allows mixing large and local changes in the spatial distribution of proteins, thereby enhancing sampling of relevant microstates. The method is applied to three binary systems. Generalization to multiprotein complexes is discussed. PMID:24044772
Algorithm to find distant repeats in a single protein sequence
Banerjee, Nirjhar; Sarani, Rangarajan; Ranjani, Chellamuthu Vasuki; Sowmiya, Govindaraj; Michael, Daliah; Balakrishnan, Narayanasamy; Sekar, Kanagaraj
2008-01-01
Distant repeats in protein sequence play an important role in various aspects of protein analysis. A keen analysis of the distant repeats would enable to establish a firm relation of the repeats with respect to their function and three-dimensional structure during the evolutionary process. Further, it enlightens the diversity of duplication during the evolution. To this end, an algorithm has been developed to find all distant repeats in a protein sequence. The scores from Point Accepted Mutation (PAM) matrix has been deployed for the identification of amino acid substitutions while detecting the distant repeats. Due to the biological importance of distant repeats, the proposed algorithm will be of importance to structural biologists, molecular biologists, biochemists and researchers involved in phylogenetic and evolutionary studies. PMID:19052663
Moghadasi, Mohammad; Kozakov, Dima; Mamonov, Artem B.; Vakili, Pirooz; Vajda, Sandor; Paschalidis, Ioannis Ch.
2013-01-01
We introduce a message-passing algorithm to solve the Side Chain Positioning (SCP) problem. SCP is a crucial component of protein docking refinement, which is a key step of an important class of problems in computational structural biology called protein docking. We model SCP as a combinatorial optimization problem and formulate it as a Maximum Weighted Independent Set (MWIS) problem. We then employ a modified and convergent belief-propagation algorithm to solve a relaxation of MWIS and develop randomized estimation heuristics that use the relaxed solution to obtain an effective MWIS feasible solution. Using a benchmark set of protein complexes we demonstrate that our approach leads to more accurate docking predictions compared to a baseline algorithm that does not solve the SCP. PMID:23515575
Heterodimer Binding Scaffolds Recognition via the Analysis of Kinetically Hot Residues
Perišić, Ognjen
2018-01-01
Physical interactions between proteins are often difficult to decipher. The aim of this paper is to present an algorithm that is designed to recognize binding patches and supporting structural scaffolds of interacting heterodimer proteins using the Gaussian Network Model (GNM). The recognition is based on the (self) adjustable identification of kinetically hot residues and their connection to possible binding scaffolds. The kinetically hot residues are residues with the lowest entropy, i.e., the highest contribution to the weighted sum of the fastest modes per chain extracted via GNM. The algorithm adjusts the number of fast modes in the GNM’s weighted sum calculation using the ratio of predicted and expected numbers of target residues (contact and the neighboring first-layer residues). This approach produces very good results when applied to dimers with high protein sequence length ratios. The protocol’s ability to recognize near native decoys was compared to the ability of the residue-level statistical potential of Lu and Skolnick using the Sternberg and Vakser decoy dimers sets. The statistical potential produced better overall results, but in a number of cases its predicting ability was comparable, or even inferior, to the prediction ability of the adjustable GNM approach. The results presented in this paper suggest that in heterodimers at least one protein has interacting scaffold determined by the immovable, kinetically hot residues. In many cases, interacting proteins (especially if being of noticeably different sizes) either behave as a rigid lock and key or, presumably, exhibit the opposite dynamic behavior. While the binding surface of one protein is rigid and stable, its partner’s interacting scaffold is more flexible and adaptable. PMID:29547506
Stavrakas, Vassilis; Melas, Ioannis N; Sakellaropoulos, Theodore; Alexopoulos, Leonidas G
2015-01-01
Modeling of signal transduction pathways is instrumental for understanding cells' function. People have been tackling modeling of signaling pathways in order to accurately represent the signaling events inside cells' biochemical microenvironment in a way meaningful for scientists in a biological field. In this article, we propose a method to interrogate such pathways in order to produce cell-specific signaling models. We integrate available prior knowledge of protein connectivity, in a form of a Prior Knowledge Network (PKN) with phosphoproteomic data to construct predictive models of the protein connectivity of the interrogated cell type. Several computational methodologies focusing on pathways' logic modeling using optimization formulations or machine learning algorithms have been published on this front over the past few years. Here, we introduce a light and fast approach that uses a breadth-first traversal of the graph to identify the shortest pathways and score proteins in the PKN, fitting the dependencies extracted from the experimental design. The pathways are then combined through a heuristic formulation to produce a final topology handling inconsistencies between the PKN and the experimental scenarios. Our results show that the algorithm we developed is efficient and accurate for the construction of medium and large scale signaling networks. We demonstrate the applicability of the proposed approach by interrogating a manually curated interaction graph model of EGF/TNFA stimulation against made up experimental data. To avoid the possibility of erroneous predictions, we performed a cross-validation analysis. Finally, we validate that the introduced approach generates predictive topologies, comparable to the ILP formulation. Overall, an efficient approach based on graph theory is presented herein to interrogate protein-protein interaction networks and to provide meaningful biological insights.
NASA Technical Reports Server (NTRS)
Sidik, S. M.
1973-01-01
An algorithm and computer program are presented for generating all the distinct 2(p-q) fractional factorial designs. Some applications of this algorithm to the construction of tables of designs and of designs for nonstandard situations and its use in Bayesian design are discussed. An appendix includes a discussion of an actual experiment whose design was facilitated by the algorithm.
Current algorithmic solutions for peptide-based proteomics data generation and identification.
Hoopmann, Michael R; Moritz, Robert L
2013-02-01
Peptide-based proteomic data sets are ever increasing in size and complexity. These data sets provide computational challenges when attempting to quickly analyze spectra and obtain correct protein identifications. Database search and de novo algorithms must consider high-resolution MS/MS spectra and alternative fragmentation methods. Protein inference is a tricky problem when analyzing large data sets of degenerate peptide identifications. Combining multiple algorithms for improved peptide identification puts significant strain on computational systems when investigating large data sets. This review highlights some of the recent developments in peptide and protein identification algorithms for analyzing shotgun mass spectrometry data when encountering the aforementioned hurdles. Also explored are the roles that analytical pipelines, public spectral libraries, and cloud computing play in the evolution of peptide-based proteomics. Copyright © 2012 Elsevier Ltd. All rights reserved.
Al Nasr, Kamal; Ranjan, Desh; Zubair, Mohammad; Chen, Lin; He, Jing
2014-01-01
Electron cryomicroscopy is becoming a major experimental technique in solving the structures of large molecular assemblies. More and more three-dimensional images have been obtained at the medium resolutions between 5 and 10 Å. At this resolution range, major α-helices can be detected as cylindrical sticks and β-sheets can be detected as plain-like regions. A critical question in de novo modeling from cryo-EM images is to determine the match between the detected secondary structures from the image and those on the protein sequence. We formulate this matching problem into a constrained graph problem and present an O(Δ(2)N(2)2(N)) algorithm to this NP-Hard problem. The algorithm incorporates the dynamic programming approach into a constrained K-shortest path algorithm. Our method, DP-TOSS, has been tested using α-proteins with maximum 33 helices and α-β proteins up to five helices and 12 β-strands. The correct match was ranked within the top 35 for 19 of the 20 α-proteins and all nine α-β proteins tested. The results demonstrate that DP-TOSS improves accuracy, time and memory space in deriving the topologies of the secondary structure elements for proteins with a large number of secondary structures and a complex skeleton.
Li, Hongdong; Zhang, Yang; Guan, Yuanfang; Menon, Rajasree; Omenn, Gilbert S
2017-01-01
Tens of thousands of splice isoforms of proteins have been catalogued as predicted sequences from transcripts in humans and other species. Relatively few have been characterized biochemically or structurally. With the extensive development of protein bioinformatics, the characterization and modeling of isoform features, isoform functions, and isoform-level networks have advanced notably. Here we present applications of the I-TASSER family of algorithms for folding and functional predictions and the IsoFunc, MIsoMine, and Hisonet data resources for isoform-level analyses of network and pathway-based functional predictions and protein-protein interactions. Hopefully, predictions and insights from protein bioinformatics will stimulate many experimental validation studies.
Predicting "Hot" and "Warm" Spots for Fragment Binding.
Rathi, Prakash Chandra; Ludlow, R Frederick; Hall, Richard J; Murray, Christopher W; Mortenson, Paul N; Verdonk, Marcel L
2017-05-11
Computational fragment mapping methods aim to predict hotspots on protein surfaces where small fragments will bind. Such methods are popular for druggability assessment as well as structure-based design. However, to date researchers developing or using such tools have had no clear way of assessing the performance of these methods. Here, we introduce the first diverse, high quality validation set for computational fragment mapping. The set contains 52 diverse examples of fragment binding "hot" and "warm" spots from the Protein Data Bank (PDB). Additionally, we describe PLImap, a novel protocol for fragment mapping based on the Protein-Ligand Informatics force field (PLIff). We evaluate PLImap against the new fragment mapping test set, and compare its performance to that of simple shape-based algorithms and fragment docking using GOLD. PLImap is made publicly available from https://bitbucket.org/AstexUK/pli .
Wang, Rui-Sheng; Loscalzo, Joseph
2018-05-20
Understanding the genetic basis of complex diseases is challenging. Prior work shows that disease-related proteins do not typically function in isolation. Rather, they often interact with each other to form a network module that underlies dysfunctional mechanistic pathways. Identifying such disease modules will provide insights into a systems-level understanding of molecular mechanisms of diseases. Owing to the incompleteness of our knowledge of disease proteins and limited information on the biological mediators of pathobiological processes, the key proteins (seed proteins) for many diseases appear scattered over the human protein-protein interactome and form a few small branches, rather than coherent network modules. In this paper, we develop a network-based algorithm, called the Seed Connector algorithm (SCA), to pinpoint disease modules by adding as few additional linking proteins (seed connectors) to the seed protein pool as possible. Such seed connectors are hidden disease module elements that are critical for interpreting the functional context of disease proteins. The SCA aims to connect seed disease proteins so that disease mechanisms and pathways can be decoded based on predicted coherent network modules. We validate the algorithm using a large corpus of 70 complex diseases and binding targets of over 200 drugs, and demonstrate the biological relevance of the seed connectors. Lastly, as a specific proof of concept, we apply the SCA to a set of seed proteins for coronary artery disease derived from a meta-analysis of large-scale genome-wide association studies and obtain a coronary artery disease module enriched with important disease-related signaling pathways and drug targets not previously recognized. Copyright © 2018 Elsevier Ltd. All rights reserved.
Automatic classification of protein structures using physicochemical parameters.
Mohan, Abhilash; Rao, M Divya; Sunderrajan, Shruthi; Pennathur, Gautam
2014-09-01
Protein classification is the first step to functional annotation; SCOP and Pfam databases are currently the most relevant protein classification schemes. However, the disproportion in the number of three dimensional (3D) protein structures generated versus their classification into relevant superfamilies/families emphasizes the need for automated classification schemes. Predicting function of novel proteins based on sequence information alone has proven to be a major challenge. The present study focuses on the use of physicochemical parameters in conjunction with machine learning algorithms (Naive Bayes, Decision Trees, Random Forest and Support Vector Machines) to classify proteins into their respective SCOP superfamily/Pfam family, using sequence derived information. Spectrophores™, a 1D descriptor of the 3D molecular field surrounding a structure was used as a benchmark to compare the performance of the physicochemical parameters. The machine learning algorithms were modified to select features based on information gain for each SCOP superfamily/Pfam family. The effect of combining physicochemical parameters and spectrophores on classification accuracy (CA) was studied. Machine learning algorithms trained with the physicochemical parameters consistently classified SCOP superfamilies and Pfam families with a classification accuracy above 90%, while spectrophores performed with a CA of around 85%. Feature selection improved classification accuracy for both physicochemical parameters and spectrophores based machine learning algorithms. Combining both attributes resulted in a marginal loss of performance. Physicochemical parameters were able to classify proteins from both schemes with classification accuracy ranging from 90-96%. These results suggest the usefulness of this method in classifying proteins from amino acid sequences.
Design and implementation of a hybrid MPI-CUDA model for the Smith-Waterman algorithm.
Khaled, Heba; Faheem, Hossam El Deen Mostafa; El Gohary, Rania
2015-01-01
This paper provides a novel hybrid model for solving the multiple pair-wise sequence alignment problem combining message passing interface and CUDA, the parallel computing platform and programming model invented by NVIDIA. The proposed model targets homogeneous cluster nodes equipped with similar Graphical Processing Unit (GPU) cards. The model consists of the Master Node Dispatcher (MND) and the Worker GPU Nodes (WGN). The MND distributes the workload among the cluster working nodes and then aggregates the results. The WGN performs the multiple pair-wise sequence alignments using the Smith-Waterman algorithm. We also propose a modified implementation to the Smith-Waterman algorithm based on computing the alignment matrices row-wise. The experimental results demonstrate a considerable reduction in the running time by increasing the number of the working GPU nodes. The proposed model achieved a performance of about 12 Giga cell updates per second when we tested against the SWISS-PROT protein knowledge base running on four nodes.
Algorithmic Mechanism Design of Evolutionary Computation.
Pei, Yan
2015-01-01
We consider algorithmic design, enhancement, and improvement of evolutionary computation as a mechanism design problem. All individuals or several groups of individuals can be considered as self-interested agents. The individuals in evolutionary computation can manipulate parameter settings and operations by satisfying their own preferences, which are defined by an evolutionary computation algorithm designer, rather than by following a fixed algorithm rule. Evolutionary computation algorithm designers or self-adaptive methods should construct proper rules and mechanisms for all agents (individuals) to conduct their evolution behaviour correctly in order to definitely achieve the desired and preset objective(s). As a case study, we propose a formal framework on parameter setting, strategy selection, and algorithmic design of evolutionary computation by considering the Nash strategy equilibrium of a mechanism design in the search process. The evaluation results present the efficiency of the framework. This primary principle can be implemented in any evolutionary computation algorithm that needs to consider strategy selection issues in its optimization process. The final objective of our work is to solve evolutionary computation design as an algorithmic mechanism design problem and establish its fundamental aspect by taking this perspective. This paper is the first step towards achieving this objective by implementing a strategy equilibrium solution (such as Nash equilibrium) in evolutionary computation algorithm.
Algorithmic Mechanism Design of Evolutionary Computation
2015-01-01
We consider algorithmic design, enhancement, and improvement of evolutionary computation as a mechanism design problem. All individuals or several groups of individuals can be considered as self-interested agents. The individuals in evolutionary computation can manipulate parameter settings and operations by satisfying their own preferences, which are defined by an evolutionary computation algorithm designer, rather than by following a fixed algorithm rule. Evolutionary computation algorithm designers or self-adaptive methods should construct proper rules and mechanisms for all agents (individuals) to conduct their evolution behaviour correctly in order to definitely achieve the desired and preset objective(s). As a case study, we propose a formal framework on parameter setting, strategy selection, and algorithmic design of evolutionary computation by considering the Nash strategy equilibrium of a mechanism design in the search process. The evaluation results present the efficiency of the framework. This primary principle can be implemented in any evolutionary computation algorithm that needs to consider strategy selection issues in its optimization process. The final objective of our work is to solve evolutionary computation design as an algorithmic mechanism design problem and establish its fundamental aspect by taking this perspective. This paper is the first step towards achieving this objective by implementing a strategy equilibrium solution (such as Nash equilibrium) in evolutionary computation algorithm. PMID:26257777
NASA Astrophysics Data System (ADS)
Cheng, Jun-Hu; Jin, Huali; Liu, Zhiwei
2018-01-01
The feasibility of developing a multispectral imaging method using important wavelengths from hyperspectral images selected by genetic algorithm (GA), successive projection algorithm (SPA) and regression coefficient (RC) methods for modeling and predicting protein content in peanut kernel was investigated for the first time. Partial least squares regression (PLSR) calibration model was established between the spectral data from the selected optimal wavelengths and the reference measured protein content ranged from 23.46% to 28.43%. The RC-PLSR model established using eight key wavelengths (1153, 1567, 1972, 2143, 2288, 2339, 2389 and 2446 nm) showed the best predictive results with the coefficient of determination of prediction (R2P) of 0.901, and root mean square error of prediction (RMSEP) of 0.108 and residual predictive deviation (RPD) of 2.32. Based on the obtained best model and image processing algorithms, the distribution maps of protein content were generated. The overall results of this study indicated that developing a rapid and online multispectral imaging system using the feature wavelengths and PLSR analysis is potential and feasible for determination of the protein content in peanut kernels.
Structure-based approach to the prediction of disulfide bonds in proteins.
Salam, Noeris K; Adzhigirey, Matvey; Sherman, Woody; Pearlman, David A
2014-10-01
Protein engineering remains an area of growing importance in pharmaceutical and biotechnology research. Stabilization of a folded protein conformation is a frequent goal in projects that deal with affinity optimization, enzyme design, protein construct design, and reducing the size of functional proteins. Indeed, it can be desirable to assess and improve protein stability in order to avoid liabilities such as aggregation, degradation, and immunogenic response that may arise during development. One way to stabilize a protein is through the introduction of disulfide bonds. Here, we describe a method to predict pairs of protein residues that can be mutated to form a disulfide bond. We combine a physics-based approach that incorporates implicit solvent molecular mechanics with a knowledge-based approach. We first assign relative weights to the terms that comprise our scoring function using a genetic algorithm applied to a set of 75 wild-type structures that each contains a disulfide bond. The method is then tested on a separate set of 13 engineered proteins comprising 15 artificial stabilizing disulfides introduced via site-directed mutagenesis. We find that the native disulfide in the wild-type proteins is scored well, on average (within the top 6% of the reasonable pairs of residues that could form a disulfide bond) while 6 out of the 15 artificial stabilizing disulfides scored within the top 13% of ranked predictions. Overall, this suggests that the physics-based approach presented here can be useful for triaging possible pairs of mutations for disulfide bond formation to improve protein stability. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Spyrakis, Francesca; Benedetti, Paolo; Decherchi, Sergio; Rocchia, Walter; Cavalli, Andrea; Alcaro, Stefano; Ortuso, Francesco; Baroni, Massimo; Cruciani, Gabriele
2015-10-26
The importance of taking into account protein flexibility in drug design and virtual ligand screening (VS) has been widely debated in the literature, and molecular dynamics (MD) has been recognized as one of the most powerful tools for investigating intrinsic protein dynamics. Nevertheless, deciphering the amount of information hidden in MD simulations and recognizing a significant minimal set of states to be used in virtual screening experiments can be quite complicated. Here we present an integrated MD-FLAP (molecular dynamics-fingerprints for ligand and proteins) approach, comprising a pipeline of molecular dynamics, clustering and linear discriminant analysis, for enhancing accuracy and efficacy in VS campaigns. We first extracted a limited number of representative structures from tens of nanoseconds of MD trajectories by means of the k-medoids clustering algorithm as implemented in the BiKi Life Science Suite ( http://www.bikitech.com [accessed July 21, 2015]). Then, instead of applying arbitrary selection criteria, that is, RMSD, pharmacophore properties, or enrichment performances, we allowed the linear discriminant analysis algorithm implemented in FLAP ( http://www.moldiscovery.com [accessed July 21, 2015]) to automatically choose the best performing conformational states among medoids and X-ray structures. Retrospective virtual screenings confirmed that ensemble receptor protocols outperform single rigid receptor approaches, proved that computationally generated conformations comprise the same quantity/quality of information included in X-ray structures, and pointed to the MD-FLAP approach as a valuable tool for improving VS performances.
Folding a Protein with Equal Probability of Being Helix or Hairpin
Lin, Chun-Yu; Chen, Nan-Yow; Mou, Chung Yu
2012-01-01
We explore the possibility for the native structure of a protein being inherently multiconformational in an ab initio coarse-grained model. Based on the Wang-Landau algorithm, the complete free energy landscape for the designed sequence 2DX4: INYWLAHAKAGYIVHWTA is constructed. It is shown that 2DX4 possesses two nearly degenerate native structures: one is a helix structure with the other a hairpinlike structure, and their free energy difference is <2% of that of local minima. Two degenerate native structures are stabilized by an energy barrier of ∼10 kcal/mol. Furthermore, the hydrogen-bond and dipole-dipole interactions are found to be two major competing interactions in transforming one conformation into the other. Our results indicate that two degenerate native structures are stabilized by subtle balance between different interactions in proteins. In particular, for small proteins, balance between the hydrogen-bond and dipole-dipole interactions happens for proteins of sizes being ∼18 amino acids and is shown to the main driving mechanism for the occurrence of degeneracy. These results provide important clues to the study of native structures of proteins. PMID:22828336
Visual management of large scale data mining projects.
Shah, I; Hunter, L
2000-01-01
This paper describes a unified framework for visualizing the preparations for, and results of, hundreds of machine learning experiments. These experiments were designed to improve the accuracy of enzyme functional predictions from sequence, and in many cases were successful. Our system provides graphical user interfaces for defining and exploring training datasets and various representational alternatives, for inspecting the hypotheses induced by various types of learning algorithms, for visualizing the global results, and for inspecting in detail results for specific training sets (functions) and examples (proteins). The visualization tools serve as a navigational aid through a large amount of sequence data and induced knowledge. They provided significant help in understanding both the significance and the underlying biological explanations of our successes and failures. Using these visualizations it was possible to efficiently identify weaknesses of the modular sequence representations and induction algorithms which suggest better learning strategies. The context in which our data mining visualization toolkit was developed was the problem of accurately predicting enzyme function from protein sequence data. Previous work demonstrated that approximately 6% of enzyme protein sequences are likely to be assigned incorrect functions on the basis of sequence similarity alone. In order to test the hypothesis that more detailed sequence analysis using machine learning techniques and modular domain representations could address many of these failures, we designed a series of more than 250 experiments using information-theoretic decision tree induction and naive Bayesian learning on local sequence domain representations of problematic enzyme function classes. In more than half of these cases, our methods were able to perfectly discriminate among various possible functions of similar sequences. We developed and tested our visualization techniques on this application.
The contour-buildup algorithm to calculate the analytical molecular surface.
Totrov, M; Abagyan, R
1996-01-01
A new algorithm is presented to calculate the analytical molecular surface defined as a smooth envelope traced out by the surface of a probe sphere rolled over the molecule. The core of the algorithm is the sequential build up of multi-arc contours on the van der Waals spheres. This algorithm yields substantial reduction in both memory and time requirements of surface calculations. Further, the contour-buildup principle is intrinsically "local", which makes calculations of the partial molecular surfaces even more efficient. Additionally, the algorithm is equally applicable not only to convex patches, but also to concave triangular patches which may have complex multiple intersections. The algorithm permits the rigorous calculation of the full analytical molecular surface for a 100-residue protein in about 2 seconds on an SGI indigo with R4400++ processor at 150 Mhz, with the performance scaling almost linearly with the protein size. The contour-buildup algorithm is faster than the original Connolly algorithm an order of magnitude.
Clustering PPI data by combining FA and SHC method.
Lei, Xiujuan; Ying, Chao; Wu, Fang-Xiang; Xu, Jin
2015-01-01
Clustering is one of main methods to identify functional modules from protein-protein interaction (PPI) data. Nevertheless traditional clustering methods may not be effective for clustering PPI data. In this paper, we proposed a novel method for clustering PPI data by combining firefly algorithm (FA) and synchronization-based hierarchical clustering (SHC) algorithm. Firstly, the PPI data are preprocessed via spectral clustering (SC) which transforms the high-dimensional similarity matrix into a low dimension matrix. Then the SHC algorithm is used to perform clustering. In SHC algorithm, hierarchical clustering is achieved by enlarging the neighborhood radius of synchronized objects continuously, while the hierarchical search is very difficult to find the optimal neighborhood radius of synchronization and the efficiency is not high. So we adopt the firefly algorithm to determine the optimal threshold of the neighborhood radius of synchronization automatically. The proposed algorithm is tested on the MIPS PPI dataset. The results show that our proposed algorithm is better than the traditional algorithms in precision, recall and f-measure value.
Clustering PPI data by combining FA and SHC method
2015-01-01
Clustering is one of main methods to identify functional modules from protein-protein interaction (PPI) data. Nevertheless traditional clustering methods may not be effective for clustering PPI data. In this paper, we proposed a novel method for clustering PPI data by combining firefly algorithm (FA) and synchronization-based hierarchical clustering (SHC) algorithm. Firstly, the PPI data are preprocessed via spectral clustering (SC) which transforms the high-dimensional similarity matrix into a low dimension matrix. Then the SHC algorithm is used to perform clustering. In SHC algorithm, hierarchical clustering is achieved by enlarging the neighborhood radius of synchronized objects continuously, while the hierarchical search is very difficult to find the optimal neighborhood radius of synchronization and the efficiency is not high. So we adopt the firefly algorithm to determine the optimal threshold of the neighborhood radius of synchronization automatically. The proposed algorithm is tested on the MIPS PPI dataset. The results show that our proposed algorithm is better than the traditional algorithms in precision, recall and f-measure value. PMID:25707632
Parsimonious Charge Deconvolution for Native Mass Spectrometry
2018-01-01
Charge deconvolution infers the mass from mass over charge (m/z) measurements in electrospray ionization mass spectra. When applied over a wide input m/z or broad target mass range, charge-deconvolution algorithms can produce artifacts, such as false masses at one-half or one-third of the correct mass. Indeed, a maximum entropy term in the objective function of MaxEnt, the most commonly used charge deconvolution algorithm, favors a deconvolved spectrum with many peaks over one with fewer peaks. Here we describe a new “parsimonious” charge deconvolution algorithm that produces fewer artifacts. The algorithm is especially well-suited to high-resolution native mass spectrometry of intact glycoproteins and protein complexes. Deconvolution of native mass spectra poses special challenges due to salt and small molecule adducts, multimers, wide mass ranges, and fewer and lower charge states. We demonstrate the performance of the new deconvolution algorithm on a range of samples. On the heavily glycosylated plasma properdin glycoprotein, the new algorithm could deconvolve monomer and dimer simultaneously and, when focused on the m/z range of the monomer, gave accurate and interpretable masses for glycoforms that had previously been analyzed manually using m/z peaks rather than deconvolved masses. On therapeutic antibodies, the new algorithm facilitated the analysis of extensions, truncations, and Fab glycosylation. The algorithm facilitates the use of native mass spectrometry for the qualitative and quantitative analysis of protein and protein assemblies. PMID:29376659
Cankorur-Cetinkaya, Ayca; Dias, Joao M. L.; Kludas, Jana; Slater, Nigel K. H.; Rousu, Juho; Dikicioglu, Duygu
2017-01-01
Multiple interacting factors affect the performance of engineered biological systems in synthetic biology projects. The complexity of these biological systems means that experimental design should often be treated as a multiparametric optimization problem. However, the available methodologies are either impractical, due to a combinatorial explosion in the number of experiments to be performed, or are inaccessible to most experimentalists due to the lack of publicly available, user-friendly software. Although evolutionary algorithms may be employed as alternative approaches to optimize experimental design, the lack of simple-to-use software again restricts their use to specialist practitioners. In addition, the lack of subsidiary approaches to further investigate critical factors and their interactions prevents the full analysis and exploitation of the biotechnological system. We have addressed these problems and, here, provide a simple‐to‐use and freely available graphical user interface to empower a broad range of experimental biologists to employ complex evolutionary algorithms to optimize their experimental designs. Our approach exploits a Genetic Algorithm to discover the subspace containing the optimal combination of parameters, and Symbolic Regression to construct a model to evaluate the sensitivity of the experiment to each parameter under investigation. We demonstrate the utility of this method using an example in which the culture conditions for the microbial production of a bioactive human protein are optimized. CamOptimus is available through: (https://doi.org/10.17863/CAM.10257). PMID:28635591
Peng, Wei; Wang, Jianxin; Cheng, Yingjiao; Lu, Yu; Wu, Fangxiang; Pan, Yi
2015-01-01
Prediction of essential proteins which are crucial to an organism's survival is important for disease analysis and drug design, as well as the understanding of cellular life. The majority of prediction methods infer the possibility of proteins to be essential by using the network topology. However, these methods are limited to the completeness of available protein-protein interaction (PPI) data and depend on the network accuracy. To overcome these limitations, some computational methods have been proposed. However, seldom of them solve this problem by taking consideration of protein domains. In this work, we first analyze the correlation between the essentiality of proteins and their domain features based on data of 13 species. We find that the proteins containing more protein domain types which rarely occur in other proteins tend to be essential. Accordingly, we propose a new prediction method, named UDoNC, by combining the domain features of proteins with their topological properties in PPI network. In UDoNC, the essentiality of proteins is decided by the number and the frequency of their protein domain types, as well as the essentiality of their adjacent edges measured by edge clustering coefficient. The experimental results on S. cerevisiae data show that UDoNC outperforms other existing methods in terms of area under the curve (AUC). Additionally, UDoNC can also perform well in predicting essential proteins on data of E. coli.
Prediction of essential proteins based on gene expression programming.
Zhong, Jiancheng; Wang, Jianxin; Peng, Wei; Zhang, Zhen; Pan, Yi
2013-01-01
Essential proteins are indispensable for cell survive. Identifying essential proteins is very important for improving our understanding the way of a cell working. There are various types of features related to the essentiality of proteins. Many methods have been proposed to combine some of them to predict essential proteins. However, it is still a big challenge for designing an effective method to predict them by integrating different features, and explaining how these selected features decide the essentiality of protein. Gene expression programming (GEP) is a learning algorithm and what it learns specifically is about relationships between variables in sets of data and then builds models to explain these relationships. In this work, we propose a GEP-based method to predict essential protein by combing some biological features and topological features. We carry out experiments on S. cerevisiae data. The experimental results show that the our method achieves better prediction performance than those methods using individual features. Moreover, our method outperforms some machine learning methods and performs as well as a method which is obtained by combining the outputs of eight machine learning methods. The accuracy of predicting essential proteins can been improved by using GEP method to combine some topological features and biological features.
The Correlation Fractal Dimension of Complex Networks
NASA Astrophysics Data System (ADS)
Wang, Xingyuan; Liu, Zhenzhen; Wang, Mogei
2013-05-01
The fractality of complex networks is studied by estimating the correlation dimensions of the networks. Comparing with the previous algorithms of estimating the box dimension, our algorithm achieves a significant reduction in time complexity. For four benchmark cases tested, that is, the Escherichia coli (E. Coli) metabolic network, the Homo sapiens protein interaction network (H. Sapiens PIN), the Saccharomyces cerevisiae protein interaction network (S. Cerevisiae PIN) and the World Wide Web (WWW), experiments are provided to demonstrate the validity of our algorithm.
ICPD-a new peak detection algorithm for LC/MS.
Zhang, Jianqiu; Haskins, William
2010-12-01
The identification and quantification of proteins using label-free Liquid Chromatography/Mass Spectrometry (LC/MS) play crucial roles in biological and biomedical research. Increasing evidence has shown that biomarkers are often low abundance proteins. However, LC/MS systems are subject to considerable noise and sample variability, whose statistical characteristics are still elusive, making computational identification of low abundance proteins extremely challenging. As a result, the inability of identifying low abundance proteins in a proteomic study is the main bottleneck in protein biomarker discovery. In this paper, we propose a new peak detection method called Information Combining Peak Detection (ICPD ) for high resolution LC/MS. In LC/MS, peptides elute during a certain time period and as a result, peptide isotope patterns are registered in multiple MS scans. The key feature of the new algorithm is that the observed isotope patterns registered in multiple scans are combined together for estimating the likelihood of the peptide existence. An isotope pattern matching score based on the likelihood probability is provided and utilized for peak detection. The performance of the new algorithm is evaluated based on protein standards with 48 known proteins. The evaluation shows better peak detection accuracy for low abundance proteins than other LC/MS peak detection methods.
Li, Bai; Lin, Mu; Liu, Qiao; Li, Ya; Zhou, Changjun
2015-10-01
Protein folding is a fundamental topic in molecular biology. Conventional experimental techniques for protein structure identification or protein folding recognition require strict laboratory requirements and heavy operating burdens, which have largely limited their applications. Alternatively, computer-aided techniques have been developed to optimize protein structures or to predict the protein folding process. In this paper, we utilize a 3D off-lattice model to describe the original protein folding scheme as a simplified energy-optimal numerical problem, where all types of amino acid residues are binarized into hydrophobic and hydrophilic ones. We apply a balance-evolution artificial bee colony (BE-ABC) algorithm as the minimization solver, which is featured by the adaptive adjustment of search intensity to cater for the varying needs during the entire optimization process. In this work, we establish a benchmark case set with 13 real protein sequences from the Protein Data Bank database and evaluate the convergence performance of BE-ABC algorithm through strict comparisons with several state-of-the-art ABC variants in short-term numerical experiments. Besides that, our obtained best-so-far protein structures are compared to the ones in comprehensive previous literature. This study also provides preliminary insights into how artificial intelligence techniques can be applied to reveal the dynamics of protein folding. Graphical Abstract Protein folding optimization using 3D off-lattice model and advanced optimization techniques.
Rabow, A. A.; Scheraga, H. A.
1996-01-01
We have devised a Cartesian combination operator and coding scheme for improving the performance of genetic algorithms applied to the protein folding problem. The genetic coding consists of the C alpha Cartesian coordinates of the protein chain. The recombination of the genes of the parents is accomplished by: (1) a rigid superposition of one parent chain on the other, to make the relation of Cartesian coordinates meaningful, then, (2) the chains of the children are formed through a linear combination of the coordinates of their parents. The children produced with this Cartesian combination operator scheme have similar topology and retain the long-range contacts of their parents. The new scheme is significantly more efficient than the standard genetic algorithm methods for locating low-energy conformations of proteins. The considerable superiority of genetic algorithms over Monte Carlo optimization methods is also demonstrated. We have also devised a new dynamic programming lattice fitting procedure for use with the Cartesian combination operator method. The procedure finds excellent fits of real-space chains to the lattice while satisfying bond-length, bond-angle, and overlap constraints. PMID:8880904
Xia, Jiaqi; Peng, Zhenling; Qi, Dawei; Mu, Hongbo; Yang, Jianyi
2017-03-15
Protein fold classification is a critical step in protein structure prediction. There are two possible ways to classify protein folds. One is through template-based fold assignment and the other is ab-initio prediction using machine learning algorithms. Combination of both solutions to improve the prediction accuracy was never explored before. We developed two algorithms, HH-fold and SVM-fold for protein fold classification. HH-fold is a template-based fold assignment algorithm using the HHsearch program. SVM-fold is a support vector machine-based ab-initio classification algorithm, in which a comprehensive set of features are extracted from three complementary sequence profiles. These two algorithms are then combined, resulting to the ensemble approach TA-fold. We performed a comprehensive assessment for the proposed methods by comparing with ab-initio methods and template-based threading methods on six benchmark datasets. An accuracy of 0.799 was achieved by TA-fold on the DD dataset that consists of proteins from 27 folds. This represents improvement of 5.4-11.7% over ab-initio methods. After updating this dataset to include more proteins in the same folds, the accuracy increased to 0.971. In addition, TA-fold achieved >0.9 accuracy on a large dataset consisting of 6451 proteins from 184 folds. Experiments on the LE dataset show that TA-fold consistently outperforms other threading methods at the family, superfamily and fold levels. The success of TA-fold is attributed to the combination of template-based fold assignment and ab-initio classification using features from complementary sequence profiles that contain rich evolution information. http://yanglab.nankai.edu.cn/TA-fold/. yangjy@nankai.edu.cn or mhb-506@163.com. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
Pek, Han Bin; Klement, Maximilian; Ang, Kok Siong; Chung, Bevan Kai-Sheng; Ow, Dave Siak-Wei; Lee, Dong-Yup
2015-01-01
Various isoforms of invertases from prokaryotes, fungi, and higher plants has been expressed in Escherichia coli, and codon optimisation is a widely-adopted strategy for improvement of heterologous enzyme expression. Successful synthetic gene design for recombinant protein expression can be done by matching its translational elongation rate against heterologous host organisms via codon optimization. Amongst the various design parameters considered for the gene synthesis, codon context bias has been relatively overlooked compared to individual codon usage which is commonly adopted in most of codon optimization tools. In addition, matching the rates of transcription and translation based on secondary structure may lead to enhanced protein folding. In this study, we evaluated codon context fitness as design criterion for improving the expression of thermostable invertase from Thermotoga maritima in Escherichia coli and explored the relevance of secondary structure regions for folding and expression. We designed three coding sequences by using (1) a commercial vendor optimized gene algorithm, (2) codon context for the whole gene, and (3) codon context based on the secondary structure regions. Then, the codon optimized sequences were transformed and expressed in E. coli. From the resultant enzyme activities and protein yield data, codon context fitness proved to have the highest activity as compared to the wild-type control and other criteria while secondary structure-based strategy is comparable to the control. Codon context bias was shown to be a relevant parameter for enhancing enzyme production in Escherichia coli by codon optimization. Thus, we can effectively design synthetic genes within heterologous host organisms using this criterion. Copyright © 2015 Elsevier Inc. All rights reserved.
Querying graphs in protein-protein interactions networks using feedback vertex set.
Blin, Guillaume; Sikora, Florian; Vialette, Stéphane
2010-01-01
Recent techniques increase rapidly the amount of our knowledge on interactions between proteins. The interpretation of these new information depends on our ability to retrieve known substructures in the data, the Protein-Protein Interactions (PPIs) networks. In an algorithmic point of view, it is an hard task since it often leads to NP-hard problems. To overcome this difficulty, many authors have provided tools for querying patterns with a restricted topology, i.e., paths or trees in PPI networks. Such restriction leads to the development of fixed parameter tractable (FPT) algorithms, which can be practicable for restricted sizes of queries. Unfortunately, Graph Homomorphism is a W[1]-hard problem, and hence, no FPT algorithm can be found when patterns are in the shape of general graphs. However, Dost et al. gave an algorithm (which is not implemented) to query graphs with a bounded treewidth in PPI networks (the treewidth of the query being involved in the time complexity). In this paper, we propose another algorithm for querying pattern in the shape of graphs, also based on dynamic programming and the color-coding technique. To transform graphs queries into trees without loss of informations, we use feedback vertex set coupled to a node duplication mechanism. Hence, our algorithm is FPT for querying graphs with a bounded size of their feedback vertex set. It gives an alternative to the treewidth parameter, which can be better or worst for a given query. We provide a python implementation which allows us to validate our implementation on real data. Especially, we retrieve some human queries in the shape of graphs into the fly PPI network.
Konc, Janez; Janežič, Dušanka
2017-09-01
ProBiS (Protein Binding Sites) Tools consist of algorithm, database, and web servers for prediction of binding sites and protein ligands based on the detection of structurally similar binding sites in the Protein Data Bank. In this article, we review the operations that ProBiS Tools perform, provide comments on the evolution of the tools, and give some implementation details. We review some of its applications to biologically interesting proteins. ProBiS Tools are freely available at http://probis.cmm.ki.si and http://probis.nih.gov. Copyright © 2017 Elsevier Ltd. All rights reserved.
Shen, Xianjun; Yi, Li; Jiang, Xingpeng; He, Tingting; Yang, Jincai; Xie, Wei; Hu, Po; Hu, Xiaohua
2017-01-01
How to identify protein complex is an important and challenging task in proteomics. It would make great contribution to our knowledge of molecular mechanism in cell life activities. However, the inherent organization and dynamic characteristic of cell system have rarely been incorporated into the existing algorithms for detecting protein complexes because of the limitation of protein-protein interaction (PPI) data produced by high throughput techniques. The availability of time course gene expression profile enables us to uncover the dynamics of molecular networks and improve the detection of protein complexes. In order to achieve this goal, this paper proposes a novel algorithm DCA (Dynamic Core-Attachment). It detects protein-complex core comprising of continually expressed and highly connected proteins in dynamic PPI network, and then the protein complex is formed by including the attachments with high adhesion into the core. The integration of core-attachment feature into the dynamic PPI network is responsible for the superiority of our algorithm. DCA has been applied on two different yeast dynamic PPI networks and the experimental results show that it performs significantly better than the state-of-the-art techniques in terms of prediction accuracy, hF-measure and statistical significance in biology. In addition, the identified complexes with strong biological significance provide potential candidate complexes for biologists to validate.
Ebrahimi, Mansour; Aghagolzadeh, Parisa; Shamabadi, Narges; Tahmasebi, Ahmad; Alsharifi, Mohammed; Adelson, David L; Hemmatzadeh, Farhid; Ebrahimie, Esmaeil
2014-01-01
The evolution of the influenza A virus to increase its host range is a major concern worldwide. Molecular mechanisms of increasing host range are largely unknown. Influenza surface proteins play determining roles in reorganization of host-sialic acid receptors and host range. In an attempt to uncover the physic-chemical attributes which govern HA subtyping, we performed a large scale functional analysis of over 7000 sequences of 16 different HA subtypes. Large number (896) of physic-chemical protein characteristics were calculated for each HA sequence. Then, 10 different attribute weighting algorithms were used to find the key characteristics distinguishing HA subtypes. Furthermore, to discover machine leaning models which can predict HA subtypes, various Decision Tree, Support Vector Machine, Naïve Bayes, and Neural Network models were trained on calculated protein characteristics dataset as well as 10 trimmed datasets generated by attribute weighting algorithms. The prediction accuracies of the machine learning methods were evaluated by 10-fold cross validation. The results highlighted the frequency of Gln (selected by 80% of attribute weighting algorithms), percentage/frequency of Tyr, percentage of Cys, and frequencies of Try and Glu (selected by 70% of attribute weighting algorithms) as the key features that are associated with HA subtyping. Random Forest tree induction algorithm and RBF kernel function of SVM (scaled by grid search) showed high accuracy of 98% in clustering and predicting HA subtypes based on protein attributes. Decision tree models were successful in monitoring the short mutation/reassortment paths by which influenza virus can gain the key protein structure of another HA subtype and increase its host range in a short period of time with less energy consumption. Extracting and mining a large number of amino acid attributes of HA subtypes of influenza A virus through supervised algorithms represent a new avenue for understanding and predicting possible future structure of influenza pandemics.
Ebrahimi, Mansour; Aghagolzadeh, Parisa; Shamabadi, Narges; Tahmasebi, Ahmad; Alsharifi, Mohammed; Adelson, David L.
2014-01-01
The evolution of the influenza A virus to increase its host range is a major concern worldwide. Molecular mechanisms of increasing host range are largely unknown. Influenza surface proteins play determining roles in reorganization of host-sialic acid receptors and host range. In an attempt to uncover the physic-chemical attributes which govern HA subtyping, we performed a large scale functional analysis of over 7000 sequences of 16 different HA subtypes. Large number (896) of physic-chemical protein characteristics were calculated for each HA sequence. Then, 10 different attribute weighting algorithms were used to find the key characteristics distinguishing HA subtypes. Furthermore, to discover machine leaning models which can predict HA subtypes, various Decision Tree, Support Vector Machine, Naïve Bayes, and Neural Network models were trained on calculated protein characteristics dataset as well as 10 trimmed datasets generated by attribute weighting algorithms. The prediction accuracies of the machine learning methods were evaluated by 10-fold cross validation. The results highlighted the frequency of Gln (selected by 80% of attribute weighting algorithms), percentage/frequency of Tyr, percentage of Cys, and frequencies of Try and Glu (selected by 70% of attribute weighting algorithms) as the key features that are associated with HA subtyping. Random Forest tree induction algorithm and RBF kernel function of SVM (scaled by grid search) showed high accuracy of 98% in clustering and predicting HA subtypes based on protein attributes. Decision tree models were successful in monitoring the short mutation/reassortment paths by which influenza virus can gain the key protein structure of another HA subtype and increase its host range in a short period of time with less energy consumption. Extracting and mining a large number of amino acid attributes of HA subtypes of influenza A virus through supervised algorithms represent a new avenue for understanding and predicting possible future structure of influenza pandemics. PMID:24809455
He, Jieyue; Li, Chaojun; Ye, Baoliu; Zhong, Wei
2012-06-25
Most computational algorithms mainly focus on detecting highly connected subgraphs in PPI networks as protein complexes but ignore their inherent organization. Furthermore, many of these algorithms are computationally expensive. However, recent analysis indicates that experimentally detected protein complexes generally contain Core/attachment structures. In this paper, a Greedy Search Method based on Core-Attachment structure (GSM-CA) is proposed. The GSM-CA method detects densely connected regions in large protein-protein interaction networks based on the edge weight and two criteria for determining core nodes and attachment nodes. The GSM-CA method improves the prediction accuracy compared to other similar module detection approaches, however it is computationally expensive. Many module detection approaches are based on the traditional hierarchical methods, which is also computationally inefficient because the hierarchical tree structure produced by these approaches cannot provide adequate information to identify whether a network belongs to a module structure or not. In order to speed up the computational process, the Greedy Search Method based on Fast Clustering (GSM-FC) is proposed in this work. The edge weight based GSM-FC method uses a greedy procedure to traverse all edges just once to separate the network into the suitable set of modules. The proposed methods are applied to the protein interaction network of S. cerevisiae. Experimental results indicate that many significant functional modules are detected, most of which match the known complexes. Results also demonstrate that the GSM-FC algorithm is faster and more accurate as compared to other competing algorithms. Based on the new edge weight definition, the proposed algorithm takes advantages of the greedy search procedure to separate the network into the suitable set of modules. Experimental analysis shows that the identified modules are statistically significant. The algorithm can reduce the computational time significantly while keeping high prediction accuracy.
NASA Astrophysics Data System (ADS)
Miao, Xijiang; Mukhopadhyay, Rishi; Valafar, Homayoun
2008-10-01
Advances in NMR instrumentation and pulse sequence design have resulted in easier acquisition of Residual Dipolar Coupling (RDC) data. However, computational and theoretical analysis of this type of data has continued to challenge the international community of investigators because of their complexity and rich information content. Contemporary use of RDC data has required a-priori assignment, which significantly increases the overall cost of structural analysis. This article introduces a novel algorithm that utilizes unassigned RDC data acquired from multiple alignment media ( nD-RDC, n ⩾ 3) for simultaneous extraction of the relative order tensor matrices and reconstruction of the interacting vectors in space. Estimation of the relative order tensors and reconstruction of the interacting vectors can be invaluable in a number of endeavors. An example application has been presented where the reconstructed vectors have been used to quantify the fitness of a template protein structure to the unknown protein structure. This work has other important direct applications such as verification of the novelty of an unknown protein and validation of the accuracy of an available protein structure model in drug design. More importantly, the presented work has the potential to bridge the gap between experimental and computational methods of structure determination.
SIFTER search: a web server for accurate phylogeny-based protein function prediction
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sahraeian, Sayed M.; Luo, Kevin R.; Brenner, Steven E.
We are awash in proteins discovered through high-throughput sequencing projects. As only a minuscule fraction of these have been experimentally characterized, computational methods are widely used for automated annotation. Here, we introduce a user-friendly web interface for accurate protein function prediction using the SIFTER algorithm. SIFTER is a state-of-the-art sequence-based gene molecular function prediction algorithm that uses a statistical model of function evolution to incorporate annotations throughout the phylogenetic tree. Due to the resources needed by the SIFTER algorithm, running SIFTER locally is not trivial for most users, especially for large-scale problems. The SIFTER web server thus provides access tomore » precomputed predictions on 16 863 537 proteins from 232 403 species. Users can explore SIFTER predictions with queries for proteins, species, functions, and homologs of sequences not in the precomputed prediction set. Lastly, the SIFTER web server is accessible at http://sifter.berkeley.edu/ and the source code can be downloaded.« less
Arana-Daniel, Nancy; Gallegos, Alberto A; López-Franco, Carlos; Alanís, Alma Y; Morales, Jacob; López-Franco, Adriana
2016-01-01
With the increasing power of computers, the amount of data that can be processed in small periods of time has grown exponentially, as has the importance of classifying large-scale data efficiently. Support vector machines have shown good results classifying large amounts of high-dimensional data, such as data generated by protein structure prediction, spam recognition, medical diagnosis, optical character recognition and text classification, etc. Most state of the art approaches for large-scale learning use traditional optimization methods, such as quadratic programming or gradient descent, which makes the use of evolutionary algorithms for training support vector machines an area to be explored. The present paper proposes an approach that is simple to implement based on evolutionary algorithms and Kernel-Adatron for solving large-scale classification problems, focusing on protein structure prediction. The functional properties of proteins depend upon their three-dimensional structures. Knowing the structures of proteins is crucial for biology and can lead to improvements in areas such as medicine, agriculture and biofuels.
SIFTER search: a web server for accurate phylogeny-based protein function prediction
Sahraeian, Sayed M.; Luo, Kevin R.; Brenner, Steven E.
2015-05-15
We are awash in proteins discovered through high-throughput sequencing projects. As only a minuscule fraction of these have been experimentally characterized, computational methods are widely used for automated annotation. Here, we introduce a user-friendly web interface for accurate protein function prediction using the SIFTER algorithm. SIFTER is a state-of-the-art sequence-based gene molecular function prediction algorithm that uses a statistical model of function evolution to incorporate annotations throughout the phylogenetic tree. Due to the resources needed by the SIFTER algorithm, running SIFTER locally is not trivial for most users, especially for large-scale problems. The SIFTER web server thus provides access tomore » precomputed predictions on 16 863 537 proteins from 232 403 species. Users can explore SIFTER predictions with queries for proteins, species, functions, and homologs of sequences not in the precomputed prediction set. Lastly, the SIFTER web server is accessible at http://sifter.berkeley.edu/ and the source code can be downloaded.« less
Identifying technical aliases in SELDI mass spectra of complex mixtures of proteins
2013-01-01
Background Biomarker discovery datasets created using mass spectrum protein profiling of complex mixtures of proteins contain many peaks that represent the same protein with different charge states. Correlated variables such as these can confound the statistical analyses of proteomic data. Previously we developed an algorithm that clustered mass spectrum peaks that were biologically or technically correlated. Here we demonstrate an algorithm that clusters correlated technical aliases only. Results In this paper, we propose a preprocessing algorithm that can be used for grouping technical aliases in mass spectrometry protein profiling data. The stringency of the variance allowed for clustering is customizable, thereby affecting the number of peaks that are clustered. Subsequent analysis of the clusters, instead of individual peaks, helps reduce difficulties associated with technically-correlated data, and can aid more efficient biomarker identification. Conclusions This software can be used to pre-process and thereby decrease the complexity of protein profiling proteomics data, thus simplifying the subsequent analysis of biomarkers by decreasing the number of tests. The software is also a practical tool for identifying which features to investigate further by purification, identification and confirmation. PMID:24010718
Protein structure prediction with local adjust tabu search algorithm
2014-01-01
Background Protein folding structure prediction is one of the most challenging problems in the bioinformatics domain. Because of the complexity of the realistic protein structure, the simplified structure model and the computational method should be adopted in the research. The AB off-lattice model is one of the simplification models, which only considers two classes of amino acids, hydrophobic (A) residues and hydrophilic (B) residues. Results The main work of this paper is to discuss how to optimize the lowest energy configurations in 2D off-lattice model and 3D off-lattice model by using Fibonacci sequences and real protein sequences. In order to avoid falling into local minimum and faster convergence to the global minimum, we introduce a novel method (SATS) to the protein structure problem, which combines simulated annealing algorithm and tabu search algorithm. Various strategies, such as the new encoding strategy, the adaptive neighborhood generation strategy and the local adjustment strategy, are adopted successfully for high-speed searching the optimal conformation corresponds to the lowest energy of the protein sequences. Experimental results show that some of the results obtained by the improved SATS are better than those reported in previous literatures, and we can sure that the lowest energy folding state for short Fibonacci sequences have been found. Conclusions Although the off-lattice models is not very realistic, they can reflect some important characteristics of the realistic protein. It can be found that 3D off-lattice model is more like native folding structure of the realistic protein than 2D off-lattice model. In addition, compared with some previous researches, the proposed hybrid algorithm can more effectively and more quickly search the spatial folding structure of a protein chain. PMID:25474708
Protein-protein docking using region-based 3D Zernike descriptors
2009-01-01
Background Protein-protein interactions are a pivotal component of many biological processes and mediate a variety of functions. Knowing the tertiary structure of a protein complex is therefore essential for understanding the interaction mechanism. However, experimental techniques to solve the structure of the complex are often found to be difficult. To this end, computational protein-protein docking approaches can provide a useful alternative to address this issue. Prediction of docking conformations relies on methods that effectively capture shape features of the participating proteins while giving due consideration to conformational changes that may occur. Results We present a novel protein docking algorithm based on the use of 3D Zernike descriptors as regional features of molecular shape. The key motivation of using these descriptors is their invariance to transformation, in addition to a compact representation of local surface shape characteristics. Docking decoys are generated using geometric hashing, which are then ranked by a scoring function that incorporates a buried surface area and a novel geometric complementarity term based on normals associated with the 3D Zernike shape description. Our docking algorithm was tested on both bound and unbound cases in the ZDOCK benchmark 2.0 dataset. In 74% of the bound docking predictions, our method was able to find a near-native solution (interface C-αRMSD ≤ 2.5 Å) within the top 1000 ranks. For unbound docking, among the 60 complexes for which our algorithm returned at least one hit, 60% of the cases were ranked within the top 2000. Comparison with existing shape-based docking algorithms shows that our method has a better performance than the others in unbound docking while remaining competitive for bound docking cases. Conclusion We show for the first time that the 3D Zernike descriptors are adept in capturing shape complementarity at the protein-protein interface and useful for protein docking prediction. Rigorous benchmark studies show that our docking approach has a superior performance compared to existing methods. PMID:20003235
Protein-protein docking using region-based 3D Zernike descriptors.
Venkatraman, Vishwesh; Yang, Yifeng D; Sael, Lee; Kihara, Daisuke
2009-12-09
Protein-protein interactions are a pivotal component of many biological processes and mediate a variety of functions. Knowing the tertiary structure of a protein complex is therefore essential for understanding the interaction mechanism. However, experimental techniques to solve the structure of the complex are often found to be difficult. To this end, computational protein-protein docking approaches can provide a useful alternative to address this issue. Prediction of docking conformations relies on methods that effectively capture shape features of the participating proteins while giving due consideration to conformational changes that may occur. We present a novel protein docking algorithm based on the use of 3D Zernike descriptors as regional features of molecular shape. The key motivation of using these descriptors is their invariance to transformation, in addition to a compact representation of local surface shape characteristics. Docking decoys are generated using geometric hashing, which are then ranked by a scoring function that incorporates a buried surface area and a novel geometric complementarity term based on normals associated with the 3D Zernike shape description. Our docking algorithm was tested on both bound and unbound cases in the ZDOCK benchmark 2.0 dataset. In 74% of the bound docking predictions, our method was able to find a near-native solution (interface C-alphaRMSD < or = 2.5 A) within the top 1000 ranks. For unbound docking, among the 60 complexes for which our algorithm returned at least one hit, 60% of the cases were ranked within the top 2000. Comparison with existing shape-based docking algorithms shows that our method has a better performance than the others in unbound docking while remaining competitive for bound docking cases. We show for the first time that the 3D Zernike descriptors are adept in capturing shape complementarity at the protein-protein interface and useful for protein docking prediction. Rigorous benchmark studies show that our docking approach has a superior performance compared to existing methods.
Spencer, Jean L; Bhatia, Vivek N; Whelan, Stephen A; Costello, Catherine E; McComb, Mark E
2013-12-01
The identification of protein post-translational modifications (PTMs) is an increasingly important component of proteomics and biomarker discovery, but very few tools exist for performing fast and easy characterization of global PTM changes and differential comparison of PTMs across groups of data obtained from liquid chromatography-tandem mass spectrometry experiments. STRAP PTM (Software Tool for Rapid Annotation of Proteins: Post-Translational Modification edition) is a program that was developed to facilitate the characterization of PTMs using spectral counting and a novel scoring algorithm to accelerate the identification of differential PTMs from complex data sets. The software facilitates multi-sample comparison by collating, scoring, and ranking PTMs and by summarizing data visually. The freely available software (beta release) installs on a PC and processes data in protXML format obtained from files parsed through the Trans-Proteomic Pipeline. The easy-to-use interface allows examination of results at protein, peptide, and PTM levels, and the overall design offers tremendous flexibility that provides proteomics insight beyond simple assignment and counting.
Extended Phase-Space Methods for Enhanced Sampling in Molecular Simulations: A Review.
Fujisaki, Hiroshi; Moritsugu, Kei; Matsunaga, Yasuhiro; Morishita, Tetsuya; Maragliano, Luca
2015-01-01
Molecular Dynamics simulations are a powerful approach to study biomolecular conformational changes or protein-ligand, protein-protein, and protein-DNA/RNA interactions. Straightforward applications, however, are often hampered by incomplete sampling, since in a typical simulated trajectory the system will spend most of its time trapped by high energy barriers in restricted regions of the configuration space. Over the years, several techniques have been designed to overcome this problem and enhance space sampling. Here, we review a class of methods that rely on the idea of extending the set of dynamical variables of the system by adding extra ones associated to functions describing the process under study. In particular, we illustrate the Temperature Accelerated Molecular Dynamics (TAMD), Logarithmic Mean Force Dynamics (LogMFD), and Multiscale Enhanced Sampling (MSES) algorithms. We also discuss combinations with techniques for searching reaction paths. We show the advantages presented by this approach and how it allows to quickly sample important regions of the free-energy landscape via automatic exploration.
Population-based metaheuristic optimization in neutron optics and shielding design
NASA Astrophysics Data System (ADS)
DiJulio, D. D.; Björgvinsdóttir, H.; Zendler, C.; Bentley, P. M.
2016-11-01
Population-based metaheuristic algorithms are powerful tools in the design of neutron scattering instruments and the use of these types of algorithms for this purpose is becoming more and more commonplace. Today there exists a wide range of algorithms to choose from when designing an instrument and it is not always initially clear which may provide the best performance. Furthermore, due to the nature of these types of algorithms, the final solution found for a specific design scenario cannot always be guaranteed to be the global optimum. Therefore, to explore the potential benefits and differences between the varieties of these algorithms available, when applied to such design scenarios, we have carried out a detailed study of some commonly used algorithms. For this purpose, we have developed a new general optimization software package which combines a number of common metaheuristic algorithms within a single user interface and is designed specifically with neutronic calculations in mind. The algorithms included in the software are implementations of Particle-Swarm Optimization (PSO), Differential Evolution (DE), Artificial Bee Colony (ABC), and a Genetic Algorithm (GA). The software has been used to optimize the design of several problems in neutron optics and shielding, coupled with Monte-Carlo simulations, in order to evaluate the performance of the various algorithms. Generally, the performance of the algorithms depended on the specific scenarios, however it was found that DE provided the best average solutions in all scenarios investigated in this work.
HLPI-Ensemble: Prediction of human lncRNA-protein interactions based on ensemble strategy.
Hu, Huan; Zhang, Li; Ai, Haixin; Zhang, Hui; Fan, Yetian; Zhao, Qi; Liu, Hongsheng
2018-03-27
LncRNA plays an important role in many biological and disease progression by binding to related proteins. However, the experimental methods for studying lncRNA-protein interactions are time-consuming and expensive. Although there are a few models designed to predict the interactions of ncRNA-protein, they all have some common drawbacks that limit their predictive performance. In this study, we present a model called HLPI-Ensemble designed specifically for human lncRNA-protein interactions. HLPI-Ensemble adopts the ensemble strategy based on three mainstream machine learning algorithms of Support Vector Machines (SVM), Random Forests (RF) and Extreme Gradient Boosting (XGB) to generate HLPI-SVM Ensemble, HLPI-RF Ensemble and HLPI-XGB Ensemble, respectively. The results of 10-fold cross-validation show that HLPI-SVM Ensemble, HLPI-RF Ensemble and HLPI-XGB Ensemble achieved AUCs of 0.95, 0.96 and 0.96, respectively, in the test dataset. Furthermore, we compared the performance of the HLPI-Ensemble models with the previous models through external validation dataset. The results show that the false positives (FPs) of HLPI-Ensemble models are much lower than that of the previous models, and other evaluation indicators of HLPI-Ensemble models are also higher than those of the previous models. It is further showed that HLPI-Ensemble models are superior in predicting human lncRNA-protein interaction compared with previous models. The HLPI-Ensemble is publicly available at: http://ccsipb.lnu.edu.cn/hlpiensemble/ .
Sahin, Erinc; Jordan, Jacob L; Spatara, Michelle L; Naranjo, Andrea; Costanzo, Joseph A; Weiss, William F; Robinson, Anne Skaja; Fernandez, Erik J; Roberts, Christopher J
2011-02-08
γD crystallin is a natively monomeric eye-lens protein that is associated with hereditary juvenile cataract formation. It is an attractive model system as a multidomain Greek-key protein that aggregates through partially folded intermediates. Point mutations M69Q and S130P were used to test (1) whether the protein-design algorithm RosettaDesign would successfully predict mutants that are resistant to aggregation when combined with informatic sequence-based predictors of peptide aggregation propensity and (2) how the mutations affected relative unfolding free energies (ΔΔG(un)) and intrinsic aggregation propensity (IAP). M69Q was predicted to have ΔΔG(un) ≫ 0, without significantly affecting IAP. S130P was predicted to have ΔΔG(un) ∼ 0 but with reduced IAP. The stability, conformation, and aggregation kinetics in acidic solution were experimentally characterized and compared for the variants and wild-type (WT) protein using circular dichroism and intrinsic fluorescence spectroscopy, calorimetric and chemical unfolding, thioflavin-T binding, chromatography, static laser light scattering, and kinetic modeling. Monomer secondary and tertiary structures of both variants were indistinguishable from WT, while ΔΔG(un) > 0 for M69Q and ΔΔG(un) < 0 for S130P. Surprisingly, despite being the least conformationally stable, S130P was the most resistant to aggregation, indicating a significant decrease of its IAP compared to WT and M69Q.
Specificity determinants for the abscisic acid response element.
Sarkar, Aditya Kumar; Lahiri, Ansuman
2013-01-01
Abscisic acid (ABA) response elements (ABREs) are a group of cis-acting DNA elements that have been identified from promoter analysis of many ABA-regulated genes in plants. We are interested in understanding the mechanism of binding specificity between ABREs and a class of bZIP transcription factors known as ABRE binding factors (ABFs). In this work, we have modeled the homodimeric structure of the bZIP domain of ABRE binding factor 1 from Arabidopsis thaliana (AtABF1) and studied its interaction with ACGT core motif-containing ABRE sequences. We have also examined the variation in the stability of the protein-DNA complex upon mutating ABRE sequences using the protein design algorithm FoldX. The high throughput free energy calculations successfully predicted the ability of ABF1 to bind to alternative core motifs like GCGT or AAGT and also rationalized the role of the flanking sequences in determining the specificity of the protein-DNA interaction.
Automated identification of functional dynamic networks from X-ray crystallography
van den Bedem, Henry; Bhabha, Gira; Yang, Kun; Wright, Peter E.; Fraser, James S.
2013-01-01
Protein function often depends on the exchange between conformational substates. Allosteric ligand binding or distal mutations can stabilize specific active site conformations and consequently alter protein function. In addition to comparing independently determined X-ray crystal structures, alternative conformations observed at low levels of electron density have the potential to provide mechanistic insights into conformational dynamics. Here, we report a new multi-conformer contact network algorithm (CONTACT) that identifies networks of conformationally heterogeneous residues directly from high-resolution X-ray crystallography data. Contact networks in Escherichia coli dihydrofolate reductase (ecDHFR) predict the long-range pattern of NMR chemical shift perturbations of an allosteric mutation. A comparison of contact networks in wild type and mutant ecDHFR suggests how mutations that alter optimized networks of coordinated motions can impair catalytic function. Thus, CONTACT-guided mutagenesis will allow the structure-dynamics-function relationship to be exploited in protein engineering and design. PMID:23913260
Refined Genetic Algorithms for Polypeptide Structure Prediction.
1996-12-01
16 I I I. Algorithm Analysis, Design , and Implemen tation : : : : : : : : : : : : : : : : : : : : : : : : : 18 3.1 Analysis...21 3.2 Algorithm Design and Implemen tation : : : : : : : : : : : : : : : : : : : : : : : : : 22 3.2.1...26 IV. Exp erimen t Design
Parallel algorithms for placement and routing in VLSI design. Ph.D. Thesis
NASA Technical Reports Server (NTRS)
Brouwer, Randall Jay
1991-01-01
The computational requirements for high quality synthesis, analysis, and verification of very large scale integration (VLSI) designs have rapidly increased with the fast growing complexity of these designs. Research in the past has focused on the development of heuristic algorithms, special purpose hardware accelerators, or parallel algorithms for the numerous design tasks to decrease the time required for solution. Two new parallel algorithms are proposed for two VLSI synthesis tasks, standard cell placement and global routing. The first algorithm, a parallel algorithm for global routing, uses hierarchical techniques to decompose the routing problem into independent routing subproblems that are solved in parallel. Results are then presented which compare the routing quality to the results of other published global routers and which evaluate the speedups attained. The second algorithm, a parallel algorithm for cell placement and global routing, hierarchically integrates a quadrisection placement algorithm, a bisection placement algorithm, and the previous global routing algorithm. Unique partitioning techniques are used to decompose the various stages of the algorithm into independent tasks which can be evaluated in parallel. Finally, results are presented which evaluate the various algorithm alternatives and compare the algorithm performance to other placement programs. Measurements are presented on the parallel speedups available.
Pisanti, Nadia; Soldano, Henry; Carpentier, Mathilde; Pothier, Joel
2009-12-01
The geometrical configurations of atoms in protein structures can be viewed as approximate relations among them. Then, finding similar common substructures within a set of protein structures belongs to a new class of problems that generalizes that of finding repeated motifs. The novelty lies in the addition of constraints on the motifs in terms of relations that must hold between pairs of positions of the motifs. We will hence denote them as relational motifs. For this class of problems, we present an algorithm that is a suitable extension of the KMR paradigm and, in particular, of the KMRC as it uses a degenerate alphabet. Our algorithm contains several improvements that become especially useful when-as it is required for relational motifs-the inference is made by partially overlapping shorter motifs, rather than concatenating them. The efficiency, correctness and completeness of the algorithm is ensured by several non-trivial properties that are proven in this paper. The algorithm has been applied in the important field of protein common 3D substructure searching. The methods implemented have been tested on several examples of protein families such as serine proteases, globins and cytochromes P450 additionally. The detected motifs have been compared to those found by multiple structural alignments methods.
Dong, Runze; Pan, Shuo; Peng, Zhenling; Zhang, Yang; Yang, Jianyi
2018-05-21
With the rapid increase of the number of protein structures in the Protein Data Bank, it becomes urgent to develop algorithms for efficient protein structure comparisons. In this article, we present the mTM-align server, which consists of two closely related modules: one for structure database search and the other for multiple structure alignment. The database search is speeded up based on a heuristic algorithm and a hierarchical organization of the structures in the database. The multiple structure alignment is performed using the recently developed algorithm mTM-align. Benchmark tests demonstrate that our algorithms outperform other peering methods for both modules, in terms of speed and accuracy. One of the unique features for the server is the interplay between database search and multiple structure alignment. The server provides service not only for performing fast database search, but also for making accurate multiple structure alignment with the structures found by the search. For the database search, it takes about 2-5 min for a structure of a medium size (∼300 residues). For the multiple structure alignment, it takes a few seconds for ∼10 structures of medium sizes. The server is freely available at: http://yanglab.nankai.edu.cn/mTM-align/.
A Comparative Study of Optimization Algorithms for Engineering Synthesis.
1983-03-01
the ADS program demonstrates the flexibility a design engineer would have in selecting an optimization algorithm best suited to solve a particular...demonstrates the flexibility a design engineer would have in selecting an optimization algorithm best suited to solve a particular problem. 4 TABLE OF...algorithm to suit a particular problem. The ADS library of design optimization algorithms was . developed by Vanderplaats in response to the first
GBA manager: an online tool for querying low-complexity regions in proteins.
Bandyopadhyay, Nirmalya; Kahveci, Tamer
2010-01-01
Abstract We developed GBA Manager, an online software that facilitates the Graph-Based Algorithm (GBA) we proposed in our earlier work. GBA identifies the low-complexity regions (LCR) of protein sequences. GBA exploits a similarity matrix, such as BLOSUM62, to compute the complexity of the subsequences of the input protein sequence. It uses a graph-based algorithm to accurately compute the regions that have low complexities. GBA Manager is a user friendly web-service that enables online querying of protein sequences using GBA. In addition to querying capabilities of the existing GBA algorithm, GBA Manager computes the p-values of the LCR identified. The p-value gives an estimate of the possibility that the region appears by chance. GBA Manager presents the output in three different understandable formats. GBA Manager is freely accessible at http://bioinformatics.cise.ufl.edu/GBA/GBA.htm .
Application of a fast sorting algorithm to the assignment of mass spectrometric cross-linking data.
Petrotchenko, Evgeniy V; Borchers, Christoph H
2014-09-01
Cross-linking combined with MS involves enzymatic digestion of cross-linked proteins and identifying cross-linked peptides. Assignment of cross-linked peptide masses requires a search of all possible binary combinations of peptides from the cross-linked proteins' sequences, which becomes impractical with increasing complexity of the protein system and/or if digestion enzyme specificity is relaxed. Here, we describe the application of a fast sorting algorithm to search large sequence databases for cross-linked peptide assignments based on mass. This same algorithm has been used previously for assigning disulfide-bridged peptides (Choi et al., ), but has not previously been applied to cross-linking studies. © 2014 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
The potential of genetic algorithms for conceptual design of rotor systems
NASA Technical Reports Server (NTRS)
Crossley, William A.; Wells, Valana L.; Laananen, David H.
1993-01-01
The capabilities of genetic algorithms as a non-calculus based, global search method make them potentially useful in the conceptual design of rotor systems. Coupling reasonably simple analysis tools to the genetic algorithm was accomplished, and the resulting program was used to generate designs for rotor systems to match requirements similar to those of both an existing helicopter and a proposed helicopter design. This provides a comparison with the existing design and also provides insight into the potential of genetic algorithms in design of new rotors.
SARNAclust: Semi-automatic detection of RNA protein binding motifs from immunoprecipitation data
Dotu, Ivan; Adamson, Scott I.; Coleman, Benjamin; Fournier, Cyril; Ricart-Altimiras, Emma; Eyras, Eduardo
2018-01-01
RNA-protein binding is critical to gene regulation, controlling fundamental processes including splicing, translation, localization and stability, and aberrant RNA-protein interactions are known to play a role in a wide variety of diseases. However, molecular understanding of RNA-protein interactions remains limited; in particular, identification of RNA motifs that bind proteins has long been challenging, especially when such motifs depend on both sequence and structure. Moreover, although RNA binding proteins (RBPs) often contain more than one binding domain, algorithms capable of identifying more than one binding motif simultaneously have not been developed. In this paper we present a novel pipeline to determine binding peaks in crosslinking immunoprecipitation (CLIP) data, to discover multiple possible RNA sequence/structure motifs among them, and to experimentally validate such motifs. At the core is a new semi-automatic algorithm SARNAclust, the first unsupervised method to identify and deconvolve multiple sequence/structure motifs simultaneously. SARNAclust computes similarity between sequence/structure objects using a graph kernel, providing the ability to isolate the impact of specific features through the bulge graph formalism. Application of SARNAclust to synthetic data shows its capability of clustering 5 motifs at once with a V-measure value of over 0.95, while GraphClust achieves only a V-measure of 0.083 and RNAcontext cannot detect any of the motifs. When applied to existing eCLIP sets, SARNAclust finds known motifs for SLBP and HNRNPC and novel motifs for several other RBPs such as AGGF1, AKAP8L and ILF3. We demonstrate an experimental validation protocol, a targeted Bind-n-Seq-like high-throughput sequencing approach that relies on RNA inverse folding for oligo pool design, that can validate the components within the SLBP motif. Finally, we use this protocol to experimentally interrogate the SARNAclust motif predictions for protein ILF3. Our results support a newly identified partially double-stranded UUUUUGAGA motif similar to that known for the splicing factor HNRNPC. PMID:29596423
Kinjo, Akira R.; Nakamura, Haruki
2012-01-01
Comparison and classification of protein structures are fundamental means to understand protein functions. Due to the computational difficulty and the ever-increasing amount of structural data, however, it is in general not feasible to perform exhaustive all-against-all structure comparisons necessary for comprehensive classifications. To efficiently handle such situations, we have previously proposed a method, now called GIRAF. We herein describe further improvements in the GIRAF protein structure search and alignment method. The GIRAF method achieves extremely efficient search of similar structures of ligand binding sites of proteins by exploiting database indexing of structural features of local coordinate frames. In addition, it produces refined atom-wise alignments by iterative applications of the Hungarian method to the bipartite graph defined for a pair of superimposed structures. By combining the refined alignments based on different local coordinate frames, it is made possible to align structures involving domain movements. We provide detailed accounts for the database design, the search and alignment algorithms as well as some benchmark results. PMID:27493524
MutaBind estimates and interprets the effects of sequence variants on protein-protein interactions.
Li, Minghui; Simonetti, Franco L; Goncearenco, Alexander; Panchenko, Anna R
2016-07-08
Proteins engage in highly selective interactions with their macromolecular partners. Sequence variants that alter protein binding affinity may cause significant perturbations or complete abolishment of function, potentially leading to diseases. There exists a persistent need to develop a mechanistic understanding of impacts of variants on proteins. To address this need we introduce a new computational method MutaBind to evaluate the effects of sequence variants and disease mutations on protein interactions and calculate the quantitative changes in binding affinity. The MutaBind method uses molecular mechanics force fields, statistical potentials and fast side-chain optimization algorithms. The MutaBind server maps mutations on a structural protein complex, calculates the associated changes in binding affinity, determines the deleterious effect of a mutation, estimates the confidence of this prediction and produces a mutant structural model for download. MutaBind can be applied to a large number of problems, including determination of potential driver mutations in cancer and other diseases, elucidation of the effects of sequence variants on protein fitness in evolution and protein design. MutaBind is available at http://www.ncbi.nlm.nih.gov/projects/mutabind/. Published by Oxford University Press on behalf of Nucleic Acids Research 2016. This work is written by (a) US Government employee(s) and is in the public domain in the US.
jMetalCpp: optimizing molecular docking problems with a C++ metaheuristic framework.
López-Camacho, Esteban; García Godoy, María Jesús; Nebro, Antonio J; Aldana-Montes, José F
2014-02-01
Molecular docking is a method for structure-based drug design and structural molecular biology, which attempts to predict the position and orientation of a small molecule (ligand) in relation to a protein (receptor) to produce a stable complex with a minimum binding energy. One of the most widely used software packages for this purpose is AutoDock, which incorporates three metaheuristic techniques. We propose the integration of AutoDock with jMetalCpp, an optimization framework, thereby providing both single- and multi-objective algorithms that can be used to effectively solve docking problems. The resulting combination of AutoDock + jMetalCpp allows users of the former to easily use the metaheuristics provided by the latter. In this way, biologists have at their disposal a richer set of optimization techniques than those already provided in AutoDock. Moreover, designers of metaheuristic techniques can use molecular docking for case studies, which can lead to more efficient algorithms oriented to solving the target problems. jMetalCpp software adapted to AutoDock is freely available as a C++ source code at http://khaos.uma.es/AutodockjMetal/.
Azimipour, Mehdi; Sheikhzadeh, Mahya; Baumgartner, Ryan; Cullen, Patrick K; Helmstetter, Fred J; Chang, Woo-Jin; Pashaie, Ramin
2017-01-01
We present our effort in implementing a fluorescence laminar optical tomography scanner which is specifically designed for noninvasive three-dimensional imaging of fluorescence proteins in the brains of small rodents. A laser beam, after passing through a cylindrical lens, scans the brain tissue from the surface while the emission signal is captured by the epi-fluorescence optics and is recorded using an electron multiplication CCD sensor. Image reconstruction algorithms are developed based on Monte Carlo simulation to model light–tissue interaction and generate the sensitivity matrices. To solve the inverse problem, we used the iterative simultaneous algebraic reconstruction technique. The performance of the developed system was evaluated by imaging microfabricated silicon microchannels embedded inside a substrate with optical properties close to the brain as a tissue phantom and ultimately by scanning brain tissue in vivo. Details of the hardware design and reconstruction algorithms are discussed and several experimental results are presented. The developed system can specifically facilitate neuroscience experiments where fluorescence imaging and molecular genetic methods are used to study the dynamics of the brain circuitries.
A graphic method for identification of novel glioma related genes.
Gao, Yu-Fei; Shu, Yang; Yang, Lei; He, Yi-Chun; Li, Li-Peng; Huang, GuaHua; Li, Hai-Peng; Jiang, Yang
2014-01-01
Glioma, as the most common and lethal intracranial tumor, is a serious disease that causes many deaths every year. Good comprehension of the mechanism underlying this disease is very helpful to design effective treatments. However, up to now, the knowledge of this disease is still limited. It is an important step to understand the mechanism underlying this disease by uncovering its related genes. In this study, a graphic method was proposed to identify novel glioma related genes based on known glioma related genes. A weighted graph was constructed according to the protein-protein interaction information retrieved from STRING and the well-known shortest path algorithm was employed to discover novel genes. The following analysis suggests that some of them are related to the biological process of glioma, proving that our method was effective in identifying novel glioma related genes. We hope that the proposed method would be applied to study other diseases and provide useful information to medical workers, thereby designing effective treatments of different diseases.
ICPD-A New Peak Detection Algorithm for LC/MS
2010-01-01
Background The identification and quantification of proteins using label-free Liquid Chromatography/Mass Spectrometry (LC/MS) play crucial roles in biological and biomedical research. Increasing evidence has shown that biomarkers are often low abundance proteins. However, LC/MS systems are subject to considerable noise and sample variability, whose statistical characteristics are still elusive, making computational identification of low abundance proteins extremely challenging. As a result, the inability of identifying low abundance proteins in a proteomic study is the main bottleneck in protein biomarker discovery. Results In this paper, we propose a new peak detection method called Information Combining Peak Detection (ICPD ) for high resolution LC/MS. In LC/MS, peptides elute during a certain time period and as a result, peptide isotope patterns are registered in multiple MS scans. The key feature of the new algorithm is that the observed isotope patterns registered in multiple scans are combined together for estimating the likelihood of the peptide existence. An isotope pattern matching score based on the likelihood probability is provided and utilized for peak detection. Conclusions The performance of the new algorithm is evaluated based on protein standards with 48 known proteins. The evaluation shows better peak detection accuracy for low abundance proteins than other LC/MS peak detection methods. PMID:21143790
Discrete size optimization of steel trusses using a refined big bang-big crunch algorithm
NASA Astrophysics Data System (ADS)
Hasançebi, O.; Kazemzadeh Azad, S.
2014-01-01
This article presents a methodology that provides a method for design optimization of steel truss structures based on a refined big bang-big crunch (BB-BC) algorithm. It is shown that a standard formulation of the BB-BC algorithm occasionally falls short of producing acceptable solutions to problems from discrete size optimum design of steel trusses. A reformulation of the algorithm is proposed and implemented for design optimization of various discrete truss structures according to American Institute of Steel Construction Allowable Stress Design (AISC-ASD) specifications. Furthermore, the performance of the proposed BB-BC algorithm is compared to its standard version as well as other well-known metaheuristic techniques. The numerical results confirm the efficiency of the proposed algorithm in practical design optimization of truss structures.
Robust prediction of protein subcellular localization combining PCA and WSVMs.
Tian, Jiang; Gu, Hong; Liu, Wenqi; Gao, Chiyang
2011-08-01
Automated prediction of protein subcellular localization is an important tool for genome annotation and drug discovery, and Support Vector Machines (SVMs) can effectively solve this problem in a supervised manner. However, the datasets obtained from real experiments are likely to contain outliers or noises, which can lead to poor generalization ability and classification accuracy. To explore this problem, we adopt strategies to lower the effect of outliers. First we design a method based on Weighted SVMs, different weights are assigned to different data points, so the training algorithm will learn the decision boundary according to the relative importance of the data points. Second we analyse the influence of Principal Component Analysis (PCA) on WSVM classification, propose a hybrid classifier combining merits of both PCA and WSVM. After performing dimension reduction operations on the datasets, kernel-based possibilistic c-means algorithm can generate more suitable weights for the training, as PCA transforms the data into a new coordinate system with largest variances affected greatly by the outliers. Experiments on benchmark datasets show promising results, which confirms the effectiveness of the proposed method in terms of prediction accuracy. Copyright © 2011 Elsevier Ltd. All rights reserved.
A Library of Optimization Algorithms for Organizational Design
2005-01-01
N00014-98-1-0465 and #N00014-00-1-0101 A Library of Optimization Algorithms for Organizational Design Georgiy M. Levchuk Yuri N. Levchuk Jie Luo...E-mail: Krishna@engr.uconn.edu Abstract This paper presents a library of algorithms to solve a broad range of optimization problems arising in the...normative design of organizations to execute a specific mission. The use of specific optimization algorithms for different phases of the design process
Tear fluid proteomics multimarkers for diabetic retinopathy screening
2013-01-01
Background The aim of the project was to develop a novel method for diabetic retinopathy screening based on the examination of tear fluid biomarker changes. In order to evaluate the usability of protein biomarkers for pre-screening purposes several different approaches were used, including machine learning algorithms. Methods All persons involved in the study had diabetes. Diabetic retinopathy (DR) was diagnosed by capturing 7-field fundus images, evaluated by two independent ophthalmologists. 165 eyes were examined (from 119 patients), 55 were diagnosed healthy and 110 images showed signs of DR. Tear samples were taken from all eyes and state-of-the-art nano-HPLC coupled ESI-MS/MS mass spectrometry protein identification was performed on all samples. Applicability of protein biomarkers was evaluated by six different optimally parameterized machine learning algorithms: Support Vector Machine, Recursive Partitioning, Random Forest, Naive Bayes, Logistic Regression, K-Nearest Neighbor. Results Out of the six investigated machine learning algorithms the result of Recursive Partitioning proved to be the most accurate. The performance of the system realizing the above algorithm reached 74% sensitivity and 48% specificity. Conclusions Protein biomarkers selected and classified with machine learning algorithms alone are at present not recommended for screening purposes because of low specificity and sensitivity values. This tool can be potentially used to improve the results of image processing methods as a complementary tool in automatic or semiautomatic systems. PMID:23919537
Development of a Dependency Theory Toolbox for Database Design.
1987-12-01
published algorithms and theorems , and hand simulating these algorithms can be a tedious and error prone chore. Additionally, since the process of...to design and study relational databases exists in the form of published algorithms and theorems . However, hand simulating these algorithms can be a...published algorithms and theorems . Hand simulating these algorithms can be a tedious and error prone chore. Therefore, a toolbox of algorithms and
AllergenFP: allergenicity prediction by descriptor fingerprints.
Dimitrov, Ivan; Naneva, Lyudmila; Doytchinova, Irini; Bangov, Ivan
2014-03-15
Allergenicity, like antigenicity and immunogenicity, is a property encoded linearly and non-linearly, and therefore the alignment-based approaches are not able to identify this property unambiguously. A novel alignment-free descriptor-based fingerprint approach is presented here and applied to identify allergens and non-allergens. The approach was implemented into a four step algorithm. Initially, the protein sequences are described by amino acid principal properties as hydrophobicity, size, relative abundance, helix and β-strand forming propensities. Then, the generated strings of different length are converted into vectors with equal length by auto- and cross-covariance (ACC). The vectors were transformed into binary fingerprints and compared in terms of Tanimoto coefficient. The approach was applied to a set of 2427 known allergens and 2427 non-allergens and identified correctly 88% of them with Matthews correlation coefficient of 0.759. The descriptor fingerprint approach presented here is universal. It could be applied for any classification problem in computational biology. The set of E-descriptors is able to capture the main structural and physicochemical properties of amino acids building the proteins. The ACC transformation overcomes the main problem in the alignment-based comparative studies arising from the different length of the aligned protein sequences. The conversion of protein ACC values into binary descriptor fingerprints allows similarity search and classification. The algorithm described in the present study was implemented in a specially designed Web site, named AllergenFP (FP stands for FingerPrint). AllergenFP is written in Python, with GIU in HTML. It is freely accessible at http://ddg-pharmfac.net/Allergen FP. idoytchinova@pharmfac.net or ivanbangov@shu-bg.net.
A genetic algorithm for solving supply chain network design model
NASA Astrophysics Data System (ADS)
Firoozi, Z.; Ismail, N.; Ariafar, S. H.; Tang, S. H.; Ariffin, M. K. M. A.
2013-09-01
Network design is by nature costly and optimization models play significant role in reducing the unnecessary cost components of a distribution network. This study proposes a genetic algorithm to solve a distribution network design model. The structure of the chromosome in the proposed algorithm is defined in a novel way that in addition to producing feasible solutions, it also reduces the computational complexity of the algorithm. Computational results are presented to show the algorithm performance.
Goodswen, Stephen J; Kennedy, Paul J; Ellis, John T
2013-11-02
An in silico vaccine discovery pipeline for eukaryotic pathogens typically consists of several computational tools to predict protein characteristics. The aim of the in silico approach to discovering subunit vaccines is to use predicted characteristics to identify proteins which are worthy of laboratory investigation. A major challenge is that these predictions are inherent with hidden inaccuracies and contradictions. This study focuses on how to reduce the number of false candidates using machine learning algorithms rather than relying on expensive laboratory validation. Proteins from Toxoplasma gondii, Plasmodium sp., and Caenorhabditis elegans were used as training and test datasets. The results show that machine learning algorithms can effectively distinguish expected true from expected false vaccine candidates (with an average sensitivity and specificity of 0.97 and 0.98 respectively), for proteins observed to induce immune responses experimentally. Vaccine candidates from an in silico approach can only be truly validated in a laboratory. Given any in silico output and appropriate training data, the number of false candidates allocated for validation can be dramatically reduced using a pool of machine learning algorithms. This will ultimately save time and money in the laboratory.
A structure adapted multipole method for electrostatic interactions in protein dynamics
NASA Astrophysics Data System (ADS)
Niedermeier, Christoph; Tavan, Paul
1994-07-01
We present an algorithm for rapid approximate evaluation of electrostatic interactions in molecular dynamics simulations of proteins. Traditional algorithms require computational work of the order O(N2) for a system of N particles. Truncation methods which try to avoid that effort entail untolerably large errors in forces, energies and other observables. Hierarchical multipole expansion algorithms, which can account for the electrostatics to numerical accuracy, scale with O(N log N) or even with O(N) if they become augmented by a sophisticated scheme for summing up forces. To further reduce the computational effort we propose an algorithm that also uses a hierarchical multipole scheme but considers only the first two multipole moments (i.e., charges and dipoles). Our strategy is based on the consideration that numerical accuracy may not be necessary to reproduce protein dynamics with sufficient correctness. As opposed to previous methods, our scheme for hierarchical decomposition is adjusted to structural and dynamical features of the particular protein considered rather than chosen rigidly as a cubic grid. As compared to truncation methods we manage to reduce errors in the computation of electrostatic forces by a factor of 10 with only marginal additional effort.
2013-01-01
Background An in silico vaccine discovery pipeline for eukaryotic pathogens typically consists of several computational tools to predict protein characteristics. The aim of the in silico approach to discovering subunit vaccines is to use predicted characteristics to identify proteins which are worthy of laboratory investigation. A major challenge is that these predictions are inherent with hidden inaccuracies and contradictions. This study focuses on how to reduce the number of false candidates using machine learning algorithms rather than relying on expensive laboratory validation. Proteins from Toxoplasma gondii, Plasmodium sp., and Caenorhabditis elegans were used as training and test datasets. Results The results show that machine learning algorithms can effectively distinguish expected true from expected false vaccine candidates (with an average sensitivity and specificity of 0.97 and 0.98 respectively), for proteins observed to induce immune responses experimentally. Conclusions Vaccine candidates from an in silico approach can only be truly validated in a laboratory. Given any in silico output and appropriate training data, the number of false candidates allocated for validation can be dramatically reduced using a pool of machine learning algorithms. This will ultimately save time and money in the laboratory. PMID:24180526
NASA Astrophysics Data System (ADS)
Singh, R.; Verma, H. K.
2013-12-01
This paper presents a teaching-learning-based optimization (TLBO) algorithm to solve parameter identification problems in the designing of digital infinite impulse response (IIR) filter. TLBO based filter modelling is applied to calculate the parameters of unknown plant in simulations. Unlike other heuristic search algorithms, TLBO algorithm is an algorithm-specific parameter-less algorithm. In this paper big bang-big crunch (BB-BC) optimization and PSO algorithms are also applied to filter design for comparison. Unknown filter parameters are considered as a vector to be optimized by these algorithms. MATLAB programming is used for implementation of proposed algorithms. Experimental results show that the TLBO is more accurate to estimate the filter parameters than the BB-BC optimization algorithm and has faster convergence rate when compared to PSO algorithm. TLBO is used where accuracy is more essential than the convergence speed.
Recognition of Protein-coding Genes Based on Z-curve Algorithms
-Biao Guo, Feng; Lin, Yan; -Ling Chen, Ling
2014-01-01
Recognition of protein-coding genes, a classical bioinformatics issue, is an absolutely needed step for annotating newly sequenced genomes. The Z-curve algorithm, as one of the most effective methods on this issue, has been successfully applied in annotating or re-annotating many genomes, including those of bacteria, archaea and viruses. Two Z-curve based ab initio gene-finding programs have been developed: ZCURVE (for bacteria and archaea) and ZCURVE_V (for viruses and phages). ZCURVE_C (for 57 bacteria) and Zfisher (for any bacterium) are web servers for re-annotation of bacterial and archaeal genomes. The above four tools can be used for genome annotation or re-annotation, either independently or combined with the other gene-finding programs. In addition to recognizing protein-coding genes and exons, Z-curve algorithms are also effective in recognizing promoters and translation start sites. Here, we summarize the applications of Z-curve algorithms in gene finding and genome annotation. PMID:24822027
Performance-Based Seismic Design of Steel Frames Utilizing Colliding Bodies Algorithm
Veladi, H.
2014-01-01
A pushover analysis method based on semirigid connection concept is developed and the colliding bodies optimization algorithm is employed to find optimum seismic design of frame structures. Two numerical examples from the literature are studied. The results of the new algorithm are compared to the conventional design methods to show the power or weakness of the algorithm. PMID:25202717
Performance-based seismic design of steel frames utilizing colliding bodies algorithm.
Veladi, H
2014-01-01
A pushover analysis method based on semirigid connection concept is developed and the colliding bodies optimization algorithm is employed to find optimum seismic design of frame structures. Two numerical examples from the literature are studied. The results of the new algorithm are compared to the conventional design methods to show the power or weakness of the algorithm.
Network-based ranking methods for prediction of novel disease associated microRNAs.
Le, Duc-Hau
2015-10-01
Many studies have shown roles of microRNAs on human disease and a number of computational methods have been proposed to predict such associations by ranking candidate microRNAs according to their relevance to a disease. Among them, machine learning-based methods usually have a limitation in specifying non-disease microRNAs as negative training samples. Meanwhile, network-based methods are becoming dominant since they well exploit a "disease module" principle in microRNA functional similarity networks. Of which, random walk with restart (RWR) algorithm-based method is currently state-of-the-art. The use of this algorithm was inspired from its success in predicting disease gene because the "disease module" principle also exists in protein interaction networks. Besides, many algorithms designed for webpage ranking have been successfully applied in ranking disease candidate genes because web networks share topological properties with protein interaction networks. However, these algorithms have not yet been utilized for disease microRNA prediction. We constructed microRNA functional similarity networks based on shared targets of microRNAs, and then we integrated them with a microRNA functional synergistic network, which was recently identified. After analyzing topological properties of these networks, in addition to RWR, we assessed the performance of (i) PRINCE (PRIoritizatioN and Complex Elucidation), which was proposed for disease gene prediction; (ii) PageRank with Priors (PRP) and K-Step Markov (KSM), which were used for studying web networks; and (iii) a neighborhood-based algorithm. Analyses on topological properties showed that all microRNA functional similarity networks are small-worldness and scale-free. The performance of each algorithm was assessed based on average AUC values on 35 disease phenotypes and average rankings of newly discovered disease microRNAs. As a result, the performance on the integrated network was better than that on individual ones. In addition, the performance of PRINCE, PRP and KSM was comparable with that of RWR, whereas it was worst for the neighborhood-based algorithm. Moreover, all the algorithms were stable with the change of parameters. Final, using the integrated network, we predicted six novel miRNAs (i.e., hsa-miR-101, hsa-miR-181d, hsa-miR-192, hsa-miR-423-3p, hsa-miR-484 and hsa-miR-98) associated with breast cancer. Network-based ranking algorithms, which were successfully applied for either disease gene prediction or for studying social/web networks, can be also used effectively for disease microRNA prediction. Copyright © 2015 Elsevier Ltd. All rights reserved.
Goonesekere, Nalin Cw
2009-01-01
The large numbers of protein sequences generated by whole genome sequencing projects require rapid and accurate methods of annotation. The detection of homology through computational sequence analysis is a powerful tool in determining the complex evolutionary and functional relationships that exist between proteins. Homology search algorithms employ amino acid substitution matrices to detect similarity between proteins sequences. The substitution matrices in common use today are constructed using sequences aligned without reference to protein structure. Here we present amino acid substitution matrices constructed from the alignment of a large number of protein domain structures from the structural classification of proteins (SCOP) database. We show that when incorporated into the homology search algorithms BLAST and PSI-blast, the structure-based substitution matrices enhance the efficacy of detecting remote homologs.
Kang, Beom Sik; Pugalendhi, GaneshKumar; Kim, Ku-Jin
2017-10-13
Interactions between protein molecules are essential for the assembly, function, and regulation of proteins. The contact region between two protein molecules in a protein complex is usually complementary in shape for both molecules and the area of the contact region can be used to estimate the binding strength between two molecules. Although the area is a value calculated from the three-dimensional surface, it cannot represent the three-dimensional shape of the surface. Therefore, we propose an original concept of two-dimensional contact area which provides further information such as the ruggedness of the contact region. We present a novel algorithm for calculating the binding direction between two molecules in a protein complex, and then suggest a method to compute the two-dimensional flattened area of the contact region between two molecules based on the binding direction.
A quasi-physical algorithm for the structure optimization in an off-lattice protein model.
Liu, Jing-Fa; Huang, Wen-Qi
2006-02-01
In this paper, we study an off-lattice protein AB model with two species of monomers, hydrophobic and hydrophilic, and present a heuristic quasi-physical algorithm. First, by elaborately simulating the movement of the smooth solids in the physical world, we find low-energy conformations for a given monomer chain. A subsequent off-trap strategy is then proposed to trigger a jump for a stuck situation in order to get out of the local minima. The algorithm has been tested in the three-dimensional AB model for all sequences with lengths of 13-55 monomers. In several cases, we renew the putative ground state energy values. The numerical results show that the proposed methods are very promising for finding the ground states of proteins.
Yousef, Abdulaziz; Moghadam Charkari, Nasrollah
2013-11-07
Protein-Protein interaction (PPI) is one of the most important data in understanding the cellular processes. Many interesting methods have been proposed in order to predict PPIs. However, the methods which are based on the sequence of proteins as a prior knowledge are more universal. In this paper, a sequence-based, fast, and adaptive PPI prediction method is introduced to assign two proteins to an interaction class (yes, no). First, in order to improve the presentation of the sequences, twelve physicochemical properties of amino acid have been used by different representation methods to transform the sequence of protein pairs into different feature vectors. Then, for speeding up the learning process and reducing the effect of noise PPI data, principal component analysis (PCA) is carried out as a proper feature extraction algorithm. Finally, a new and adaptive Learning Vector Quantization (LVQ) predictor is designed to deal with different models of datasets that are classified into balanced and imbalanced datasets. The accuracy of 93.88%, 90.03%, and 89.72% has been found on S. cerevisiae, H. pylori, and independent datasets, respectively. The results of various experiments indicate the efficiency and validity of the method. © 2013 Published by Elsevier Ltd.
AGGRESCAN3D (A3D): server for prediction of aggregation properties of protein structures
Zambrano, Rafael; Jamroz, Michal; Szczasiuk, Agata; Pujols, Jordi; Kmiecik, Sebastian; Ventura, Salvador
2015-01-01
Protein aggregation underlies an increasing number of disorders and constitutes a major bottleneck in the development of therapeutic proteins. Our present understanding on the molecular determinants of protein aggregation has crystalized in a series of predictive algorithms to identify aggregation-prone sites. A majority of these methods rely only on sequence. Therefore, they find difficulties to predict the aggregation properties of folded globular proteins, where aggregation-prone sites are often not contiguous in sequence or buried inside the native structure. The AGGRESCAN3D (A3D) server overcomes these limitations by taking into account the protein structure and the experimental aggregation propensity scale from the well-established AGGRESCAN method. Using the A3D server, the identified aggregation-prone residues can be virtually mutated to design variants with increased solubility, or to test the impact of pathogenic mutations. Additionally, A3D server enables to take into account the dynamic fluctuations of protein structure in solution, which may influence aggregation propensity. This is possible in A3D Dynamic Mode that exploits the CABS-flex approach for the fast simulations of flexibility of globular proteins. The A3D server can be accessed at http://biocomp.chem.uw.edu.pl/A3D/. PMID:25883144
A Simple and Practical Dictionary-based Approach for Identification of Proteins in Medline Abstracts
Egorov, Sergei; Yuryev, Anton; Daraselia, Nikolai
2004-01-01
Objective: The aim of this study was to develop a practical and efficient protein identification system for biomedical corpora. Design: The developed system, called ProtScan, utilizes a carefully constructed dictionary of mammalian proteins in conjunction with a specialized tokenization algorithm to identify and tag protein name occurrences in biomedical texts and also takes advantage of Medline “Name-of-Substance” (NOS) annotation. The dictionaries for ProtScan were constructed in a semi-automatic way from various public-domain sequence databases followed by an intensive expert curation step. Measurements: The recall and precision of the system have been determined using 1,000 randomly selected and hand-tagged Medline abstracts. Results: The developed system is capable of identifying protein occurrences in Medline abstracts with a 98% precision and 88% recall. It was also found to be capable of processing approximately 300 abstracts per second. Without utilization of NOS annotation, precision and recall were found to be 98.5% and 84%, respectively. Conclusion: The developed system appears to be well suited for protein-based Medline indexing and can help to improve biomedical information retrieval. Further approaches to ProtScan's recall improvement also are discussed. PMID:14764613
Predicting cancer-relevant proteins using an improved molecular similarity ensemble approach.
Zhou, Bin; Sun, Qi; Kong, De-Xin
2016-05-31
In this study, we proposed an improved algorithm for identifying proteins relevant to cancer. The algorithm was named two-layer molecular similarity ensemble approach (TL-SEA). We applied TL-SEA to analyzing the correlation between anticancer compounds (against cell lines K562, MCF7 and A549) and active compounds against separate target proteins listed in BindingDB. Several associations between cancer types and related proteins were revealed using this chemoinformatics approach. An analysis of the literature showed that 26 of 35 predicted proteins were correlated with cancer cell proliferation, apoptosis or differentiation. Additionally, interactions between proteins in BindingDB and anticancer chemicals were also predicted. We discuss the roles of the most important predicted proteins in cancer biology and conclude that TL-SEA could be a useful tool for inferring novel proteins involved in cancer and revealing underlying molecular mechanisms.
Comparative study of classification algorithms for immunosignaturing data
2012-01-01
Background High-throughput technologies such as DNA, RNA, protein, antibody and peptide microarrays are often used to examine differences across drug treatments, diseases, transgenic animals, and others. Typically one trains a classification system by gathering large amounts of probe-level data, selecting informative features, and classifies test samples using a small number of features. As new microarrays are invented, classification systems that worked well for other array types may not be ideal. Expression microarrays, arguably one of the most prevalent array types, have been used for years to help develop classification algorithms. Many biological assumptions are built into classifiers that were designed for these types of data. One of the more problematic is the assumption of independence, both at the probe level and again at the biological level. Probes for RNA transcripts are designed to bind single transcripts. At the biological level, many genes have dependencies across transcriptional pathways where co-regulation of transcriptional units may make many genes appear as being completely dependent. Thus, algorithms that perform well for gene expression data may not be suitable when other technologies with different binding characteristics exist. The immunosignaturing microarray is based on complex mixtures of antibodies binding to arrays of random sequence peptides. It relies on many-to-many binding of antibodies to the random sequence peptides. Each peptide can bind multiple antibodies and each antibody can bind multiple peptides. This technology has been shown to be highly reproducible and appears promising for diagnosing a variety of disease states. However, it is not clear what is the optimal classification algorithm for analyzing this new type of data. Results We characterized several classification algorithms to analyze immunosignaturing data. We selected several datasets that range from easy to difficult to classify, from simple monoclonal binding to complex binding patterns in asthma patients. We then classified the biological samples using 17 different classification algorithms. Using a wide variety of assessment criteria, we found ‘Naïve Bayes’ far more useful than other widely used methods due to its simplicity, robustness, speed and accuracy. Conclusions ‘Naïve Bayes’ algorithm appears to accommodate the complex patterns hidden within multilayered immunosignaturing microarray data due to its fundamental mathematical properties. PMID:22720696
Algorithmic Coordination in Robotic Networks
2010-11-29
appropriate performance, robustness and scalability properties for various task allocation , surveillance, and information gathering applications is...networking, we envision designing and analyzing algorithms with appropriate performance, robustness and scalability properties for various task ...distributed algorithms for target assignments; based on the classic auction algorithms in static networks, we intend to design efficient algorithms in worst
Analysis of X-ray structures of matrix metalloproteinases via chaotic map clustering.
Giangreco, Ilenia; Nicolotti, Orazio; Carotti, Angelo; De Carlo, Francesco; Gargano, Gianfranco; Bellotti, Roberto
2010-10-08
Matrix metalloproteinases (MMPs) are well-known biological targets implicated in tumour progression, homeostatic regulation, innate immunity, impaired delivery of pro-apoptotic ligands, and the release and cleavage of cell-surface receptors. With this in mind, the perception of the intimate relationships among diverse MMPs could be a solid basis for accelerated learning in designing new selective MMP inhibitors. In this regard, decrypting the latent molecular reasons in order to elucidate similarity among MMPs is a key challenge. We describe a pairwise variant of the non-parametric chaotic map clustering (CMC) algorithm and its application to 104 X-ray MMP structures. In this analysis electrostatic potentials are computed and used as input for the CMC algorithm. It was shown that differences between proteins reflect genuine variation of their electrostatic potentials. In addition, the analysis has been also extended to analyze the protein primary structures and the molecular shapes of the MMP co-crystallised ligands. The CMC algorithm was shown to be a valuable tool in knowledge acquisition and transfer from MMP structures. Based on the variation of electrostatic potentials, CMC was successful in analysing the MMP target family landscape and different subsites. The first investigation resulted in rational figure interpretation of both domain organization as well as of substrate specificity classifications. The second made it possible to distinguish the MMP classes, demonstrating the high specificity of the S1' pocket, to detect both the occurrence of punctual mutations of ionisable residues and different side-chain conformations that likely account for induced-fit phenomena. In addition, CMC demonstrated a potential comparable to the most popular UPGMA (Unweighted Pair Group Method with Arithmetic mean) method that, at present, represents a standard clustering bioinformatics approach. Interestingly, CMC and UPGMA resulted in closely comparable outcomes, but often CMC produced more informative and more easy interpretable dendrograms. Finally, CMC was successful for standard pairwise analysis (i.e., Smith-Waterman algorithm) of protein sequences and was used to convincingly explain the complementarity existing between the molecular shapes of the co-crystallised ligand molecules and the accessible MMP void volumes.
Sequence information signal processor for local and global string comparisons
Peterson, John C.; Chow, Edward T.; Waterman, Michael S.; Hunkapillar, Timothy J.
1997-01-01
A sequence information signal processing integrated circuit chip designed to perform high speed calculation of a dynamic programming algorithm based upon the algorithm defined by Waterman and Smith. The signal processing chip of the present invention is designed to be a building block of a linear systolic array, the performance of which can be increased by connecting additional sequence information signal processing chips to the array. The chip provides a high speed, low cost linear array processor that can locate highly similar global sequences or segments thereof such as contiguous subsequences from two different DNA or protein sequences. The chip is implemented in a preferred embodiment using CMOS VLSI technology to provide the equivalent of about 400,000 transistors or 100,000 gates. Each chip provides 16 processing elements, and is designed to provide 16 bit, two's compliment operation for maximum score precision of between -32,768 and +32,767. It is designed to provide a comparison between sequences as long as 4,194,304 elements without external software and between sequences of unlimited numbers of elements with the aid of external software. Each sequence can be assigned different deletion and insertion weight functions. Each processor is provided with a similarity measure device which is independently variable. Thus, each processor can contribute to maximum value score calculation using a different similarity measure.
SABER: a computational method for identifying active sites for new reactions.
Nosrati, Geoffrey R; Houk, K N
2012-05-01
A software suite, SABER (Selection of Active/Binding sites for Enzyme Redesign), has been developed for the analysis of atomic geometries in protein structures, using a geometric hashing algorithm (Barker and Thornton, Bioinformatics 2003;19:1644-1649). SABER is used to explore the Protein Data Bank (PDB) to locate proteins with a specific 3D arrangement of catalytic groups to identify active sites that might be redesigned to catalyze new reactions. As a proof-of-principle test, SABER was used to identify enzymes that have the same catalytic group arrangement present in o-succinyl benzoate synthase (OSBS). Among the highest-scoring scaffolds identified by the SABER search for enzymes with the same catalytic group arrangement as OSBS were L-Ala D/L-Glu epimerase (AEE) and muconate lactonizing enzyme II (MLE), both of which have been redesigned to become effective OSBS catalysts, demonstrated by experiments. Next, we used SABER to search for naturally existing active sites in the PDB with catalytic groups similar to those present in the designed Kemp elimination enzyme KE07. From over 2000 geometric matches to the KE07 active site, SABER identified 23 matches that corresponded to residues from known active sites. The best of these matches, with a 0.28 Å catalytic atom RMSD to KE07, was then redesigned to be compatible with the Kemp elimination using RosettaDesign. We also used SABER to search for potential Kemp eliminases using a theozyme predicted to provide a greater rate acceleration than the active site of KE07, and used Rosetta to create a design based on the proteins identified. Copyright © 2012 The Protein Society.
Rational Design of Orthogonal Multipolar Interactions with Fluorine in Protein–Ligand Complexes
Pollock, Jonathan; Borkin, Dmitry; Lund, George; ...
2015-08-19
Multipolar interactions involving fluorine and the protein backbone have been frequently observed in protein–ligand complexes. Such fluorine–backbone interactions may substantially contribute to the high affinity of small molecule inhibitors. Here we found that introduction of trifluoromethyl groups into two different sites in the thienopyrimidine class of menin–MLL inhibitors considerably improved their inhibitory activity. In both cases, trifluoromethyl groups are engaged in short interactions with the backbone of menin. In order to understand the effect of fluorine, we synthesized a series of analogues by systematically changing the number of fluorine atoms, and we determined high-resolution crystal structures of the complexes withmore » menin. Here, we found that introduction of fluorine at favorable geometry for interactions with backbone carbonyls may improve the activity of menin–MLL inhibitors as much as 5- to 10-fold. In order to facilitate the design of multipolar fluorine–backbone interactions in protein–ligand complexes, we developed a computational algorithm named FMAP, which calculates fluorophilic sites in proximity to the protein backbone. We demonstrated that FMAP could be used to rationalize improvement in the activity of known protein inhibitors upon introduction of fluorine. Furthermore, FMAP may also represent a valuable tool for designing new fluorine substitutions and support ligand optimization in drug discovery projects. Analysis of the menin–MLL inhibitor complexes revealed that the backbone in secondary structures is particularly accessible to the interactions with fluorine. Lastly, considering that secondary structure elements are frequently exposed at protein interfaces, we postulate that multipolar fluorine–backbone interactions may represent a particularly attractive approach to improve inhibitors of protein–protein interactions.« less
Protein-ligand docking using fitness learning-based artificial bee colony with proximity stimuli.
Uehara, Shota; Fujimoto, Kazuhiro J; Tanaka, Shigenori
2015-07-07
Protein-ligand docking is an optimization problem, which aims to identify the binding pose of a ligand with the lowest energy in the active site of a target protein. In this study, we employed a novel optimization algorithm called fitness learning-based artificial bee colony with proximity stimuli (FlABCps) for docking. Simulation results revealed that FlABCps improved the success rate of docking, compared to four state-of-the-art algorithms. The present results also showed superior docking performance of FlABCps, in particular for dealing with highly flexible ligands and proteins with a wide and shallow binding pocket.
Site-directed protein recombination as a shortest-path problem.
Endelman, Jeffrey B; Silberg, Jonathan J; Wang, Zhen-Gang; Arnold, Frances H
2004-07-01
Protein function can be tuned using laboratory evolution, in which one rapidly searches through a library of proteins for the properties of interest. In site-directed recombination, n crossovers are chosen in an alignment of p parents to define a set of p(n + 1) peptide fragments. These fragments are then assembled combinatorially to create a library of p(n+1) proteins. We have developed a computational algorithm to enrich these libraries in folded proteins while maintaining an appropriate level of diversity for evolution. For a given set of parents, our algorithm selects crossovers that minimize the average energy of the library, subject to constraints on the length of each fragment. This problem is equivalent to finding the shortest path between nodes in a network, for which the global minimum can be found efficiently. Our algorithm has a running time of O(N(3)p(2) + N(2)n) for a protein of length N. Adjusting the constraints on fragment length generates a set of optimized libraries with varying degrees of diversity. By comparing these optima for different sets of parents, we rapidly determine which parents yield the lowest energy libraries.
DASS: efficient discovery and p-value calculation of substructures in unordered data.
Hollunder, Jens; Friedel, Maik; Beyer, Andreas; Workman, Christopher T; Wilhelm, Thomas
2007-01-01
Pattern identification in biological sequence data is one of the main objectives of bioinformatics research. However, few methods are available for detecting patterns (substructures) in unordered datasets. Data mining algorithms mainly developed outside the realm of bioinformatics have been adapted for that purpose, but typically do not determine the statistical significance of the identified patterns. Moreover, these algorithms do not exploit the often modular structure of biological data. We present the algorithm DASS (Discovery of All Significant Substructures) that first identifies all substructures in unordered data (DASS(Sub)) in a manner that is especially efficient for modular data. In addition, DASS calculates the statistical significance of the identified substructures, for sets with at most one element of each type (DASS(P(set))), or for sets with multiple occurrence of elements (DASS(P(mset))). The power and versatility of DASS is demonstrated by four examples: combinations of protein domains in multi-domain proteins, combinations of proteins in protein complexes (protein subcomplexes), combinations of transcription factor target sites in promoter regions and evolutionarily conserved protein interaction subnetworks. The program code and additional data are available at http://www.fli-leibniz.de/tsb/DASS
Lee, Juyong; Lee, Jinhyuk; Sasaki, Takeshi N; Sasai, Masaki; Seok, Chaok; Lee, Jooyoung
2011-08-01
Ab initio protein structure prediction is a challenging problem that requires both an accurate energetic representation of a protein structure and an efficient conformational sampling method for successful protein modeling. In this article, we present an ab initio structure prediction method which combines a recently suggested novel way of fragment assembly, dynamic fragment assembly (DFA) and conformational space annealing (CSA) algorithm. In DFA, model structures are scored by continuous functions constructed based on short- and long-range structural restraint information from a fragment library. Here, DFA is represented by the full-atom model by CHARMM with the addition of the empirical potential of DFIRE. The relative contributions between various energy terms are optimized using linear programming. The conformational sampling was carried out with CSA algorithm, which can find low energy conformations more efficiently than simulated annealing used in the existing DFA study. The newly introduced DFA energy function and CSA sampling algorithm are implemented into CHARMM. Test results on 30 small single-domain proteins and 13 template-free modeling targets of the 8th Critical Assessment of protein Structure Prediction show that the current method provides comparable and complementary prediction results to existing top methods. Copyright © 2011 Wiley-Liss, Inc.
Yan, Yumeng; Tao, Huanyu; Huang, Sheng-You
2018-05-26
A major subclass of protein-protein interactions is formed by homo-oligomers with certain symmetry. Therefore, computational modeling of the symmetric protein complexes is important for understanding the molecular mechanism of related biological processes. Although several symmetric docking algorithms have been developed for Cn symmetry, few docking servers have been proposed for Dn symmetry. Here, we present HSYMDOCK, a web server of our hierarchical symmetric docking algorithm that supports both Cn and Dn symmetry. The HSYMDOCK server was extensively evaluated on three benchmarks of symmetric protein complexes, including the 20 CASP11-CAPRI30 homo-oligomer targets, the symmetric docking benchmark of 213 Cn targets and 35 Dn targets, and a nonredundant test set of 55 transmembrane proteins. It was shown that HSYMDOCK obtained a significantly better performance than other similar docking algorithms. The server supports both sequence and structure inputs for the monomer/subunit. Users have an option to provide the symmetry type of the complex, or the server can predict the symmetry type automatically. The docking process is fast and on average consumes 10∼20 min for a docking job. The HSYMDOCK web server is available at http://huanglab.phys.hust.edu.cn/hsymdock/.
HEURISTIC OPTIMIZATION AND ALGORITHM TUNING APPLIED TO SORPTIVE BARRIER DESIGN
While heuristic optimization is applied in environmental applications, ad-hoc algorithm configuration is typical. We use a multi-layer sorptive barrier design problem as a benchmark for an algorithm-tuning procedure, as applied to three heuristics (genetic algorithms, simulated ...
System Design under Uncertainty: Evolutionary Optimization of the Gravity Probe-B Spacecraft
NASA Technical Reports Server (NTRS)
Pullen, Samuel P.; Parkinson, Bradford W.
1994-01-01
This paper discusses the application of evolutionary random-search algorithms (Simulated Annealing and Genetic Algorithms) to the problem of spacecraft design under performance uncertainty. Traditionally, spacecraft performance uncertainty has been measured by reliability. Published algorithms for reliability optimization are seldom used in practice because they oversimplify reality. The algorithm developed here uses random-search optimization to allow us to model the problem more realistically. Monte Carlo simulations are used to evaluate the objective function for each trial design solution. These methods have been applied to the Gravity Probe-B (GP-B) spacecraft being developed at Stanford University for launch in 1999, Results of the algorithm developed here for GP-13 are shown, and their implications for design optimization by evolutionary algorithms are discussed.
Algorithme intelligent d'optimisation d'un design structurel de grande envergure
NASA Astrophysics Data System (ADS)
Dominique, Stephane
The implementation of an automated decision support system in the field of design and structural optimisation can give a significant advantage to any industry working on mechanical designs. Indeed, by providing solution ideas to a designer or by upgrading existing design solutions while the designer is not at work, the system may reduce the project cycle time, or allow more time to produce a better design. This thesis presents a new approach to automate a design process based on Case-Based Reasoning (CBR), in combination with a new genetic algorithm named Genetic Algorithm with Territorial core Evolution (GATE). This approach was developed in order to reduce the operating cost of the process. However, as the system implementation cost is quite expensive, the approach is better suited for large scale design problem, and particularly for design problems that the designer plans to solve for many different specification sets. First, the CBR process uses a databank filled with every known solution to similar design problems. Then, the closest solutions to the current problem in term of specifications are selected. After this, during the adaptation phase, an artificial neural network (ANN) interpolates amongst known solutions to produce an additional solution to the current problem using the current specifications as inputs. Each solution produced and selected by the CBR is then used to initialize the population of an island of the genetic algorithm. The algorithm will optimise the solution further during the refinement phase. Using progressive refinement, the algorithm starts using only the most important variables for the problem. Then, as the optimisation progress, the remaining variables are gradually introduced, layer by layer. The genetic algorithm that is used is a new algorithm specifically created during this thesis to solve optimisation problems from the field of mechanical device structural design. The algorithm is named GATE, and is essentially a real number genetic algorithm that prevents new individuals to be born too close to previously evaluated solutions. The restricted area becomes smaller or larger during the optimisation to allow global or local search when necessary. Also, a new search operator named Substitution Operator is incorporated in GATE. This operator allows an ANN surrogate model to guide the algorithm toward the most promising areas of the design space. The suggested CBR approach and GATE were tested on several simple test problems, as well as on the industrial problem of designing a gas turbine engine rotor's disc. These results are compared to other results obtained for the same problems by many other popular optimisation algorithms, such as (depending of the problem) gradient algorithms, binary genetic algorithm, real number genetic algorithm, genetic algorithm using multiple parents crossovers, differential evolution genetic algorithm, Hookes & Jeeves generalized pattern search method and POINTER from the software I-SIGHT 3.5. Results show that GATE is quite competitive, giving the best results for 5 of the 6 constrained optimisation problem. GATE also provided the best results of all on problem produced by a Maximum Set Gaussian landscape generator. Finally, GATE provided a disc 4.3% lighter than the best other tested algorithm (POINTER) for the gas turbine engine rotor's disc problem. One drawback of GATE is a lesser efficiency for highly multimodal unconstrained problems, for which he gave quite poor results with respect to its implementation cost. To conclude, according to the preliminary results obtained during this thesis, the suggested CBR process, combined with GATE, seems to be a very good candidate to automate and accelerate the structural design of mechanical devices, potentially reducing significantly the cost of industrial preliminary design processes.
More reliable protein NMR peak assignment via improved 2-interval scheduling.
Chen, Zhi-Zhong; Lin, Guohui; Rizzi, Romeo; Wen, Jianjun; Xu, Dong; Xu, Ying; Jiang, Tao
2005-03-01
Protein NMR peak assignment refers to the process of assigning a group of "spin systems" obtained experimentally to a protein sequence of amino acids. The automation of this process is still an unsolved and challenging problem in NMR protein structure determination. Recently, protein NMR peak assignment has been formulated as an interval scheduling problem (ISP), where a protein sequence P of amino acids is viewed as a discrete time interval I (the amino acids on P one-to-one correspond to the time units of I), each subset S of spin systems that are known to originate from consecutive amino acids from P is viewed as a "job" j(s), the preference of assigning S to a subsequence P of consecutive amino acids on P is viewed as the profit of executing job j(s) in the subinterval of I corresponding to P, and the goal is to maximize the total profit of executing the jobs (on a single machine) during I. The interval scheduling problem is max SNP-hard in general; but in the real practice of protein NMR peak assignment, each job j(s) usually requires at most 10 consecutive time units, and typically the jobs that require one or two consecutive time units are the most difficult to assign/schedule. In order to solve these most difficult assignments, we present an efficient 13/7-approximation algorithm for the special case of the interval scheduling problem where each job takes one or two consecutive time units. Combining this algorithm with a greedy filtering strategy for handling long jobs (i.e., jobs that need more than two consecutive time units), we obtain a new efficient heuristic for protein NMR peak assignment. Our experimental study shows that the new heuristic produces the best peak assignment in most of the cases, compared with the NMR peak assignment algorithms in the recent literature. The above algorithm is also the first approximation algorithm for a nontrivial case of the well-known interval scheduling problem that breaks the ratio 2 barrier.
A fast parallel clustering algorithm for molecular simulation trajectories.
Zhao, Yutong; Sheong, Fu Kit; Sun, Jian; Sander, Pedro; Huang, Xuhui
2013-01-15
We implemented a GPU-powered parallel k-centers algorithm to perform clustering on the conformations of molecular dynamics (MD) simulations. The algorithm is up to two orders of magnitude faster than the CPU implementation. We tested our algorithm on four protein MD simulation datasets ranging from the small Alanine Dipeptide to a 370-residue Maltose Binding Protein (MBP). It is capable of grouping 250,000 conformations of the MBP into 4000 clusters within 40 seconds. To achieve this, we effectively parallelized the code on the GPU and utilize the triangle inequality of metric spaces. Furthermore, the algorithm's running time is linear with respect to the number of cluster centers. In addition, we found the triangle inequality to be less effective in higher dimensions and provide a mathematical rationale. Finally, using Alanine Dipeptide as an example, we show a strong correlation between cluster populations resulting from the k-centers algorithm and the underlying density. © 2012 Wiley Periodicals, Inc. Copyright © 2012 Wiley Periodicals, Inc.
A Novel Latin Hypercube Algorithm via Translational Propagation
Pan, Guang; Ye, Pengcheng
2014-01-01
Metamodels have been widely used in engineering design to facilitate analysis and optimization of complex systems that involve computationally expensive simulation programs. The accuracy of metamodels is directly related to the experimental designs used. Optimal Latin hypercube designs are frequently used and have been shown to have good space-filling and projective properties. However, the high cost in constructing them limits their use. In this paper, a methodology for creating novel Latin hypercube designs via translational propagation and successive local enumeration algorithm (TPSLE) is developed without using formal optimization. TPSLE algorithm is based on the inspiration that a near optimal Latin Hypercube design can be constructed by a simple initial block with a few points generated by algorithm SLE as a building block. In fact, TPSLE algorithm offers a balanced trade-off between the efficiency and sampling performance. The proposed algorithm is compared to two existing algorithms and is found to be much more efficient in terms of the computation time and has acceptable space-filling and projective properties. PMID:25276844
Li, Min; Li, Qi; Ganegoda, Gamage Upeksha; Wang, JianXin; Wu, FangXiang; Pan, Yi
2014-11-01
Identification of disease-causing genes among a large number of candidates is a fundamental challenge in human disease studies. However, it is still time-consuming and laborious to determine the real disease-causing genes by biological experiments. With the advances of the high-throughput techniques, a large number of protein-protein interactions have been produced. Therefore, to address this issue, several methods based on protein interaction network have been proposed. In this paper, we propose a shortest path-based algorithm, named SPranker, to prioritize disease-causing genes in protein interaction networks. Considering the fact that diseases with similar phenotypes are generally caused by functionally related genes, we further propose an improved algorithm SPGOranker by integrating the semantic similarity of GO annotations. SPGOranker not only considers the topological similarity between protein pairs in a protein interaction network but also takes their functional similarity into account. The proposed algorithms SPranker and SPGOranker were applied to 1598 known orphan disease-causing genes from 172 orphan diseases and compared with three state-of-the-art approaches, ICN, VS and RWR. The experimental results show that SPranker and SPGOranker outperform ICN, VS, and RWR for the prioritization of orphan disease-causing genes. Importantly, for the case study of severe combined immunodeficiency, SPranker and SPGOranker predict several novel causal genes.
Algorithm, applications and evaluation for protein comparison by Ramanujan Fourier transform.
Zhao, Jian; Wang, Jiasong; Hua, Wei; Ouyang, Pingkai
2015-12-01
The amino acid sequence of a protein determines its chemical properties, chain conformation and biological functions. Protein sequence comparison is of great importance to identify similarities of protein structures and infer their functions. Many properties of a protein correspond to the low-frequency signals within the sequence. Low frequency modes in protein sequences are linked to the secondary structures, membrane protein types, and sub-cellular localizations of the proteins. In this paper, we present Ramanujan Fourier transform (RFT) with a fast algorithm to analyze the low-frequency signals of protein sequences. The RFT method is applied to similarity analysis of protein sequences with the Resonant Recognition Model (RRM). The results show that the proposed fast RFT method on protein comparison is more efficient than commonly used discrete Fourier transform (DFT). RFT can detect common frequencies as significant feature for specific protein families, and the RFT spectrum heat-map of protein sequences demonstrates the information conservation in the sequence comparison. The proposed method offers a new tool for pattern recognition, feature extraction and structural analysis on protein sequences. Copyright © 2015 Elsevier Ltd. All rights reserved.
Conservation of hot regions in protein-protein interaction in evolution.
Hu, Jing; Li, Jiarui; Chen, Nansheng; Zhang, Xiaolong
2016-11-01
The hot regions of protein-protein interactions refer to the active area which formed by those most important residues to protein combination process. With the research development on protein interactions, lots of predicted hot regions can be discovered efficiently by intelligent computing methods, while performing biology experiments to verify each every prediction is hardly to be done due to the time-cost and the complexity of the experiment. This study based on the research of hot spot residue conservations, the proposed method is used to verify authenticity of predicted hot regions that using machine learning algorithm combined with protein's biological features and sequence conservation, though multiple sequence alignment, module substitute matrix and sequence similarity to create conservation scoring algorithm, and then using threshold module to verify the conservation tendency of hot regions in evolution. This research work gives an effective method to verify predicted hot regions in protein-protein interactions, which also provides a useful way to deeply investigate the functional activities of protein hot regions. Copyright © 2016. Published by Elsevier Inc.
Evolutionary Multiobjective Design Targeting a Field Programmable Transistor Array
NASA Technical Reports Server (NTRS)
Aguirre, Arturo Hernandez; Zebulum, Ricardo S.; Coello, Carlos Coello
2004-01-01
This paper introduces the ISPAES algorithm for circuit design targeting a Field Programmable Transistor Array (FPTA). The use of evolutionary algorithms is common in circuit design problems, where a single fitness function drives the evolution process. Frequently, the design problem is subject to several goals or operating constraints, thus, designing a suitable fitness function catching all requirements becomes an issue. Such a problem is amenable for multi-objective optimization, however, evolutionary algorithms lack an inherent mechanism for constraint handling. This paper introduces ISPAES, an evolutionary optimization algorithm enhanced with a constraint handling technique. Several design problems targeting a FPTA show the potential of our approach.
Boiler-turbine control system design using a genetic algorithm
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dimeo, R.; Lee, K.Y.
1995-12-01
This paper discusses the application of a genetic algorithm to control system design for a boiler-turbine plant. In particular the authors study the ability of the genetic algorithm to develop a proportional-integral (PI) controller and a state feedback controller for a non-linear multi-input/multi-output (MIMO) plant model. The plant model is presented along with a discussion of the inherent difficulties in such controller development. A sketch of the genetic algorithm (GA) is presented and its strategy as a method of control system design is discussed. Results are presented for two different control systems that have been designed with the genetic algorithm.
A Modified Particle Swarm Optimization Technique for Finding Optimal Designs for Mixture Models
Wong, Weng Kee; Chen, Ray-Bing; Huang, Chien-Chih; Wang, Weichung
2015-01-01
Particle Swarm Optimization (PSO) is a meta-heuristic algorithm that has been shown to be successful in solving a wide variety of real and complicated optimization problems in engineering and computer science. This paper introduces a projection based PSO technique, named ProjPSO, to efficiently find different types of optimal designs, or nearly optimal designs, for mixture models with and without constraints on the components, and also for related models, like the log contrast models. We also compare the modified PSO performance with Fedorov's algorithm, a popular algorithm used to generate optimal designs, Cocktail algorithm, and the recent algorithm proposed by [1]. PMID:26091237
Smelter, Andrey; Rouchka, Eric C; Moseley, Hunter N B
2017-08-01
Peak lists derived from nuclear magnetic resonance (NMR) spectra are commonly used as input data for a variety of computer assisted and automated analyses. These include automated protein resonance assignment and protein structure calculation software tools. Prior to these analyses, peak lists must be aligned to each other and sets of related peaks must be grouped based on common chemical shift dimensions. Even when programs can perform peak grouping, they require the user to provide uniform match tolerances or use default values. However, peak grouping is further complicated by multiple sources of variance in peak position limiting the effectiveness of grouping methods that utilize uniform match tolerances. In addition, no method currently exists for deriving peak positional variances from single peak lists for grouping peaks into spin systems, i.e. spin system grouping within a single peak list. Therefore, we developed a complementary pair of peak list registration analysis and spin system grouping algorithms designed to overcome these limitations. We have implemented these algorithms into an approach that can identify multiple dimension-specific positional variances that exist in a single peak list and group peaks from a single peak list into spin systems. The resulting software tools generate a variety of useful statistics on both a single peak list and pairwise peak list alignment, especially for quality assessment of peak list datasets. We used a range of low and high quality experimental solution NMR and solid-state NMR peak lists to assess performance of our registration analysis and grouping algorithms. Analyses show that an algorithm using a single iteration and uniform match tolerances approach is only able to recover from 50 to 80% of the spin systems due to the presence of multiple sources of variance. Our algorithm recovers additional spin systems by reevaluating match tolerances in multiple iterations. To facilitate evaluation of the algorithms, we developed a peak list simulator within our nmrstarlib package that generates user-defined assigned peak lists from a given BMRB entry or database of entries. In addition, over 100,000 simulated peak lists with one or two sources of variance were generated to evaluate the performance and robustness of these new registration analysis and peak grouping algorithms.
Güssregen, Stefan; Matter, Hans; Hessler, Gerhard; Lionta, Evanthia; Heil, Jochen; Kast, Stefan M
2017-07-24
Water molecules play an essential role for mediating interactions between ligands and protein binding sites. Displacement of specific water molecules can favorably modulate the free energy of binding of protein-ligand complexes. Here, the nature of water interactions in protein binding sites is investigated by 3D RISM (three-dimensional reference interaction site model) integral equation theory to understand and exploit local thermodynamic features of water molecules by ranking their possible displacement in structure-based design. Unlike molecular dynamics-based approaches, 3D RISM theory allows for fast and noise-free calculations using the same detailed level of solute-solvent interaction description. Here we correlate molecular water entities instead of mere site density maxima with local contributions to the solvation free energy using novel algorithms. Distinct water molecules and hydration sites are investigated in multiple protein-ligand X-ray structures, namely streptavidin, factor Xa, and factor VIIa, based on 3D RISM-derived free energy density fields. Our approach allows the semiquantitative assessment of whether a given structural water molecule can potentially be targeted for replacement in structure-based design. Finally, PLS-based regression models from free energy density fields used within a 3D-QSAR approach (CARMa - comparative analysis of 3D RISM Maps) are shown to be able to extract relevant information for the interpretation of structure-activity relationship (SAR) trends, as demonstrated for a series of serine protease inhibitors.
An automated method for finding molecular complexes in large protein interaction networks
Bader, Gary D; Hogue, Christopher WV
2003-01-01
Background Recent advances in proteomics technologies such as two-hybrid, phage display and mass spectrometry have enabled us to create a detailed map of biomolecular interaction networks. Initial mapping efforts have already produced a wealth of data. As the size of the interaction set increases, databases and computational methods will be required to store, visualize and analyze the information in order to effectively aid in knowledge discovery. Results This paper describes a novel graph theoretic clustering algorithm, "Molecular Complex Detection" (MCODE), that detects densely connected regions in large protein-protein interaction networks that may represent molecular complexes. The method is based on vertex weighting by local neighborhood density and outward traversal from a locally dense seed protein to isolate the dense regions according to given parameters. The algorithm has the advantage over other graph clustering methods of having a directed mode that allows fine-tuning of clusters of interest without considering the rest of the network and allows examination of cluster interconnectivity, which is relevant for protein networks. Protein interaction and complex information from the yeast Saccharomyces cerevisiae was used for evaluation. Conclusion Dense regions of protein interaction networks can be found, based solely on connectivity data, many of which correspond to known protein complexes. The algorithm is not affected by a known high rate of false positives in data from high-throughput interaction techniques. The program is available from . PMID:12525261
Description and control of dissociation channels in gas-phase protein complexes
NASA Astrophysics Data System (ADS)
Thachuk, Mark; Fegan, Sarah K.; Raheem, Nigare
2016-08-01
Using molecular dynamics simulations of a coarse-grained model of the charged apo-hemoglobin protein complex, this work expands upon our initial report [S. K. Fegan and M. Thachuk, J. Am. Soc. Mass Spectrom. 25, 722-728 (2014)] about control of dissociation channels in the gas phase using specially designed charge tags. Employing a charge hopping algorithm and a range of temperatures, a variety of dissociation channels are found for activated gas-phase protein complexes. At low temperatures, a single monomer unfolds and becomes charge enriched. At higher temperatures, two additional channels open: (i) two monomers unfold and charge enrich and (ii) two monomers compete for unfolding with one eventually dominating and the other reattaching to the complex. At even higher temperatures, other more complex dissociation channels open with three or more monomers competing for unfolding. A model charge tag with five sites is specially designed to either attract or exclude charges. By attaching this tag to the N-terminus of specific monomers, the unfolding of those monomers can be decidedly enhanced or suppressed. In other words, using charge tags to direct the motion of charges in a protein complex provides a mechanism for controlling dissociation. This technique could be used in mass spectrometry experiments to direct forces at specific attachment points in a protein complex, and hence increase the diversity of product channels available for quantitative analysis. In turn, this could provide insight into the function of the protein complex in its native biological environment. From a dynamics perspective, this system provides an interesting example of cooperative behaviour involving motions with differing time scales.
Defining an essence of structure determining residue contacts in proteins.
Sathyapriya, R; Duarte, Jose M; Stehr, Henning; Filippis, Ioannis; Lappe, Michael
2009-12-01
The network of native non-covalent residue contacts determines the three-dimensional structure of a protein. However, not all contacts are of equal structural significance, and little knowledge exists about a minimal, yet sufficient, subset required to define the global features of a protein. Characterisation of this "structural essence" has remained elusive so far: no algorithmic strategy has been devised to-date that could outperform a random selection in terms of 3D reconstruction accuracy (measured as the Ca RMSD). It is not only of theoretical interest (i.e., for design of advanced statistical potentials) to identify the number and nature of essential native contacts-such a subset of spatial constraints is very useful in a number of novel experimental methods (like EPR) which rely heavily on constraint-based protein modelling. To derive accurate three-dimensional models from distance constraints, we implemented a reconstruction pipeline using distance geometry. We selected a test-set of 12 protein structures from the four major SCOP fold classes and performed our reconstruction analysis. As a reference set, series of random subsets (ranging from 10% to 90% of native contacts) are generated for each protein, and the reconstruction accuracy is computed for each subset. We have developed a rational strategy, termed "cone-peeling" that combines sequence features and network descriptors to select minimal subsets that outperform the reference sets. We present, for the first time, a rational strategy to derive a structural essence of residue contacts and provide an estimate of the size of this minimal subset. Our algorithm computes sparse subsets capable of determining the tertiary structure at approximately 4.8 A Ca RMSD with as little as 8% of the native contacts (Ca-Ca and Cb-Cb). At the same time, a randomly chosen subset of native contacts needs about twice as many contacts to reach the same level of accuracy. This "structural essence" opens new avenues in the fields of structure prediction, empirical potentials and docking.
Defining an Essence of Structure Determining Residue Contacts in Proteins
Sathyapriya, R.; Duarte, Jose M.; Stehr, Henning; Filippis, Ioannis; Lappe, Michael
2009-01-01
The network of native non-covalent residue contacts determines the three-dimensional structure of a protein. However, not all contacts are of equal structural significance, and little knowledge exists about a minimal, yet sufficient, subset required to define the global features of a protein. Characterisation of this “structural essence” has remained elusive so far: no algorithmic strategy has been devised to-date that could outperform a random selection in terms of 3D reconstruction accuracy (measured as the Ca RMSD). It is not only of theoretical interest (i.e., for design of advanced statistical potentials) to identify the number and nature of essential native contacts—such a subset of spatial constraints is very useful in a number of novel experimental methods (like EPR) which rely heavily on constraint-based protein modelling. To derive accurate three-dimensional models from distance constraints, we implemented a reconstruction pipeline using distance geometry. We selected a test-set of 12 protein structures from the four major SCOP fold classes and performed our reconstruction analysis. As a reference set, series of random subsets (ranging from 10% to 90% of native contacts) are generated for each protein, and the reconstruction accuracy is computed for each subset. We have developed a rational strategy, termed “cone-peeling” that combines sequence features and network descriptors to select minimal subsets that outperform the reference sets. We present, for the first time, a rational strategy to derive a structural essence of residue contacts and provide an estimate of the size of this minimal subset. Our algorithm computes sparse subsets capable of determining the tertiary structure at approximately 4.8 Å Ca RMSD with as little as 8% of the native contacts (Ca-Ca and Cb-Cb). At the same time, a randomly chosen subset of native contacts needs about twice as many contacts to reach the same level of accuracy. This “structural essence” opens new avenues in the fields of structure prediction, empirical potentials and docking. PMID:19997489
A real-time MTFC algorithm of space remote-sensing camera based on FPGA
NASA Astrophysics Data System (ADS)
Zhao, Liting; Huang, Gang; Lin, Zhe
2018-01-01
A real-time MTFC algorithm of space remote-sensing camera based on FPGA was designed. The algorithm can provide real-time image processing to enhance image clarity when the remote-sensing camera running on-orbit. The image restoration algorithm adopted modular design. The MTF measurement calculation module on-orbit had the function of calculating the edge extension function, line extension function, ESF difference operation, normalization MTF and MTFC parameters. The MTFC image filtering and noise suppression had the function of filtering algorithm and effectively suppressing the noise. The algorithm used System Generator to design the image processing algorithms to simplify the design structure of system and the process redesign. The image gray gradient dot sharpness edge contrast and median-high frequency were enhanced. The image SNR after recovery reduced less than 1 dB compared to the original image. The image restoration system can be widely used in various fields.
SABER: A computational method for identifying active sites for new reactions
Nosrati, Geoffrey R; Houk, K N
2012-01-01
A software suite, SABER (Selection of Active/Binding sites for Enzyme Redesign), has been developed for the analysis of atomic geometries in protein structures, using a geometric hashing algorithm (Barker and Thornton, Bioinformatics 2003;19:1644–1649). SABER is used to explore the Protein Data Bank (PDB) to locate proteins with a specific 3D arrangement of catalytic groups to identify active sites that might be redesigned to catalyze new reactions. As a proof-of-principle test, SABER was used to identify enzymes that have the same catalytic group arrangement present in o-succinyl benzoate synthase (OSBS). Among the highest-scoring scaffolds identified by the SABER search for enzymes with the same catalytic group arrangement as OSBS were l-Ala d/l-Glu epimerase (AEE) and muconate lactonizing enzyme II (MLE), both of which have been redesigned to become effective OSBS catalysts, demonstrated by experiments. Next, we used SABER to search for naturally existing active sites in the PDB with catalytic groups similar to those present in the designed Kemp elimination enzyme KE07. From over 2000 geometric matches to the KE07 active site, SABER identified 23 matches that corresponded to residues from known active sites. The best of these matches, with a 0.28 Å catalytic atom RMSD to KE07, was then redesigned to be compatible with the Kemp elimination using RosettaDesign. We also used SABER to search for potential Kemp eliminases using a theozyme predicted to provide a greater rate acceleration than the active site of KE07, and used Rosetta to create a design based on the proteins identified. PMID:22492397
Zemla, Adam T; Lang, Dorothy M; Kostova, Tanya; Andino, Raul; Ecale Zhou, Carol L
2011-06-02
Most of the currently used methods for protein function prediction rely on sequence-based comparisons between a query protein and those for which a functional annotation is provided. A serious limitation of sequence similarity-based approaches for identifying residue conservation among proteins is the low confidence in assigning residue-residue correspondences among proteins when the level of sequence identity between the compared proteins is poor. Multiple sequence alignment methods are more satisfactory--still, they cannot provide reliable results at low levels of sequence identity. Our goal in the current work was to develop an algorithm that could help overcome these difficulties by facilitating the identification of structurally (and possibly functionally) relevant residue-residue correspondences between compared protein structures. Here we present StralSV (structure-alignment sequence variability), a new algorithm for detecting closely related structure fragments and quantifying residue frequency from tight local structure alignments. We apply StralSV in a study of the RNA-dependent RNA polymerase of poliovirus, and we demonstrate that the algorithm can be used to determine regions of the protein that are relatively unique, or that share structural similarity with proteins that would be considered distantly related. By quantifying residue frequencies among many residue-residue pairs extracted from local structural alignments, one can infer potential structural or functional importance of specific residues that are determined to be highly conserved or that deviate from a consensus. We further demonstrate that considerable detailed structural and phylogenetic information can be derived from StralSV analyses. StralSV is a new structure-based algorithm for identifying and aligning structure fragments that have similarity to a reference protein. StralSV analysis can be used to quantify residue-residue correspondences and identify residues that may be of particular structural or functional importance, as well as unusual or unexpected residues at a given sequence position. StralSV is provided as a web service at http://proteinmodel.org/AS2TS/STRALSV/.
Richardson, Keith; Denny, Richard; Hughes, Chris; Skilling, John; Sikora, Jacek; Dadlez, Michał; Manteca, Angel; Jung, Hye Ryung; Jensen, Ole Nørregaard; Redeker, Virginie; Melki, Ronald; Langridge, James I.; Vissers, Johannes P.C.
2013-01-01
A probability-based quantification framework is presented for the calculation of relative peptide and protein abundance in label-free and label-dependent LC-MS proteomics data. The results are accompanied by credible intervals and regulation probabilities. The algorithm takes into account data uncertainties via Poisson statistics modified by a noise contribution that is determined automatically during an initial normalization stage. Protein quantification relies on assignments of component peptides to the acquired data. These assignments are generally of variable reliability and may not be present across all of the experiments comprising an analysis. It is also possible for a peptide to be identified to more than one protein in a given mixture. For these reasons the algorithm accepts a prior probability of peptide assignment for each intensity measurement. The model is constructed in such a way that outliers of any type can be automatically reweighted. Two discrete normalization methods can be employed. The first method is based on a user-defined subset of peptides, while the second method relies on the presence of a dominant background of endogenous peptides for which the concentration is assumed to be unaffected. Normalization is performed using the same computational and statistical procedures employed by the main quantification algorithm. The performance of the algorithm will be illustrated on example data sets, and its utility demonstrated for typical proteomics applications. The quantification algorithm supports relative protein quantification based on precursor and product ion intensities acquired by means of data-dependent methods, originating from all common isotopically-labeled approaches, as well as label-free ion intensity-based data-independent methods. PMID:22871168
Algorithm for designing smart factory Industry 4.0
NASA Astrophysics Data System (ADS)
Gurjanov, A. V.; Zakoldaev, D. A.; Shukalov, A. V.; Zharinov, I. O.
2018-03-01
The designing task of production division of the Industry 4.0 item designing company is being studied. The authors proposed an algorithm, which is based on the modified V L Volkovich method. This algorithm allows generating options how to arrange the production with robotized technological equipment functioning in the automatic mode. The optimization solution of the multi-criteria task for some additive criteria is the base of the algorithm.
Algorithms for output feedback, multiple-model, and decentralized control problems
NASA Technical Reports Server (NTRS)
Halyo, N.; Broussard, J. R.
1984-01-01
The optimal stochastic output feedback, multiple-model, and decentralized control problems with dynamic compensation are formulated and discussed. Algorithms for each problem are presented, and their relationship to a basic output feedback algorithm is discussed. An aircraft control design problem is posed as a combined decentralized, multiple-model, output feedback problem. A control design is obtained using the combined algorithm. An analysis of the design is presented.
Predicting protein functions from redundancies in large-scale protein interaction networks
NASA Technical Reports Server (NTRS)
Samanta, Manoj Pratim; Liang, Shoudan
2003-01-01
Interpreting data from large-scale protein interaction experiments has been a challenging task because of the widespread presence of random false positives. Here, we present a network-based statistical algorithm that overcomes this difficulty and allows us to derive functions of unannotated proteins from large-scale interaction data. Our algorithm uses the insight that if two proteins share significantly larger number of common interaction partners than random, they have close functional associations. Analysis of publicly available data from Saccharomyces cerevisiae reveals >2,800 reliable functional associations, 29% of which involve at least one unannotated protein. By further analyzing these associations, we derive tentative functions for 81 unannotated proteins with high certainty. Our method is not overly sensitive to the false positives present in the data. Even after adding 50% randomly generated interactions to the measured data set, we are able to recover almost all (approximately 89%) of the original associations.
Optimal design of low-density SNP arrays for genomic prediction: algorithm and applications
USDA-ARS?s Scientific Manuscript database
Low-density (LD) single nucleotide polymorphism (SNP) arrays provide a cost-effective solution for genomic prediction and selection, but algorithms and computational tools are needed for their optimal design. A multiple-objective, local optimization (MOLO) algorithm was developed for design of optim...
Parallel optimization algorithms and their implementation in VLSI design
NASA Technical Reports Server (NTRS)
Lee, G.; Feeley, J. J.
1991-01-01
Two new parallel optimization algorithms based on the simplex method are described. They may be executed by a SIMD parallel processor architecture and be implemented in VLSI design. Several VLSI design implementations are introduced. An application example is reported to demonstrate that the algorithms are effective.
Cyclic coordinate descent: A robotics algorithm for protein loop closure.
Canutescu, Adrian A; Dunbrack, Roland L
2003-05-01
In protein structure prediction, it is often the case that a protein segment must be adjusted to connect two fixed segments. This occurs during loop structure prediction in homology modeling as well as in ab initio structure prediction. Several algorithms for this purpose are based on the inverse Jacobian of the distance constraints with respect to dihedral angle degrees of freedom. These algorithms are sometimes unstable and fail to converge. We present an algorithm developed originally for inverse kinematics applications in robotics. In robotics, an end effector in the form of a robot hand must reach for an object in space by altering adjustable joint angles and arm lengths. In loop prediction, dihedral angles must be adjusted to move the C-terminal residue of a segment to superimpose on a fixed anchor residue in the protein structure. The algorithm, referred to as cyclic coordinate descent or CCD, involves adjusting one dihedral angle at a time to minimize the sum of the squared distances between three backbone atoms of the moving C-terminal anchor and the corresponding atoms in the fixed C-terminal anchor. The result is an equation in one variable for the proposed change in each dihedral. The algorithm proceeds iteratively through all of the adjustable dihedral angles from the N-terminal to the C-terminal end of the loop. CCD is suitable as a component of loop prediction methods that generate large numbers of trial structures. It succeeds in closing loops in a large test set 99.79% of the time, and fails occasionally only for short, highly extended loops. It is very fast, closing loops of length 8 in 0.037 sec on average.
Design optimization of steel frames using an enhanced firefly algorithm
NASA Astrophysics Data System (ADS)
Carbas, Serdar
2016-12-01
Mathematical modelling of real-world-sized steel frames under the Load and Resistance Factor Design-American Institute of Steel Construction (LRFD-AISC) steel design code provisions, where the steel profiles for the members are selected from a table of steel sections, turns out to be a discrete nonlinear programming problem. Finding the optimum design of such design optimization problems using classical optimization techniques is difficult. Metaheuristic algorithms provide an alternative way of solving such problems. The firefly algorithm (FFA) belongs to the swarm intelligence group of metaheuristics. The standard FFA has the drawback of being caught up in local optima in large-sized steel frame design problems. This study attempts to enhance the performance of the FFA by suggesting two new expressions for the attractiveness and randomness parameters of the algorithm. Two real-world-sized design examples are designed by the enhanced FFA and its performance is compared with standard FFA as well as with particle swarm and cuckoo search algorithms.
Flexible Space-Filling Designs for Complex System Simulations
2013-06-01
interior of the experimental region and cannot fit higher-order models. We present a genetic algorithm that constructs space-filling designs with...Computer Experiments, Design of Experiments, Genetic Algorithm , Latin Hypercube, Response Surface Methodology, Nearly Orthogonal 15. NUMBER OF PAGES 147...experimental region and cannot fit higher-order models. We present a genetic algorithm that constructs space-filling designs with minimal correlations
Genetic algorithms in conceptual design of a light-weight, low-noise, tilt-rotor aircraft
NASA Technical Reports Server (NTRS)
Wells, Valana L.
1996-01-01
This report outlines research accomplishments in the area of using genetic algorithms (GA) for the design and optimization of rotorcraft. It discusses the genetic algorithm as a search and optimization tool, outlines a procedure for using the GA in the conceptual design of helicopters, and applies the GA method to the acoustic design of rotors.
Expert-guided evolutionary algorithm for layout design of complex space stations
NASA Astrophysics Data System (ADS)
Qian, Zhiqin; Bi, Zhuming; Cao, Qun; Ju, Weiguo; Teng, Hongfei; Zheng, Yang; Zheng, Siyu
2017-08-01
The layout of a space station should be designed in such a way that different equipment and instruments are placed for the station as a whole to achieve the best overall performance. The station layout design is a typical nondeterministic polynomial problem. In particular, how to manage the design complexity to achieve an acceptable solution within a reasonable timeframe poses a great challenge. In this article, a new evolutionary algorithm has been proposed to meet such a challenge. It is called as the expert-guided evolutionary algorithm with a tree-like structure decomposition (EGEA-TSD). Two innovations in EGEA-TSD are (i) to deal with the design complexity, the entire design space is divided into subspaces with a tree-like structure; it reduces the computation and facilitates experts' involvement in the solving process. (ii) A human-intervention interface is developed to allow experts' involvement in avoiding local optimums and accelerating convergence. To validate the proposed algorithm, the layout design of one-space station is formulated as a multi-disciplinary design problem, the developed algorithm is programmed and executed, and the result is compared with those from other two algorithms; it has illustrated the superior performance of the proposed EGEA-TSD.
Detection of confinement and jumps in single-molecule membrane trajectories
NASA Astrophysics Data System (ADS)
Meilhac, N.; Le Guyader, L.; Salomé, L.; Destainville, N.
2006-01-01
We propose a variant of the algorithm by [R. Simson, E. D. Sheets, and K. Jacobson, Biophys. 69, 989 (1995)]. Their algorithm was developed to detect transient confinement zones in experimental single-particle tracking trajectories of diffusing membrane proteins or lipids. We show that our algorithm is able to detect confinement in a wider class of confining potential shapes than that of Simson Furthermore, it enables to detect not only temporary confinement but also jumps between confinement zones. Jumps are predicted by membrane skeleton fence and picket models. In the case of experimental trajectories of μ -opioid receptors, which belong to the family of G-protein-coupled receptors involved in a signal transduction pathway, this algorithm confirms that confinement cannot be explained solely by rigid fences.
A generalized algorithm to design finite field normal basis multipliers
NASA Technical Reports Server (NTRS)
Wang, C. C.
1986-01-01
Finite field arithmetic logic is central in the implementation of some error-correcting coders and some cryptographic devices. There is a need for good multiplication algorithms which can be easily realized. Massey and Omura recently developed a new multiplication algorithm for finite fields based on a normal basis representation. Using the normal basis representation, the design of the finite field multiplier is simple and regular. The fundamental design of the Massey-Omura multiplier is based on a design of a product function. In this article, a generalized algorithm to locate a normal basis in a field is first presented. Using this normal basis, an algorithm to construct the product function is then developed. This design does not depend on particular characteristics of the generator polynomial of the field.
A new algorithm for reliable and general NMR resonance assignment.
Schmidt, Elena; Güntert, Peter
2012-08-01
The new FLYA automated resonance assignment algorithm determines NMR chemical shift assignments on the basis of peak lists from any combination of multidimensional through-bond or through-space NMR experiments for proteins. Backbone and side-chain assignments can be determined. All experimental data are used simultaneously, thereby exploiting optimally the redundancy present in the input peak lists and circumventing potential pitfalls of assignment strategies in which results obtained in a given step remain fixed input data for subsequent steps. Instead of prescribing a specific assignment strategy, the FLYA resonance assignment algorithm requires only experimental peak lists and the primary structure of the protein, from which the peaks expected in a given spectrum can be generated by applying a set of rules, defined in a straightforward way by specifying through-bond or through-space magnetization transfer pathways. The algorithm determines the resonance assignment by finding an optimal mapping between the set of expected peaks that are assigned by definition but have unknown positions and the set of measured peaks in the input peak lists that are initially unassigned but have a known position in the spectrum. Using peak lists obtained by purely automated peak picking from the experimental spectra of three proteins, FLYA assigned correctly 96-99% of the backbone and 90-91% of all resonances that could be assigned manually. Systematic studies quantified the impact of various factors on the assignment accuracy, namely the extent of missing real peaks and the amount of additional artifact peaks in the input peak lists, as well as the accuracy of the peak positions. Comparing the resonance assignments from FLYA with those obtained from two other existing algorithms showed that using identical experimental input data these other algorithms yielded significantly (40-142%) more erroneous assignments than FLYA. The FLYA resonance assignment algorithm thus has the reliability and flexibility to replace most manual and semi-automatic assignment procedures for NMR studies of proteins.
Predicting Protein Structure Using Parallel Genetic Algorithms.
1994-12-01
Molecular dynamics attempts to simulate the protein folding process. However, the time steps required for this simulation are on the order of one...harmonics. These two factors have limited molecular dynamics simulations to less than a few nanoseconds (10-9 sec), even on today’s fastest supercomputers...By " Predicting rotein Structure D istribticfiar.. ................ Using Parallel Genetic Algorithms ,Avaiu " ’ •"... Dist THESIS I IGeorge H
Kotrri, Gynter; Fusch, Gerhard; Kwan, Celia; Choi, Dasol; Choi, Arum; Al Kafi, Nisreen; Rochow, Niels; Fusch, Christoph
2016-02-26
Commercial infrared (IR) milk analyzers are being increasingly used in research settings for the macronutrient measurement of breast milk (BM) prior to its target fortification. These devices, however, may not provide reliable measurement if not properly calibrated. In the current study, we tested a correction algorithm for a Near-IR milk analyzer (Unity SpectraStar, Brookfield, CT, USA) for fat and protein measurements, and examined the effect of pasteurization on the IR matrix and the stability of fat, protein, and lactose. Measurement values generated through Near-IR analysis were compared against those obtained through chemical reference methods to test the correction algorithm for the Near-IR milk analyzer. Macronutrient levels were compared between unpasteurized and pasteurized milk samples to determine the effect of pasteurization on macronutrient stability. The correction algorithm generated for our device was found to be valid for unpasteurized and pasteurized BM. Pasteurization had no effect on the macronutrient levels and the IR matrix of BM. These results show that fat and protein content can be accurately measured and monitored for unpasteurized and pasteurized BM. Of additional importance is the implication that donated human milk, generally low in protein content, has the potential to be target fortified.
Kotrri, Gynter; Fusch, Gerhard; Kwan, Celia; Choi, Dasol; Choi, Arum; Al Kafi, Nisreen; Rochow, Niels; Fusch, Christoph
2016-01-01
Commercial infrared (IR) milk analyzers are being increasingly used in research settings for the macronutrient measurement of breast milk (BM) prior to its target fortification. These devices, however, may not provide reliable measurement if not properly calibrated. In the current study, we tested a correction algorithm for a Near-IR milk analyzer (Unity SpectraStar, Brookfield, CT, USA) for fat and protein measurements, and examined the effect of pasteurization on the IR matrix and the stability of fat, protein, and lactose. Measurement values generated through Near-IR analysis were compared against those obtained through chemical reference methods to test the correction algorithm for the Near-IR milk analyzer. Macronutrient levels were compared between unpasteurized and pasteurized milk samples to determine the effect of pasteurization on macronutrient stability. The correction algorithm generated for our device was found to be valid for unpasteurized and pasteurized BM. Pasteurization had no effect on the macronutrient levels and the IR matrix of BM. These results show that fat and protein content can be accurately measured and monitored for unpasteurized and pasteurized BM. Of additional importance is the implication that donated human milk, generally low in protein content, has the potential to be target fortified. PMID:26927169
Effects of visualization on algorithm comprehension
NASA Astrophysics Data System (ADS)
Mulvey, Matthew
Computer science students are expected to learn and apply a variety of core algorithms which are an essential part of the field. Any one of these algorithms by itself is not necessarily extremely complex, but remembering the large variety of algorithms and the differences between them is challenging. To address this challenge, we present a novel algorithm visualization tool designed to enhance students understanding of Dijkstra's algorithm by allowing them to discover the rules of the algorithm for themselves. It is hoped that a deeper understanding of the algorithm will help students correctly select, adapt and apply the appropriate algorithm when presented with a problem to solve, and that what is learned here will be applicable to the design of other visualization tools designed to teach different algorithms. Our visualization tool is currently in the prototype stage, and this thesis will discuss the pedagogical approach that informs its design, as well as the results of some initial usability testing. Finally, to clarify the direction for further development of the tool, four different variations of the prototype were implemented, and the instructional effectiveness of each was assessed by having a small sample participants use the different versions of the prototype and then take a quiz to assess their comprehension of the algorithm.
Rathinam, Maniraj; Singh, Shweta; Pattanayak, Debasis; Sreevathsa, Rohini
2017-08-02
Development of chimeric Cry toxins by protein engineering of known and validated proteins is imperative for enhancing the efficacy and broadening the insecticidal spectrum of these genes. Expression of novel Cry proteins in food crops has however created apprehensions with respect to the safety aspects. To clarify this, premarket evaluation consisting of an array of analyses to evaluate the unintended effects is a prerequisite to provide safety assurance to the consumers. Additionally, series of bioinformatic tools as in silico aids are being used to evaluate the likely allergenic reaction of the proteins based on sequence and epitope similarity with known allergens. In the present study, chimeric Cry toxins developed through protein engineering were evaluated for allergenic potential using various in silico algorithms. Major emphasis was on the validation of allergenic potential on three aspects of paramount significance viz., sequence-based homology between allergenic proteins, validation of conformational epitopes towards identification of food allergens and physico-chemical properties of amino acids. Additionally, in vitro analysis pertaining to heat stability of two of the eight chimeric proteins and pepsin digestibility further demonstrated the non-allergenic potential of these chimeric toxins. The study revealed for the first time an all-encompassing evaluation that the recombinant Cry proteins did not show any potential similarity with any known allergens with respect to the parameters generally considered for a protein to be designated as an allergen. These novel chimeric proteins hence can be considered safe to be introgressed into plants.
Krill herd and piecewise-linear initialization algorithms for designing Takagi-Sugeno systems
NASA Astrophysics Data System (ADS)
Hodashinsky, I. A.; Filimonenko, I. V.; Sarin, K. S.
2017-07-01
A method for designing Takagi-Sugeno fuzzy systems is proposed which uses a piecewiselinear initialization algorithm for structure generation and a metaheuristic krill herd algorithm for parameter optimization. The obtained systems are tested against real data sets. The influence of some parameters of this algorithm on the approximation accuracy is analyzed. Estimates of the approximation accuracy and the number of fuzzy rules are compared with four known methods of design.
NASA Technical Reports Server (NTRS)
Roth, J. P.
1972-01-01
Methods for development of logic design together with algorithms for failure testing, a method for design of logic for ultra-large-scale integration, extension of quantum calculus to describe the functional behavior of a mechanism component-by-component and to computer tests for failures in the mechanism using the diagnosis algorithm, and the development of an algorithm for the multi-output 2-level minimization problem are discussed.
Optimal de novo design of MRM experiments for rapid assay development in targeted proteomics.
Bertsch, Andreas; Jung, Stephan; Zerck, Alexandra; Pfeifer, Nico; Nahnsen, Sven; Henneges, Carsten; Nordheim, Alfred; Kohlbacher, Oliver
2010-05-07
Targeted proteomic approaches such as multiple reaction monitoring (MRM) overcome problems associated with classical shotgun mass spectrometry experiments. Developing MRM quantitation assays can be time consuming, because relevant peptide representatives of the proteins must be found and their retention time and the product ions must be determined. Given the transitions, hundreds to thousands of them can be scheduled into one experiment run. However, it is difficult to select which of the transitions should be included into a measurement. We present a novel algorithm that allows the construction of MRM assays from the sequence of the targeted proteins alone. This enables the rapid development of targeted MRM experiments without large libraries of transitions or peptide spectra. The approach relies on combinatorial optimization in combination with machine learning techniques to predict proteotypicity, retention time, and fragmentation of peptides. The resulting potential transitions are scheduled optimally by solving an integer linear program. We demonstrate that fully automated construction of MRM experiments from protein sequences alone is possible and over 80% coverage of the targeted proteins can be achieved without further optimization of the assay.
Reliable numerical computation in an optimal output-feedback design
NASA Technical Reports Server (NTRS)
Vansteenwyk, Brett; Ly, Uy-Loi
1991-01-01
A reliable algorithm is presented for the evaluation of a quadratic performance index and its gradients with respect to the controller design parameters. The algorithm is a part of a design algorithm for optimal linear dynamic output-feedback controller that minimizes a finite-time quadratic performance index. The numerical scheme is particularly robust when it is applied to the control-law synthesis for systems with densely packed modes and where there is a high likelihood of encountering degeneracies in the closed-loop eigensystem. This approach through the use of an accurate Pade series approximation does not require the closed-loop system matrix to be diagonalizable. The algorithm was included in a control design package for optimal robust low-order controllers. Usefulness of the proposed numerical algorithm was demonstrated using numerous practical design cases where degeneracies occur frequently in the closed-loop system under an arbitrary controller design initialization and during the numerical search.
A Parallel Particle Swarm Optimization Algorithm Accelerated by Asynchronous Evaluations
NASA Technical Reports Server (NTRS)
Venter, Gerhard; Sobieszczanski-Sobieski, Jaroslaw
2005-01-01
A parallel Particle Swarm Optimization (PSO) algorithm is presented. Particle swarm optimization is a fairly recent addition to the family of non-gradient based, probabilistic search algorithms that is based on a simplified social model and is closely tied to swarming theory. Although PSO algorithms present several attractive properties to the designer, they are plagued by high computational cost as measured by elapsed time. One approach to reduce the elapsed time is to make use of coarse-grained parallelization to evaluate the design points. Previous parallel PSO algorithms were mostly implemented in a synchronous manner, where all design points within a design iteration are evaluated before the next iteration is started. This approach leads to poor parallel speedup in cases where a heterogeneous parallel environment is used and/or where the analysis time depends on the design point being analyzed. This paper introduces an asynchronous parallel PSO algorithm that greatly improves the parallel e ciency. The asynchronous algorithm is benchmarked on a cluster assembled of Apple Macintosh G5 desktop computers, using the multi-disciplinary optimization of a typical transport aircraft wing as an example.
Prediction of protein-protein interaction network using a multi-objective optimization approach.
Chowdhury, Archana; Rakshit, Pratyusha; Konar, Amit
2016-06-01
Protein-Protein Interactions (PPIs) are very important as they coordinate almost all cellular processes. This paper attempts to formulate PPI prediction problem in a multi-objective optimization framework. The scoring functions for the trial solution deal with simultaneous maximization of functional similarity, strength of the domain interaction profiles, and the number of common neighbors of the proteins predicted to be interacting. The above optimization problem is solved using the proposed Firefly Algorithm with Nondominated Sorting. Experiments undertaken reveal that the proposed PPI prediction technique outperforms existing methods, including gene ontology-based Relative Specific Similarity, multi-domain-based Domain Cohesion Coupling method, domain-based Random Decision Forest method, Bagging with REP Tree, and evolutionary/swarm algorithm-based approaches, with respect to sensitivity, specificity, and F1 score.
Lorenzo, J Ramiro; Alonso, Leonardo G; Sánchez, Ignacio E
2015-01-01
Asparagine residues in proteins undergo spontaneous deamidation, a post-translational modification that may act as a molecular clock for the regulation of protein function and turnover. Asparagine deamidation is modulated by protein local sequence, secondary structure and hydrogen bonding. We present NGOME, an algorithm able to predict non-enzymatic deamidation of internal asparagine residues in proteins in the absence of structural data, using sequence-based predictions of secondary structure and intrinsic disorder. Compared to previous algorithms, NGOME does not require three-dimensional structures yet yields better predictions than available sequence-only methods. Four case studies of specific proteins show how NGOME may help the user identify deamidation-prone asparagine residues, often related to protein gain of function, protein degradation or protein misfolding in pathological processes. A fifth case study applies NGOME at a proteomic scale and unveils a correlation between asparagine deamidation and protein degradation in yeast. NGOME is freely available as a webserver at the National EMBnet node Argentina, URL: http://www.embnet.qb.fcen.uba.ar/ in the subpage "Protein and nucleic acid structure and sequence analysis".
Quantum mechanical design of enzyme active sites.
Zhang, Xiyun; DeChancie, Jason; Gunaydin, Hakan; Chowdry, Arnab B; Clemente, Fernando R; Smith, Adam J T; Handel, T M; Houk, K N
2008-02-01
The design of active sites has been carried out using quantum mechanical calculations to predict the rate-determining transition state of a desired reaction in presence of the optimal arrangement of catalytic functional groups (theozyme). Eleven versatile reaction targets were chosen, including hydrolysis, dehydration, isomerization, aldol, and Diels-Alder reactions. For each of the targets, the predicted mechanism and the rate-determining transition state (TS) of the uncatalyzed reaction in water is presented. For the rate-determining TS, a catalytic site was designed using naturalistic catalytic units followed by an estimation of the rate acceleration provided by a reoptimization of the catalytic site. Finally, the geometries of the sites were compared to the X-ray structures of related natural enzymes. Recent advances in computational algorithms and power, coupled with successes in computational protein design, have provided a powerful context for undertaking such an endeavor. We propose that theozymes are excellent candidates to serve as the active site models for design processes.
MSPocket: an orientation-independent algorithm for the detection of ligand binding pockets.
Zhu, Hongbo; Pisabarro, M Teresa
2011-02-01
Identification of ligand binding pockets on proteins is crucial for the characterization of protein functions. It provides valuable information for protein-ligand docking and rational engineering of small molecules that regulate protein functions. A major number of current prediction algorithms of ligand binding pockets are based on cubic grid representation of proteins and, thus, the results are often protein orientation dependent. We present the MSPocket program for detecting pockets on the solvent excluded surface of proteins. The core algorithm of the MSPocket approach does not use any cubic grid system to represent proteins and is therefore independent of protein orientations. We demonstrate that MSPocket is able to achieve an accuracy of 75% in predicting ligand binding pockets on a test dataset used for evaluating several existing methods. The accuracy is 92% if the top three predictions are considered. Comparison to one of the recently published best performing methods shows that MSPocket reaches similar performance with the additional feature of being protein orientation independent. Interestingly, some of the predictions are different, meaning that the two methods can be considered complementary and combined to achieve better prediction accuracy. MSPocket also provides a graphical user interface for interactive investigation of the predicted ligand binding pockets. In addition, we show that overlap criterion is a better strategy for the evaluation of predicted ligand binding pockets than the single point distance criterion. The MSPocket source code can be downloaded from http://appserver.biotec.tu-dresden.de/MSPocket/. MSPocket is also available as a PyMOL plugin with a graphical user interface.
A comparison of kinematic algorithms to estimate gait events during overground running.
Smith, Laura; Preece, Stephen; Mason, Duncan; Bramah, Christopher
2015-01-01
The gait cycle is frequently divided into two distinct phases, stance and swing, which can be accurately determined from ground reaction force data. In the absence of such data, kinematic algorithms can be used to estimate footstrike and toe-off. The performance of previously published algorithms is not consistent between studies. Furthermore, previous algorithms have not been tested at higher running speeds nor used to estimate ground contact times. Therefore the purpose of this study was to both develop a new, custom-designed, event detection algorithm and compare its performance with four previously tested algorithms at higher running speeds. Kinematic and force data were collected on twenty runners during overground running at 5.6m/s. The five algorithms were then implemented and estimated times for footstrike, toe-off and contact time were compared to ground reaction force data. There were large differences in the performance of each algorithm. The custom-designed algorithm provided the most accurate estimation of footstrike (True Error 1.2 ± 17.1 ms) and contact time (True Error 3.5 ± 18.2 ms). Compared to the other tested algorithms, the custom-designed algorithm provided an accurate estimation of footstrike and toe-off across different footstrike patterns. The custom-designed algorithm provides a simple but effective method to accurately estimate footstrike, toe-off and contact time from kinematic data. Copyright © 2014 Elsevier B.V. All rights reserved.
Amber Plug-In for Protein Shop
DOE Office of Scientific and Technical Information (OSTI.GOV)
Oliva, Ricardo
2004-05-10
The Amber Plug-in for ProteinShop has two main components: an AmberEngine library to compute the protein energy models, and a module to solve the energy minimization problem using an optimization algorithm in the OPTI-+ library. Together, these components allow the visualization of the protein folding process in ProteinShop. AmberEngine is a object-oriented library to compute molecular energies based on the Amber model. The main class is called ProteinEnergy. Its main interface methods are (1) "init" to initialize internal variables needed to compute the energy. (2) "eval" to evaluate the total energy given a vector of coordinates. Additional methods allow themore » user to evaluate the individual components of the energy model (bond, angle, dihedral, non-bonded-1-4, and non-bonded energies) and to obtain the energy of each individual atom. The Amber Engine library source code includes examples and test routines that illustrate the use of the library in stand alone programs. The energy minimization module uses the AmberEngine library and the nonlinear optimization library OPT++. OPT++ is open source software available under the GNU Lesser General Public License. The minimization module currently makes use of the LBFGS optimization algorithm in OPT++ to perform the energy minimization. Future releases may give the user a choice of other algorithms available in OPT++.« less
NASA Astrophysics Data System (ADS)
Camilloni, Carlo; Broglia, Ricardo A.; Tiana, Guido
2011-01-01
The study of the mechanism which is at the basis of the phenomenon of protein folding requires the knowledge of multiple folding trajectories under biological conditions. Using a biasing molecular-dynamics algorithm based on the physics of the ratchet-and-pawl system, we carry out all-atom, explicit solvent simulations of the sequence of folding events which proteins G, CI2, and ACBP undergo in evolving from the denatured to the folded state. Starting from highly disordered conformations, the algorithm allows the proteins to reach, at the price of a modest computational effort, nativelike conformations, within a root mean square deviation (RMSD) of approximately 1 Å. A scheme is developed to extract, from the myriad of events, information concerning the sequence of native contact formation and of their eventual correlation. Such an analysis indicates that all the studied proteins fold hierarchically, through pathways which, although not deterministic, are well-defined with respect to the order of contact formation. The algorithm also allows one to study unfolding, a process which looks, to a large extent, like the reverse of the major folding pathway. This is also true in situations in which many pathways contribute to the folding process, like in the case of protein G.
Chang, Yaw-Jen; Chang, Cheng-Hao
2016-06-01
Based on the principle of immobilized metal affinity chromatography (IMAC), it has been found that a Ni-Co alloy-coated protein chip is able to immobilize functional proteins with a His-tag attached. In this study, an intelligent computational approach was developed to promote the performance and repeatability of a Ni-Co alloy-coated protein chip. This approach was launched out of L18 experiments. Based on the experimental data, the fabrication process model of a Ni-Co protein chip was established by using an artificial neural network, and then an optimal fabrication condition was obtained using the Taguchi genetic algorithm. The result was validated experimentally and compared with a nitrocellulose chip. Consequentially, experimental outcomes revealed that the Ni-Co alloy-coated chip, fabricated using the proposed approach, had the best performance and repeatability compared with the Ni-Co chips of an L18 orthogonal array design and the nitrocellulose chip. Moreover, the low fluorescent background of the chip surface gives a more precise fluorescent detection. Based on a small quantity of experiments, this proposed intelligent computation approach can significantly reduce the experimental cost and improve the product's quality. © 2015 Society for Laboratory Automation and Screening.
COMPASS: a suite of pre- and post-search proteomics software tools for OMSSA
Wenger, Craig D.; Phanstiel, Douglas H.; Lee, M. Violet; Bailey, Derek J.; Coon, Joshua J.
2011-01-01
Here we present the Coon OMSSA Proteomic Analysis Software Suite (COMPASS): a free and open-source software pipeline for high-throughput analysis of proteomics data, designed around the Open Mass Spectrometry Search Algorithm. We detail a synergistic set of tools for protein database generation, spectral reduction, peptide false discovery rate analysis, peptide quantitation via isobaric labeling, protein parsimony and protein false discovery rate analysis, and protein quantitation. We strive for maximum ease of use, utilizing graphical user interfaces and working with data files in the original instrument vendor format. Results are stored in plain text comma-separated values files, which are easy to view and manipulate with a text editor or spreadsheet program. We illustrate the operation and efficacy of COMPASS through the use of two LC–MS/MS datasets. The first is a dataset of a highly annotated mixture of standard proteins and manually validated contaminants that exhibits the identification workflow. The second is a dataset of yeast peptides, labeled with isobaric stable isotope tags and mixed in known ratios, to demonstrate the quantitative workflow. For these two datasets, COMPASS performs equivalently or better than the current de facto standard, the Trans-Proteomic Pipeline. PMID:21298793
Omori, Satoshi; Kitao, Akio
2013-06-01
We propose a fast clustering and reranking method, CyClus, for protein-protein docking decoys. This method enables comprehensive clustering of whole decoys generated by rigid-body docking using cylindrical approximation of the protein-proteininterface and hierarchical clustering procedures. We demonstrate the clustering and reranking of 54,000 decoy structures generated by ZDOCK for each complex within a few minutes. After parameter tuning for the test set in ZDOCK benchmark 2.0 with the ZDOCK and ZRANK scoring functions, blind tests for the incremental data in ZDOCK benchmark 3.0 and 4.0 were conducted. CyClus successfully generated smaller subsets of decoys containing near-native decoys. For example, the number of decoys required to create subsets containing near-native decoys with 80% probability was reduced from 22% to 50% of the number required in the original ZDOCK. Although specific ZDOCK and ZRANK results were demonstrated, the CyClus algorithm was designed to be more general and can be applied to a wide range of decoys and scoring functions by adjusting just two parameters, p and T. CyClus results were also compared to those from ClusPro. Copyright © 2013 Wiley Periodicals, Inc.
Population entropies estimates of proteins
NASA Astrophysics Data System (ADS)
Low, Wai Yee
2017-05-01
The Shannon entropy equation provides a way to estimate variability of amino acids sequences in a multiple sequence alignment of proteins. Knowledge of protein variability is useful in many areas such as vaccine design, identification of antibody binding sites, and exploration of protein 3D structural properties. In cases where the population entropies of a protein are of interest but only a small sample size can be obtained, a method based on linear regression and random subsampling can be used to estimate the population entropy. This method is useful for comparisons of entropies where the actual sequence counts differ and thus, correction for alignment size bias is needed. In the current work, an R based package named EntropyCorrect that enables estimation of population entropy is presented and an empirical study on how well this new algorithm performs on simulated dataset of various combinations of population and sample sizes is discussed. The package is available at https://github.com/lloydlow/EntropyCorrect. This article, which was originally published online on 12 May 2017, contained an error in Eq. (1), where the summation sign was missing. The corrected equation appears in the Corrigendum attached to the pdf.
Krieger, Elmar; Dunbrack, Roland L; Hooft, Rob W W; Krieger, Barbara
2012-01-01
Among the many applications of molecular modeling, drug design is probably the one with the highest demands on the accuracy of the underlying structures. During lead optimization, the position of every atom in the binding site should ideally be known with high precision to identify those chemical modifications that are most likely to increase drug affinity. Unfortunately, X-ray crystallography at common resolution yields an electron density map that is too coarse, since the chemical elements and their protonation states cannot be fully resolved.This chapter describes the steps required to fill in the missing knowledge, by devising an algorithm that can detect and resolve the ambiguities. First, the pK (a) values of acidic and basic groups are predicted. Second, their potential protonation states are determined, including all permutations (considering for example protons that can jump between the oxygens of a phosphate group). Third, those groups of atoms are identified that can adopt alternative but indistinguishable conformations with essentially the same electron density. Fourth, potential hydrogen bond donors and acceptors are located. Finally, all these data are combined in a single "configuration energy function," whose global minimum is found with the SCWRL algorithm, which employs dead-end elimination and graph theory. As a result, one obtains a complete model of the protein and its bound ligand, with ambiguous groups rotated to the best orientation and with protonation states assigned considering the current pH and the H-bonding network. An implementation of the algorithm has been available since 2008 as part of the YASARA modeling & simulation program.
Dynamic X-ray diffraction sampling for protein crystal positioning
DOE Office of Scientific and Technical Information (OSTI.GOV)
Scarborough, Nicole M.; Godaliyadda, G. M. Dilshan P.; Ye, Dong Hye
A sparse supervised learning approach for dynamic sampling (SLADS) is described for dose reduction in diffraction-based protein crystal positioning. Crystal centering is typically a prerequisite for macromolecular diffraction at synchrotron facilities, with X-ray diffraction mapping growing in popularity as a mechanism for localization. In X-ray raster scanning, diffraction is used to identify the crystal positions based on the detection of Bragg-like peaks in the scattering patterns; however, this additional X-ray exposure may result in detectable damage to the crystal prior to data collection. Dynamic sampling, in which preceding measurements inform the next most information-rich location to probe for image reconstruction,more » significantly reduced the X-ray dose experienced by protein crystals during positioning by diffraction raster scanning. The SLADS algorithm implemented herein is designed for single-pixel measurements and can select a new location to measure. In each step of SLADS, the algorithm selects the pixel, which, when measured, maximizes the expected reduction in distortion given previous measurements. Ground-truth diffraction data were obtained for a 5 µm-diameter beam and SLADS reconstructed the image sampling 31% of the total volume and only 9% of the interior of the crystal greatly reducing the X-ray dosage on the crystal. Furthermore, by usingin situtwo-photon-excited fluorescence microscopy measurements as a surrogate for diffraction imaging with a 1 µm-diameter beam, the SLADS algorithm enabled image reconstruction from a 7% sampling of the total volume and 12% sampling of the interior of the crystal. When implemented into the beamline at Argonne National Laboratory, without ground-truth images, an acceptable reconstruction was obtained with 3% of the image sampled and approximately 5% of the crystal. The incorporation of SLADS into X-ray diffraction acquisitions has the potential to significantly minimize the impact of X-ray exposure on the crystal by limiting the dose and area exposed for image reconstruction and crystal positioning using data collection hardware present in most macromolecular crystallography end-stations.« less
Dynamic X-ray diffraction sampling for protein crystal positioning
Scarborough, Nicole M.; Godaliyadda, G. M. Dilshan P.; Ye, Dong Hye; ...
2017-01-01
A sparse supervised learning approach for dynamic sampling (SLADS) is described for dose reduction in diffraction-based protein crystal positioning. Crystal centering is typically a prerequisite for macromolecular diffraction at synchrotron facilities, with X-ray diffraction mapping growing in popularity as a mechanism for localization. In X-ray raster scanning, diffraction is used to identify the crystal positions based on the detection of Bragg-like peaks in the scattering patterns; however, this additional X-ray exposure may result in detectable damage to the crystal prior to data collection. Dynamic sampling, in which preceding measurements inform the next most information-rich location to probe for image reconstruction,more » significantly reduced the X-ray dose experienced by protein crystals during positioning by diffraction raster scanning. The SLADS algorithm implemented herein is designed for single-pixel measurements and can select a new location to measure. In each step of SLADS, the algorithm selects the pixel, which, when measured, maximizes the expected reduction in distortion given previous measurements. Ground-truth diffraction data were obtained for a 5 µm-diameter beam and SLADS reconstructed the image sampling 31% of the total volume and only 9% of the interior of the crystal greatly reducing the X-ray dosage on the crystal. Furthermore, by usingin situtwo-photon-excited fluorescence microscopy measurements as a surrogate for diffraction imaging with a 1 µm-diameter beam, the SLADS algorithm enabled image reconstruction from a 7% sampling of the total volume and 12% sampling of the interior of the crystal. When implemented into the beamline at Argonne National Laboratory, without ground-truth images, an acceptable reconstruction was obtained with 3% of the image sampled and approximately 5% of the crystal. The incorporation of SLADS into X-ray diffraction acquisitions has the potential to significantly minimize the impact of X-ray exposure on the crystal by limiting the dose and area exposed for image reconstruction and crystal positioning using data collection hardware present in most macromolecular crystallography end-stations.« less
Dynamic X-ray diffraction sampling for protein crystal positioning
Scarborough, Nicole M.; Godaliyadda, G. M. Dilshan P.; Ye, Dong Hye; Kissick, David J.; Zhang, Shijie; Newman, Justin A.; Sheedlo, Michael J.; Chowdhury, Azhad U.; Fischetti, Robert F.; Das, Chittaranjan; Buzzard, Gregery T.; Bouman, Charles A.; Simpson, Garth J.
2017-01-01
A sparse supervised learning approach for dynamic sampling (SLADS) is described for dose reduction in diffraction-based protein crystal positioning. Crystal centering is typically a prerequisite for macromolecular diffraction at synchrotron facilities, with X-ray diffraction mapping growing in popularity as a mechanism for localization. In X-ray raster scanning, diffraction is used to identify the crystal positions based on the detection of Bragg-like peaks in the scattering patterns; however, this additional X-ray exposure may result in detectable damage to the crystal prior to data collection. Dynamic sampling, in which preceding measurements inform the next most information-rich location to probe for image reconstruction, significantly reduced the X-ray dose experienced by protein crystals during positioning by diffraction raster scanning. The SLADS algorithm implemented herein is designed for single-pixel measurements and can select a new location to measure. In each step of SLADS, the algorithm selects the pixel, which, when measured, maximizes the expected reduction in distortion given previous measurements. Ground-truth diffraction data were obtained for a 5 µm-diameter beam and SLADS reconstructed the image sampling 31% of the total volume and only 9% of the interior of the crystal greatly reducing the X-ray dosage on the crystal. Using in situ two-photon-excited fluorescence microscopy measurements as a surrogate for diffraction imaging with a 1 µm-diameter beam, the SLADS algorithm enabled image reconstruction from a 7% sampling of the total volume and 12% sampling of the interior of the crystal. When implemented into the beamline at Argonne National Laboratory, without ground-truth images, an acceptable reconstruction was obtained with 3% of the image sampled and approximately 5% of the crystal. The incorporation of SLADS into X-ray diffraction acquisitions has the potential to significantly minimize the impact of X-ray exposure on the crystal by limiting the dose and area exposed for image reconstruction and crystal positioning using data collection hardware present in most macromolecular crystallography end-stations. PMID:28009558
Dynamic X-ray diffraction sampling for protein crystal positioning.
Scarborough, Nicole M; Godaliyadda, G M Dilshan P; Ye, Dong Hye; Kissick, David J; Zhang, Shijie; Newman, Justin A; Sheedlo, Michael J; Chowdhury, Azhad U; Fischetti, Robert F; Das, Chittaranjan; Buzzard, Gregery T; Bouman, Charles A; Simpson, Garth J
2017-01-01
A sparse supervised learning approach for dynamic sampling (SLADS) is described for dose reduction in diffraction-based protein crystal positioning. Crystal centering is typically a prerequisite for macromolecular diffraction at synchrotron facilities, with X-ray diffraction mapping growing in popularity as a mechanism for localization. In X-ray raster scanning, diffraction is used to identify the crystal positions based on the detection of Bragg-like peaks in the scattering patterns; however, this additional X-ray exposure may result in detectable damage to the crystal prior to data collection. Dynamic sampling, in which preceding measurements inform the next most information-rich location to probe for image reconstruction, significantly reduced the X-ray dose experienced by protein crystals during positioning by diffraction raster scanning. The SLADS algorithm implemented herein is designed for single-pixel measurements and can select a new location to measure. In each step of SLADS, the algorithm selects the pixel, which, when measured, maximizes the expected reduction in distortion given previous measurements. Ground-truth diffraction data were obtained for a 5 µm-diameter beam and SLADS reconstructed the image sampling 31% of the total volume and only 9% of the interior of the crystal greatly reducing the X-ray dosage on the crystal. Using in situ two-photon-excited fluorescence microscopy measurements as a surrogate for diffraction imaging with a 1 µm-diameter beam, the SLADS algorithm enabled image reconstruction from a 7% sampling of the total volume and 12% sampling of the interior of the crystal. When implemented into the beamline at Argonne National Laboratory, without ground-truth images, an acceptable reconstruction was obtained with 3% of the image sampled and approximately 5% of the crystal. The incorporation of SLADS into X-ray diffraction acquisitions has the potential to significantly minimize the impact of X-ray exposure on the crystal by limiting the dose and area exposed for image reconstruction and crystal positioning using data collection hardware present in most macromolecular crystallography end-stations.
Phase Response Design of Recursive All-Pass Digital Filters Using a Modified PSO Algorithm
2015-01-01
This paper develops a new design scheme for the phase response of an all-pass recursive digital filter. A variant of particle swarm optimization (PSO) algorithm will be utilized for solving this kind of filter design problem. It is here called the modified PSO (MPSO) algorithm in which another adjusting factor is more introduced in the velocity updating formula of the algorithm in order to improve the searching ability. In the proposed method, all of the designed filter coefficients are firstly collected to be a parameter vector and this vector is regarded as a particle of the algorithm. The MPSO with a modified velocity formula will force all particles into moving toward the optimal or near optimal solution by minimizing some defined objective function of the optimization problem. To show the effectiveness of the proposed method, two different kinds of linear phase response design examples are illustrated and the general PSO algorithm is compared as well. The obtained results show that the MPSO is superior to the general PSO for the phase response design of digital recursive all-pass filter. PMID:26366168
Predicting protein-protein interactions from protein domains using a set cover approach.
Huang, Chengbang; Morcos, Faruck; Kanaan, Simon P; Wuchty, Stefan; Chen, Danny Z; Izaguirre, Jesús A
2007-01-01
One goal of contemporary proteome research is the elucidation of cellular protein interactions. Based on currently available protein-protein interaction and domain data, we introduce a novel method, Maximum Specificity Set Cover (MSSC), for the prediction of protein-protein interactions. In our approach, we map the relationship between interactions of proteins and their corresponding domain architectures to a generalized weighted set cover problem. The application of a greedy algorithm provides sets of domain interactions which explain the presence of protein interactions to the largest degree of specificity. Utilizing domain and protein interaction data of S. cerevisiae, MSSC enables prediction of previously unknown protein interactions, links that are well supported by a high tendency of coexpression and functional homogeneity of the corresponding proteins. Focusing on concrete examples, we show that MSSC reliably predicts protein interactions in well-studied molecular systems, such as the 26S proteasome and RNA polymerase II of S. cerevisiae. We also show that the quality of the predictions is comparable to the Maximum Likelihood Estimation while MSSC is faster. This new algorithm and all data sets used are accessible through a Web portal at http://ppi.cse.nd.edu.
NASA Technical Reports Server (NTRS)
Izumi, K. H.; Thompson, J. L.; Groce, J. L.; Schwab, R. W.
1986-01-01
The design requirements for a 4D path definition algorithm are described. These requirements were developed for the NASA ATOPS as an extension of the Local Flow Management/Profile Descent algorithm. They specify the processing flow, functional and data architectures, and system input requirements, and recommended the addition of a broad path revision (reinitialization) function capability. The document also summarizes algorithm design enhancements and the implementation status of the algorithm on an in-house PDP-11/70 computer. Finally, the requirements for the pilot-computer interfaces, the lateral path processor, and guidance and steering function are described.
AlignNemo: a local network alignment method to integrate homology and topology.
Ciriello, Giovanni; Mina, Marco; Guzzi, Pietro H; Cannataro, Mario; Guerra, Concettina
2012-01-01
Local network alignment is an important component of the analysis of protein-protein interaction networks that may lead to the identification of evolutionary related complexes. We present AlignNemo, a new algorithm that, given the networks of two organisms, uncovers subnetworks of proteins that relate in biological function and topology of interactions. The discovered conserved subnetworks have a general topology and need not to correspond to specific interaction patterns, so that they more closely fit the models of functional complexes proposed in the literature. The algorithm is able to handle sparse interaction data with an expansion process that at each step explores the local topology of the networks beyond the proteins directly interacting with the current solution. To assess the performance of AlignNemo, we ran a series of benchmarks using statistical measures as well as biological knowledge. Based on reference datasets of protein complexes, AlignNemo shows better performance than other methods in terms of both precision and recall. We show our solutions to be biologically sound using the concept of semantic similarity applied to Gene Ontology vocabularies. The binaries of AlignNemo and supplementary details about the algorithms and the experiments are available at: sourceforge.net/p/alignnemo.
Semantic similarity analysis of protein data: assessment with biological features and issues.
Guzzi, Pietro H; Mina, Marco; Guerra, Concettina; Cannataro, Mario
2012-09-01
The integration of proteomics data with biological knowledge is a recent trend in bioinformatics. A lot of biological information is available and is spread on different sources and encoded in different ontologies (e.g. Gene Ontology). Annotating existing protein data with biological information may enable the use (and the development) of algorithms that use biological ontologies as framework to mine annotated data. Recently many methodologies and algorithms that use ontologies to extract knowledge from data, as well as to analyse ontologies themselves have been proposed and applied to other fields. Conversely, the use of such annotations for the analysis of protein data is a relatively novel research area that is currently becoming more and more central in research. Existing approaches span from the definition of the similarity among genes and proteins on the basis of the annotating terms, to the definition of novel algorithms that use such similarities for mining protein data on a proteome-wide scale. This work, after the definition of main concept of such analysis, presents a systematic discussion and comparison of main approaches. Finally, remaining challenges, as well as possible future directions of research are presented.
Leung, Kin K.; Hause, Ronald J.; Barkinge, John L.; Ciaccio, Mark F.; Chuu, Chih-Pin; Jones, Richard B.
2014-01-01
Many human diseases are associated with aberrant regulation of phosphoprotein signaling networks. Src homology 2 (SH2) domains represent the major class of protein domains in metazoans that interact with proteins phosphorylated on the amino acid residue tyrosine. Although current SH2 domain prediction algorithms perform well at predicting the sequences of phosphorylated peptides that are likely to result in the highest possible interaction affinity in the context of random peptide library screens, these algorithms do poorly at predicting the interaction potential of SH2 domains with physiologically derived protein sequences. We employed a high throughput interaction assay system to empirically determine the affinity between 93 human SH2 domains and phosphopeptides abstracted from several receptor tyrosine kinases and signaling proteins. The resulting interaction experiments revealed over 1000 novel peptide-protein interactions and provided a glimpse into the common and specific interaction potentials of c-Met, c-Kit, GAB1, and the human androgen receptor. We used these data to build a permutation-based logistic regression classifier that performed considerably better than existing algorithms for predicting the interaction potential of several SH2 domains. PMID:24728074
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zemla, A; Lang, D; Kostova, T
2010-11-29
Most of the currently used methods for protein function prediction rely on sequence-based comparisons between a query protein and those for which a functional annotation is provided. A serious limitation of sequence similarity-based approaches for identifying residue conservation among proteins is the low confidence in assigning residue-residue correspondences among proteins when the level of sequence identity between the compared proteins is poor. Multiple sequence alignment methods are more satisfactory - still, they cannot provide reliable results at low levels of sequence identity. Our goal in the current work was to develop an algorithm that could overcome these difficulties and facilitatemore » the identification of structurally (and possibly functionally) relevant residue-residue correspondences between compared protein structures. Here we present StralSV, a new algorithm for detecting closely related structure fragments and quantifying residue frequency from tight local structure alignments. We apply StralSV in a study of the RNA-dependent RNA polymerase of poliovirus and demonstrate that the algorithm can be used to determine regions of the protein that are relatively unique or that shared structural similarity with structures that are distantly related. By quantifying residue frequencies among many residue-residue pairs extracted from local alignments, one can infer potential structural or functional importance of specific residues that are determined to be highly conserved or that deviate from a consensus. We further demonstrate that considerable detailed structural and phylogenetic information can be derived from StralSV analyses. StralSV is a new structure-based algorithm for identifying and aligning structure fragments that have similarity to a reference protein. StralSV analysis can be used to quantify residue-residue correspondences and identify residues that may be of particular structural or functional importance, as well as unusual or unexpected residues at a given sequence position.« less
17 CFR Appendix A to Part 38 - Guidance on Compliance With Designation Criteria
Code of Federal Regulations, 2011 CFR
2011-04-01
...-matching algorithm and order entry procedures. An application involving a trade-matching algorithm that is... algorithm. (b) A designated contract market's specifications on initial and periodic objective testing and...
17 CFR Appendix A to Part 38 - Guidance on Compliance With Designation Criteria
Code of Federal Regulations, 2012 CFR
2012-04-01
...-matching algorithm and order entry procedures. An application involving a trade-matching algorithm that is... algorithm. (b) A designated contract market's specifications on initial and periodic objective testing and...
Improved Algorithm For Finite-Field Normal-Basis Multipliers
NASA Technical Reports Server (NTRS)
Wang, C. C.
1989-01-01
Improved algorithm reduces complexity of calculations that must precede design of Massey-Omura finite-field normal-basis multipliers, used in error-correcting-code equipment and cryptographic devices. Algorithm represents an extension of development reported in "Algorithm To Design Finite-Field Normal-Basis Multipliers" (NPO-17109), NASA Tech Briefs, Vol. 12, No. 5, page 82.
Blank-Landeshammer, Bernhard; Kollipara, Laxmikanth; Biß, Karsten; Pfenninger, Markus; Malchow, Sebastian; Shuvaev, Konstantin; Zahedi, René P; Sickmann, Albert
2017-09-01
Complex mass spectrometry based proteomics data sets are mostly analyzed by protein database searches. While this approach performs considerably well for sequenced organisms, direct inference of peptide sequences from tandem mass spectra, i.e., de novo peptide sequencing, oftentimes is the only way to obtain information when protein databases are absent. However, available algorithms suffer from drawbacks such as lack of validation and often high rates of false positive hits (FP). Here we present a simple method of combining results from commonly available de novo peptide sequencing algorithms, which in conjunction with minor tweaks in data acquisition ensues lower empirical FDR compared to the analysis using single algorithms. Results were validated using state-of-the art database search algorithms as well specifically synthesized reference peptides. Thus, we could increase the number of PSMs meeting a stringent FDR of 5% more than 3-fold compared to the single best de novo sequencing algorithm alone, accounting for an average of 11 120 PSMs (combined) instead of 3476 PSMs (alone) in triplicate 2 h LC-MS runs of tryptic HeLa digestion.
Non-uniform cosine modulated filter banks using meta-heuristic algorithms in CSD space.
Kalathil, Shaeen; Elias, Elizabeth
2015-11-01
This paper presents an efficient design of non-uniform cosine modulated filter banks (CMFB) using canonic signed digit (CSD) coefficients. CMFB has got an easy and efficient design approach. Non-uniform decomposition can be easily obtained by merging the appropriate filters of a uniform filter bank. Only the prototype filter needs to be designed and optimized. In this paper, the prototype filter is designed using window method, weighted Chebyshev approximation and weighted constrained least square approximation. The coefficients are quantized into CSD, using a look-up-table. The finite precision CSD rounding, deteriorates the filter bank performances. The performances of the filter bank are improved using suitably modified meta-heuristic algorithms. The different meta-heuristic algorithms which are modified and used in this paper are Artificial Bee Colony algorithm, Gravitational Search algorithm, Harmony Search algorithm and Genetic algorithm and they result in filter banks with less implementation complexity, power consumption and area requirements when compared with those of the conventional continuous coefficient non-uniform CMFB.
Non-uniform cosine modulated filter banks using meta-heuristic algorithms in CSD space
Kalathil, Shaeen; Elias, Elizabeth
2014-01-01
This paper presents an efficient design of non-uniform cosine modulated filter banks (CMFB) using canonic signed digit (CSD) coefficients. CMFB has got an easy and efficient design approach. Non-uniform decomposition can be easily obtained by merging the appropriate filters of a uniform filter bank. Only the prototype filter needs to be designed and optimized. In this paper, the prototype filter is designed using window method, weighted Chebyshev approximation and weighted constrained least square approximation. The coefficients are quantized into CSD, using a look-up-table. The finite precision CSD rounding, deteriorates the filter bank performances. The performances of the filter bank are improved using suitably modified meta-heuristic algorithms. The different meta-heuristic algorithms which are modified and used in this paper are Artificial Bee Colony algorithm, Gravitational Search algorithm, Harmony Search algorithm and Genetic algorithm and they result in filter banks with less implementation complexity, power consumption and area requirements when compared with those of the conventional continuous coefficient non-uniform CMFB. PMID:26644921
Recent progress and future directions in protein-protein docking.
Ritchie, David W
2008-02-01
This article gives an overview of recent progress in protein-protein docking and it identifies several directions for future research. Recent results from the CAPRI blind docking experiments show that docking algorithms are steadily improving in both reliability and accuracy. Current docking algorithms employ a range of efficient search and scoring strategies, including e.g. fast Fourier transform correlations, geometric hashing, and Monte Carlo techniques. These approaches can often produce a relatively small list of up to a few thousand orientations, amongst which a near-native binding mode is often observed. However, despite the use of improved scoring functions which typically include models of desolvation, hydrophobicity, and electrostatics, current algorithms still have difficulty in identifying the correct solution from the list of false positives, or decoys. Nonetheless, significant progress is being made through better use of bioinformatics, biochemical, and biophysical information such as e.g. sequence conservation analysis, protein interaction databases, alanine scanning, and NMR residual dipolar coupling restraints to help identify key binding residues. Promising new approaches to incorporate models of protein flexibility during docking are being developed, including the use of molecular dynamics snapshots, rotameric and off-rotamer searches, internal coordinate mechanics, and principal component analysis based techniques. Some investigators now use explicit solvent models in their docking protocols. Many of these approaches can be computationally intensive, although new silicon chip technologies such as programmable graphics processor units are beginning to offer competitive alternatives to conventional high performance computer systems. As cryo-EM techniques improve apace, docking NMR and X-ray protein structures into low resolution EM density maps is helping to bridge the resolution gap between these complementary techniques. The use of symmetry and fragment assembly constraints are also helping to make possible docking-based predictions of large multimeric protein complexes. In the near future, the closer integration of docking algorithms with protein interface prediction software, structural databases, and sequence analysis techniques should help produce better predictions of protein interaction networks and more accurate structural models of the fundamental molecular interactions within the cell.
NASA Astrophysics Data System (ADS)
Mahmood, Zakaria N.; Mahmuddin, Massudi; Mahmood, Mohammed Nooraldeen
Encoding proteins of amino acid sequence to predict classified into their respective families and subfamilies is important research area. However for a given protein, knowing the exact action whether hormonal, enzymatic, transmembranal or nuclear receptors does not depend solely on amino acid sequence but on the way the amino acid thread folds as well. This study provides a prototype system that able to predict a protein tertiary structure. Several methods are used to develop and evaluate the system to produce better accuracy in protein 3D structure prediction. The Bees Optimization algorithm which inspired from the honey bees food foraging method, is used in the searching phase. In this study, the experiment is conducted on short sequence proteins that have been used by the previous researches using well-known tools. The proposed approach shows a promising result.
Benchmarking Commercial Conformer Ensemble Generators.
Friedrich, Nils-Ole; de Bruyn Kops, Christina; Flachsenberg, Florian; Sommer, Kai; Rarey, Matthias; Kirchmair, Johannes
2017-11-27
We assess and compare the performance of eight commercial conformer ensemble generators (ConfGen, ConfGenX, cxcalc, iCon, MOE LowModeMD, MOE Stochastic, MOE Conformation Import, and OMEGA) and one leading free algorithm, the distance geometry algorithm implemented in RDKit. The comparative study is based on a new version of the Platinum Diverse Dataset, a high-quality benchmarking dataset of 2859 protein-bound ligand conformations extracted from the PDB. Differences in the performance of commercial algorithms are much smaller than those observed for free algorithms in our previous study (J. Chem. Inf. 2017, 57, 529-539). For commercial algorithms, the median minimum root-mean-square deviations measured between protein-bound ligand conformations and ensembles of a maximum of 250 conformers are between 0.46 and 0.61 Å. Commercial conformer ensemble generators are characterized by their high robustness, with at least 99% of all input molecules successfully processed and few or even no substantial geometrical errors detectable in their output conformations. The RDKit distance geometry algorithm (with minimization enabled) appears to be a good free alternative since its performance is comparable to that of the midranked commercial algorithms. Based on a statistical analysis, we elaborate on which algorithms to use and how to parametrize them for best performance in different application scenarios.
Optimizing high performance computing workflow for protein functional annotation.
Stanberry, Larissa; Rekepalli, Bhanu; Liu, Yuan; Giblock, Paul; Higdon, Roger; Montague, Elizabeth; Broomall, William; Kolker, Natali; Kolker, Eugene
2014-09-10
Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas a high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optmized automated workflow to enable large-scale protein annotation. The workflow uses high performance computing architectures and a low complexity classification algorithm to assign proteins into existing clusters of orthologous groups of proteins. On the basis of the Position-Specific Iterative Basic Local Alignment Search Tool the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data.
Optimizing high performance computing workflow for protein functional annotation
Stanberry, Larissa; Rekepalli, Bhanu; Liu, Yuan; Giblock, Paul; Higdon, Roger; Montague, Elizabeth; Broomall, William; Kolker, Natali; Kolker, Eugene
2014-01-01
Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas a high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optmized automated workflow to enable large-scale protein annotation. The workflow uses high performance computing architectures and a low complexity classification algorithm to assign proteins into existing clusters of orthologous groups of proteins. On the basis of the Position-Specific Iterative Basic Local Alignment Search Tool the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data. PMID:25313296
CAD system for footwear design based on whole real 3D data of last surface
NASA Astrophysics Data System (ADS)
Song, Wanzhong; Su, Xianyu
2000-10-01
Two major parts of application of CAD in footwear design are studied: the development of last surface; computer-aided design of planar shoe-template. A new quasi-experiential development algorithm of last surface based on triangulation approximation is presented. This development algorithm consumes less time and does not need any interactive operation for precisely development compared with other development algorithm of last surface. Based on this algorithm, a software, SHOEMAKERTM, which contains computer aided automatic measurement, automatic development of last surface and computer aide design of shoe-template has been developed.
Molecular Docking and Drug Discovery in β-Adrenergic Receptors.
Vilar, Santiago; Sobarzo-Sanchez, Eduardo; Santana, Lourdes; Uriarte, Eugenio
2017-01-01
Evolution in computer engineering, availability of increasing amounts of data and the development of new and fast docking algorithms and software have led to improved molecular simulations with crucial applications in virtual high-throughput screening and drug discovery. Moreover, analysis of protein-ligand recognition through molecular docking has become a valuable tool in drug design. In this review, we focus on the applicability of molecular docking on a particular class of G protein-coupled receptors: the β-adrenergic receptors, which are relevant targets in clinic for the treatment of asthma and cardiovascular diseases. We describe the binding site in β-adrenergic receptors to understand key factors in ligand recognition along with the proteins activation process. Moreover, we focus on the discovery of new lead compounds that bind the receptors, on the evaluation of virtual screening using the active/ inactive binding site states, and on the structural optimization of known families of binders to improve β-adrenergic affinity. We also discussed strengths and challenges related to the applicability of molecular docking in β-adrenergic receptors. Molecular docking is a valuable technique in computational chemistry to deeply analyze ligand recognition and has led to important breakthroughs in drug discovery and design in the field of β-adrenergic receptors. Copyright© Bentham Science Publishers; For any queries, please email at epub@benthamscience.org.
SCENERY: a web application for (causal) network reconstruction from cytometry data
Papoutsoglou, Georgios; Athineou, Giorgos; Lagani, Vincenzo; Xanthopoulos, Iordanis; Schmidt, Angelika; Éliás, Szabolcs; Tegnér, Jesper
2017-01-01
Abstract Flow and mass cytometry technologies can probe proteins as biological markers in thousands of individual cells simultaneously, providing unprecedented opportunities for reconstructing networks of protein interactions through machine learning algorithms. The network reconstruction (NR) problem has been well-studied by the machine learning community. However, the potentials of available methods remain largely unknown to the cytometry community, mainly due to their intrinsic complexity and the lack of comprehensive, powerful and easy-to-use NR software implementations specific for cytometry data. To bridge this gap, we present Single CEll NEtwork Reconstruction sYstem (SCENERY), a web server featuring several standard and advanced cytometry data analysis methods coupled with NR algorithms in a user-friendly, on-line environment. In SCENERY, users may upload their data and set their own study design. The server offers several data analysis options categorized into three classes of methods: data (pre)processing, statistical analysis and NR. The server also provides interactive visualization and download of results as ready-to-publish images or multimedia reports. Its core is modular and based on the widely-used and robust R platform allowing power users to extend its functionalities by submitting their own NR methods. SCENERY is available at scenery.csd.uoc.gr or http://mensxmachina.org/en/software/. PMID:28525568
AGGRESCAN3D (A3D): server for prediction of aggregation properties of protein structures.
Zambrano, Rafael; Jamroz, Michal; Szczasiuk, Agata; Pujols, Jordi; Kmiecik, Sebastian; Ventura, Salvador
2015-07-01
Protein aggregation underlies an increasing number of disorders and constitutes a major bottleneck in the development of therapeutic proteins. Our present understanding on the molecular determinants of protein aggregation has crystalized in a series of predictive algorithms to identify aggregation-prone sites. A majority of these methods rely only on sequence. Therefore, they find difficulties to predict the aggregation properties of folded globular proteins, where aggregation-prone sites are often not contiguous in sequence or buried inside the native structure. The AGGRESCAN3D (A3D) server overcomes these limitations by taking into account the protein structure and the experimental aggregation propensity scale from the well-established AGGRESCAN method. Using the A3D server, the identified aggregation-prone residues can be virtually mutated to design variants with increased solubility, or to test the impact of pathogenic mutations. Additionally, A3D server enables to take into account the dynamic fluctuations of protein structure in solution, which may influence aggregation propensity. This is possible in A3D Dynamic Mode that exploits the CABS-flex approach for the fast simulations of flexibility of globular proteins. The A3D server can be accessed at http://biocomp.chem.uw.edu.pl/A3D/. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
Gradient gravitational search: An efficient metaheuristic algorithm for global optimization.
Dash, Tirtharaj; Sahu, Prabhat K
2015-05-30
The adaptation of novel techniques developed in the field of computational chemistry to solve the concerned problems for large and flexible molecules is taking the center stage with regard to efficient algorithm, computational cost and accuracy. In this article, the gradient-based gravitational search (GGS) algorithm, using analytical gradients for a fast minimization to the next local minimum has been reported. Its efficiency as metaheuristic approach has also been compared with Gradient Tabu Search and others like: Gravitational Search, Cuckoo Search, and Back Tracking Search algorithms for global optimization. Moreover, the GGS approach has also been applied to computational chemistry problems for finding the minimal value potential energy of two-dimensional and three-dimensional off-lattice protein models. The simulation results reveal the relative stability and physical accuracy of protein models with efficient computational cost. © 2015 Wiley Periodicals, Inc.
A novel in silico approach to drug discovery via computational intelligence.
Hecht, David; Fogel, Gary B
2009-04-01
A computational intelligence drug discovery platform is introduced as an innovative technology designed to accelerate high-throughput drug screening for generalized protein-targeted drug discovery. This technology results in collections of novel small molecule compounds that bind to protein targets as well as details on predicted binding modes and molecular interactions. The approach was tested on dihydrofolate reductase (DHFR) for novel antimalarial drug discovery; however, the methods developed can be applied broadly in early stage drug discovery and development. For this purpose, an initial fragment library was defined, and an automated fragment assembly algorithm was generated. These were combined with a computational intelligence screening tool for prescreening of compounds relative to DHFR inhibition. The entire method was assayed relative to spaces of known DHFR inhibitors and with chemical feasibility in mind, leading to experimental validation in future studies.
NASA Technical Reports Server (NTRS)
Roth, J. P.
1972-01-01
The following problems are considered: (1) methods for development of logic design together with algorithms, so that it is possible to compute a test for any failure in the logic design, if such a test exists, and developing algorithms and heuristics for the purpose of minimizing the computation for tests; and (2) a method of design of logic for ultra LSI (large scale integration). It was discovered that the so-called quantum calculus can be extended to render it possible: (1) to describe the functional behavior of a mechanism component by component, and (2) to compute tests for failures, in the mechanism, using the diagnosis algorithm. The development of an algorithm for the multioutput two-level minimization problem is presented and the program MIN 360 was written for this algorithm. The program has options of mode (exact minimum or various approximations), cost function, cost bound, etc., providing flexibility.
Chanona, J; Ribes, J; Seco, A; Ferrer, J
2006-01-01
This paper presents a model-knowledge based algorithm for optimising the primary sludge fermentation process design and operation. This is a recently used method to obtain the volatile fatty acids (VFA), needed to improve biological nutrient removal processes, directly from the raw wastewater. The proposed algorithm consists in a heuristic reasoning algorithm based on the expert knowledge of the process. Only effluent VFA and the sludge blanket height (SBH) have to be set as design criteria, and the optimisation algorithm obtains the minimum return sludge and waste sludge flow rates which fulfil those design criteria. A pilot plant fed with municipal raw wastewater was operated in order to obtain experimental results supporting the developed algorithm groundwork. The experimental results indicate that when SBH was increased, higher solids retention time was obtained in the settler and VFA production increased. Higher recirculation flow-rates resulted in higher VFA production too. Finally, the developed algorithm has been tested by simulating different design conditions with very good results. It has been able to find the optimal operation conditions in all cases on which preset design conditions could be achieved. Furthermore, this is a general algorithm that can be applied to any fermentation-elutriation scheme with or without fermentation reactor.
Combinatorial algorithms for design of DNA arrays.
Hannenhalli, Sridhar; Hubell, Earl; Lipshutz, Robert; Pevzner, Pavel A
2002-01-01
Optimal design of DNA arrays requires the development of algorithms with two-fold goals: reducing the effects caused by unintended illumination (border length minimization problem) and reducing the complexity of masks (mask decomposition problem). We describe algorithms that reduce the number of rectangles in mask decomposition by 20-30% as compared to a standard array design under the assumption that the arrangement of oligonucleotides on the array is fixed. This algorithm produces provably optimal solution for all studied real instances of array design. We also address the difficult problem of finding an arrangement which minimizes the border length and come up with a new idea of threading that significantly reduces the border length as compared to standard designs.
Mallik, Saurav; Bhadra, Tapas; Mukherji, Ayan; Mallik, Saurav; Bhadra, Tapas; Mukherji, Ayan; Mallik, Saurav; Bhadra, Tapas; Mukherji, Ayan
2018-04-01
Association rule mining is an important technique for identifying interesting relationships between gene pairs in a biological data set. Earlier methods basically work for a single biological data set, and, in maximum cases, a single minimum support cutoff can be applied globally, i.e., across all genesets/itemsets. To overcome this limitation, in this paper, we propose dynamic threshold-based FP-growth rule mining algorithm that integrates gene expression, methylation and protein-protein interaction profiles based on weighted shortest distance to find the novel associations among different pairs of genes in multi-view data sets. For this purpose, we introduce three new thresholds, namely, Distance-based Variable/Dynamic Supports (DVS), Distance-based Variable Confidences (DVC), and Distance-based Variable Lifts (DVL) for each rule by integrating co-expression, co-methylation, and protein-protein interactions existed in the multi-omics data set. We develop the proposed algorithm utilizing these three novel multiple threshold measures. In the proposed algorithm, the values of , , and are computed for each rule separately, and subsequently it is verified whether the support, confidence, and lift of each evolved rule are greater than or equal to the corresponding individual , , and values, respectively, or not. If all these three conditions for a rule are found to be true, the rule is treated as a resultant rule. One of the major advantages of the proposed method compared with other related state-of-the-art methods is that it considers both the quantitative and interactive significance among all pairwise genes belonging to each rule. Moreover, the proposed method generates fewer rules, takes less running time, and provides greater biological significance for the resultant top-ranking rules compared to previous methods.
An ensemble framework for clustering protein-protein interaction networks.
Asur, Sitaram; Ucar, Duygu; Parthasarathy, Srinivasan
2007-07-01
Protein-Protein Interaction (PPI) networks are believed to be important sources of information related to biological processes and complex metabolic functions of the cell. The presence of biologically relevant functional modules in these networks has been theorized by many researchers. However, the application of traditional clustering algorithms for extracting these modules has not been successful, largely due to the presence of noisy false positive interactions as well as specific topological challenges in the network. In this article, we propose an ensemble clustering framework to address this problem. For base clustering, we introduce two topology-based distance metrics to counteract the effects of noise. We develop a PCA-based consensus clustering technique, designed to reduce the dimensionality of the consensus problem and yield informative clusters. We also develop a soft consensus clustering variant to assign multifaceted proteins to multiple functional groups. We conduct an empirical evaluation of different consensus techniques using topology-based, information theoretic and domain-specific validation metrics and show that our approaches can provide significant benefits over other state-of-the-art approaches. Our analysis of the consensus clusters obtained demonstrates that ensemble clustering can (a) produce improved biologically significant functional groupings; and (b) facilitate soft clustering by discovering multiple functional associations for proteins. Supplementary data are available at Bioinformatics online.
Experimental designs for detecting synergy and antagonism between two drugs in a pre-clinical study.
Sperrin, Matthew; Thygesen, Helene; Su, Ting-Li; Harbron, Chris; Whitehead, Anne
2015-01-01
The identification of synergistic interactions between combinations of drugs is an important area within drug discovery and development. Pre-clinically, large numbers of screening studies to identify synergistic pairs of compounds can often be ran, necessitating efficient and robust experimental designs. We consider experimental designs for detecting interaction between two drugs in a pre-clinical in vitro assay in the presence of uncertainty of the monotherapy response. The monotherapies are assumed to follow the Hill equation with common lower and upper asymptotes, and a common variance. The optimality criterion used is the variance of the interaction parameter. We focus on ray designs and investigate two algorithms for selecting the optimum set of dose combinations. The first is a forward algorithm in which design points are added sequentially. This is found to give useful solutions in simple cases but can lack robustness when knowledge about the monotherapy parameters is insufficient. The second algorithm is a more pragmatic approach where the design points are constrained to be distributed log-normally along the rays and monotherapy doses. We find that the pragmatic algorithm is more stable than the forward algorithm, and even when the forward algorithm has converged, the pragmatic algorithm can still out-perform it. Practically, we find that good designs for detecting an interaction have equal numbers of points on monotherapies and combination therapies, with those points typically placed in positions where a 50% response is expected. More uncertainty in monotherapy parameters leads to an optimal design with design points that are more spread out. Copyright © 2015 John Wiley & Sons, Ltd.
Host-Guest Complexes with Protein-Ligand-Like Affinities: Computational Analysis and Design
Moghaddam, Sarvin; Inoue, Yoshihisa
2009-01-01
It has recently been discovered that guests combining a nonpolar core with cationic substituents bind cucurbit[7]uril (CB[7]) in water with ultra-high affinities. The present study uses the Mining Minima algorithm to study the physics of these extraordinary associations and to computationally test a new series of CB[7] ligands designed to bind with similarly high affinity. The calculations reproduce key experimental observations regarding the affinities of ferrocene-based guests with CB[7] and β-cyclodextrin and provide a coherent view of the roles of electrostatics and configurational entropy as determinants of affinity in these systems. The newly designed series of compounds is based on a bicyclo[2.2.2]octane core, which is similar in size and polarity to the ferrocene core of the existing series. Mining Minima predicts that these new compounds will, like the ferrocenes, bind CB[7] with extremely high affinities. PMID:19133781
Finding undetected protein associations in cell signaling by belief propagation.
Bailly-Bechet, M; Borgs, C; Braunstein, A; Chayes, J; Dagkessamanskaia, A; François, J-M; Zecchina, R
2011-01-11
External information propagates in the cell mainly through signaling cascades and transcriptional activation, allowing it to react to a wide spectrum of environmental changes. High-throughput experiments identify numerous molecular components of such cascades that may, however, interact through unknown partners. Some of them may be detected using data coming from the integration of a protein-protein interaction network and mRNA expression profiles. This inference problem can be mapped onto the problem of finding appropriate optimal connected subgraphs of a network defined by these datasets. The optimization procedure turns out to be computationally intractable in general. Here we present a new distributed algorithm for this task, inspired from statistical physics, and apply this scheme to alpha factor and drug perturbations data in yeast. We identify the role of the COS8 protein, a member of a gene family of previously unknown function, and validate the results by genetic experiments. The algorithm we present is specially suited for very large datasets, can run in parallel, and can be adapted to other problems in systems biology. On renowned benchmarks it outperforms other algorithms in the field.
QuickProbs 2: Towards rapid construction of high-quality alignments of large protein families
Gudyś, Adam; Deorowicz, Sebastian
2017-01-01
The ever-increasing size of sequence databases caused by the development of high throughput sequencing, poses to multiple alignment algorithms one of the greatest challenges yet. As we show, well-established techniques employed for increasing alignment quality, i.e., refinement and consistency, are ineffective when large protein families are investigated. We present QuickProbs 2, an algorithm for multiple sequence alignment. Based on probabilistic models, equipped with novel column-oriented refinement and selective consistency, it offers outstanding accuracy. When analysing hundreds of sequences, Quick-Probs 2 is noticeably better than ClustalΩ and MAFFT, the previous leaders for processing numerous protein families. In the case of smaller sets, for which consistency-based methods are the best performing, QuickProbs 2 is also superior to the competitors. Due to low computational requirements of selective consistency and utilization of massively parallel architectures, presented algorithm has similar execution times to ClustalΩ, and is orders of magnitude faster than full consistency approaches, like MSAProbs or PicXAA. All these make QuickProbs 2 an excellent tool for aligning families ranging from few, to hundreds of proteins. PMID:28139687
Advanced rotorcraft control using parameter optimization
NASA Technical Reports Server (NTRS)
Vansteenwyk, Brett; Ly, Uy-Loi
1991-01-01
A reliable algorithm for the evaluation of a quadratic performance index and its gradients with respect to the controller design parameters is presented. The algorithm is part of a design algorithm for an optimal linear dynamic output feedback controller that minimizes a finite time quadratic performance index. The numerical scheme is particularly robust when it is applied to the control law synthesis for systems with densely packed modes and where there is a high likelihood of encountering degeneracies in the closed loop eigensystem. This approach through the use of a accurate Pade series approximation does not require the closed loop system matrix to be diagonalizable. The algorithm has been included in a control design package for optimal robust low order controllers. Usefulness of the proposed numerical algorithm has been demonstrated using numerous practical design cases where degeneracies occur frequently in the closed loop system under an arbitrary controller design initialization and during the numerical search.
Scoring of Side-Chain Packings: An Analysis of Weight Factors and Molecular Dynamics Structures.
Colbes, Jose; Aguila, Sergio A; Brizuela, Carlos A
2018-02-26
The protein side-chain packing problem (PSCPP) is a central task in computational protein design. The problem is usually modeled as a combinatorial optimization problem, which consists of searching for a set of rotamers, from a given rotamer library, that minimizes a scoring function (SF). The SF is a weighted sum of terms, that can be decomposed in physics-based and knowledge-based terms. Although there are many methods to obtain approximate solutions for this problem, all of them have similar performances and there has not been a significant improvement in recent years. Studies on protein structure prediction and protein design revealed the limitations of current SFs to achieve further improvements for these two problems. In the same line, a recent work reported a similar result for the PSCPP. In this work, we ask whether or not this negative result regarding further improvements in performance is due to (i) an incorrect weighting of the SFs terms or (ii) the constrained conformation resulting from the protein crystallization process. To analyze these questions, we (i) model the PSCPP as a bi-objective combinatorial optimization problem, optimizing, at the same time, the two most important terms of two SFs of state-of-the-art algorithms and (ii) performed a preprocessing relaxation of the crystal structure through molecular dynamics to simulate the protein in the solvent and evaluated the performance of these two state-of-the-art SFs under these conditions. Our results indicate that (i) no matter what combination of weight factors we use the current SFs will not lead to better performances and (ii) the evaluated SFs will not be able to improve performance on relaxed structures. Furthermore, the experiments revealed that the SFs and the methods are biased toward crystallized structures.
SIMS: A Hybrid Method for Rapid Conformational Analysis
Gipson, Bryant; Moll, Mark; Kavraki, Lydia E.
2013-01-01
Proteins are at the root of many biological functions, often performing complex tasks as the result of large changes in their structure. Describing the exact details of these conformational changes, however, remains a central challenge for computational biology due the enormous computational requirements of the problem. This has engendered the development of a rich variety of useful methods designed to answer specific questions at different levels of spatial, temporal, and energetic resolution. These methods fall largely into two classes: physically accurate, but computationally demanding methods and fast, approximate methods. We introduce here a new hybrid modeling tool, the Structured Intuitive Move Selector (sims), designed to bridge the divide between these two classes, while allowing the benefits of both to be seamlessly integrated into a single framework. This is achieved by applying a modern motion planning algorithm, borrowed from the field of robotics, in tandem with a well-established protein modeling library. sims can combine precise energy calculations with approximate or specialized conformational sampling routines to produce rapid, yet accurate, analysis of the large-scale conformational variability of protein systems. Several key advancements are shown, including the abstract use of generically defined moves (conformational sampling methods) and an expansive probabilistic conformational exploration. We present three example problems that sims is applied to and demonstrate a rapid solution for each. These include the automatic determination of “active” residues for the hinge-based system Cyanovirin-N, exploring conformational changes involving long-range coordinated motion between non-sequential residues in Ribose-Binding Protein, and the rapid discovery of a transient conformational state of Maltose-Binding Protein, previously only determined by Molecular Dynamics. For all cases we provide energetic validations using well-established energy fields, demonstrating this framework as a fast and accurate tool for the analysis of a wide range of protein flexibility problems. PMID:23935893
Particle swarm optimization: an alternative in marine propeller optimization?
NASA Astrophysics Data System (ADS)
Vesting, F.; Bensow, R. E.
2018-01-01
This article deals with improving and evaluating the performance of two evolutionary algorithm approaches for automated engineering design optimization. Here a marine propeller design with constraints on cavitation nuisance is the intended application. For this purpose, the particle swarm optimization (PSO) algorithm is adapted for multi-objective optimization and constraint handling for use in propeller design. Three PSO algorithms are developed and tested for the optimization of four commercial propeller designs for different ship types. The results are evaluated by interrogating the generation medians and the Pareto front development. The same propellers are also optimized utilizing the well established NSGA-II genetic algorithm to provide benchmark results. The authors' PSO algorithms deliver comparable results to NSGA-II, but converge earlier and enhance the solution in terms of constraints violation.
Chandrasekaran, Srinivas Niranj; Das, Jhuma; Dokholyan, Nikolay V.; Carter, Charles W.
2016-01-01
PATH rapidly computes a path and a transition state between crystal structures by minimizing the Onsager-Machlup action. It requires input parameters whose range of values can generate different transition-state structures that cannot be uniquely compared with those generated by other methods. We outline modifications to estimate these input parameters to circumvent these difficulties and validate the PATH transition states by showing consistency between transition-states derived by different algorithms for unrelated protein systems. Although functional protein conformational change trajectories are to a degree stochastic, they nonetheless pass through a well-defined transition state whose detailed structural properties can rapidly be identified using PATH. PMID:26958584
We and others have shown that transition and maintenance of biological states is controlled by master regulator proteins, which can be inferred by interrogating tissue-specific regulatory models (interactomes) with transcriptional signatures, using the VIPER algorithm. Yet, some tissues may lack molecular profiles necessary for interactome inference (orphan tissues), or, as for single cells isolated from heterogeneous samples, their tissue context may be undetermined.
VASP-E: Specificity Annotation with a Volumetric Analysis of Electrostatic Isopotentials
Chen, Brian Y.
2014-01-01
Algorithms for comparing protein structure are frequently used for function annotation. By searching for subtle similarities among very different proteins, these algorithms can identify remote homologs with similar biological functions. In contrast, few comparison algorithms focus on specificity annotation, where the identification of subtle differences among very similar proteins can assist in finding small structural variations that create differences in binding specificity. Few specificity annotation methods consider electrostatic fields, which play a critical role in molecular recognition. To fill this gap, this paper describes VASP-E (Volumetric Analysis of Surface Properties with Electrostatics), a novel volumetric comparison tool based on the electrostatic comparison of protein-ligand and protein-protein binding sites. VASP-E exploits the central observation that three dimensional solids can be used to fully represent and compare both electrostatic isopotentials and molecular surfaces. With this integrated representation, VASP-E is able to dissect the electrostatic environments of protein-ligand and protein-protein binding interfaces, identifying individual amino acids that have an electrostatic influence on binding specificity. VASP-E was used to examine a nonredundant subset of the serine and cysteine proteases as well as the barnase-barstar and Rap1a-raf complexes. Based on amino acids established by various experimental studies to have an electrostatic influence on binding specificity, VASP-E identified electrostatically influential amino acids with 100% precision and 83.3% recall. We also show that VASP-E can accurately classify closely related ligand binding cavities into groups with different binding preferences. These results suggest that VASP-E should prove a useful tool for the characterization of specific binding and the engineering of binding preferences in proteins. PMID:25166865
DOE Office of Scientific and Technical Information (OSTI.GOV)
Xi, T; Jones, I M; Mohrenweiser, H W
2003-11-03
Over 520 different amino acid substitution variants have been previously identified in the systematic screening of 91 human DNA repair genes for sequence variation. Two algorithms were employed to predict the impact of these amino acid substitutions on protein activity. Sorting Intolerant From Tolerant (SIFT) classified 226 of 508 variants (44%) as ''Intolerant''. Polymorphism Phenotyping (PolyPhen) classed 165 of 489 amino acid substitutions (34%) as ''Probably or Possibly Damaging''. Another 9-15% of the variants were classed as ''Potentially Intolerant or Damaging''. The results from the two algorithms are highly associated, with concordance in predicted impact observed for {approx}62% of themore » variants. Twenty one to thirty one percent of the variant proteins are predicted to exhibit reduced activity by both algorithms. These variants occur at slightly lower individual allele frequency than do the variants classified as ''Tolerant'' or ''Benign''. Both algorithms correctly predicted the impact of 26 functionally characterized amino acid substitutions in the APE1 protein on biochemical activity, with one exception. It is concluded that a substantial fraction of the missense variants observed in the general human population are functionally relevant. These variants are expected to be the molecular genetic and biochemical basis for the associations of reduced DNA repair capacity phenotypes with elevated cancer risk.« less
Zhang, Jian; Suo, Yan; Liu, Min; Xu, Xun
2018-06-01
Proliferative diabetic retinopathy (PDR) is one of the most common complications of diabetes and can lead to blindness. Proteomic studies have provided insight into the pathogenesis of PDR and a series of PDR-related genes has been identified but are far from fully characterized because the experimental methods are expensive and time consuming. In our previous study, we successfully identified 35 candidate PDR-related genes through the shortest-path algorithm. In the current study, we developed a computational method using the random walk with restart (RWR) algorithm and the protein-protein interaction (PPI) network to identify potential PDR-related genes. After some possible genes were obtained by the RWR algorithm, a three-stage filtration strategy, which includes the permutation test, interaction test and enrichment test, was applied to exclude potential false positives caused by the structure of PPI network, the poor interaction strength, and the limited similarity on gene ontology (GO) terms and biological pathways. As a result, 36 candidate genes were discovered by the method which was different from the 35 genes reported in our previous study. A literature review showed that 21 of these 36 genes are supported by previous experiments. These findings suggest the robustness and complementary effects of both our efforts using different computational methods, thus providing an alternative method to study PDR pathogenesis. Copyright © 2017 Elsevier B.V. All rights reserved.
2014-01-01
Background Protein-protein docking is an in silico method to predict the formation of protein complexes. Due to limited computational resources, the protein-protein docking approach has been developed under the assumption of rigid docking, in which one of the two protein partners remains rigid during the protein associations and water contribution is ignored or implicitly presented. Despite obtaining a number of acceptable complex predictions, it seems to-date that most initial rigid docking algorithms still find it difficult or even fail to discriminate successfully the correct predictions from the other incorrect or false positive ones. To improve the rigid docking results, re-ranking is one of the effective methods that help re-locate the correct predictions in top high ranks, discriminating them from the other incorrect ones. In this paper, we propose a new re-ranking technique using a new energy-based scoring function, namely IFACEwat - a combined Interface Atomic Contact Energy (IFACE) and water effect. The IFACEwat aims to further improve the discrimination of the near-native structures of the initial rigid docking algorithm ZDOCK3.0.2. Unlike other re-ranking techniques, the IFACEwat explicitly implements interfacial water into the protein interfaces to account for the water-mediated contacts during the protein interactions. Results Our results showed that the IFACEwat increased both the numbers of the near-native structures and improved their ranks as compared to the initial rigid docking ZDOCK3.0.2. In fact, the IFACEwat achieved a success rate of 83.8% for Antigen/Antibody complexes, which is 10% better than ZDOCK3.0.2. As compared to another re-ranking technique ZRANK, the IFACEwat obtains success rates of 92.3% (8% better) and 90% (5% better) respectively for medium and difficult cases. When comparing with the latest published re-ranking method F2Dock, the IFACEwat performed equivalently well or even better for several Antigen/Antibody complexes. Conclusions With the inclusion of interfacial water, the IFACEwat improves mostly results of the initial rigid docking, especially for Antigen/Antibody complexes. The improvement is achieved by explicitly taking into account the contribution of water during the protein interactions, which was ignored or not fully presented by the initial rigid docking and other re-ranking techniques. In addition, the IFACEwat maintains sufficient computational efficiency of the initial docking algorithm, yet improves the ranks as well as the number of the near native structures found. As our implementation so far targeted to improve the results of ZDOCK3.0.2, and particularly for the Antigen/Antibody complexes, it is expected in the near future that more implementations will be conducted to be applicable for other initial rigid docking algorithms. PMID:25521441
ten Cate, R; Lankester, A C; van der Straaten, P J C; van Suijlekom-Smit, L W A; Wit, J M
2002-06-29
Acute, non-traumatic joint complaints during childhood can be caused by conditions which require a quick and adequate recognition and treatment as well as by conditions in which an expectant policy can be pursued. On the basis of certain data from the anamnesis, supplemented with findings from the physical examination it is often possible to arrive at a (probable) diagnosis. An algorithm was designed, the differential steps of which were: fever, C-reactive protein titre, involvement of the hip joint, the presence of extra-articular manifestations and the results of a full blood count, erythrocyte sedimentation rate and imaging techniques. When this algorithm was retrospectively applied to the disease data of 115 children with acute, non-traumatic joint complaints, for whom the diagnosis in the status was taken as the gold standard, the correct diagnosis was established for every single child: for 98 (85.2%) by the shortest route and for 17 (14.8%) indirectly. In the case of 4 children, use of this algorithm would have led to unnecessary laboratory investigations and/or treatment. None of the diseases requiring immediate treatment were missed.
A firefly algorithm for optimum design of new-generation beams
NASA Astrophysics Data System (ADS)
Erdal, F.
2017-06-01
This research addresses the minimum weight design of new-generation steel beams with sinusoidal openings using a metaheuristic search technique, namely the firefly method. The proposed algorithm is also used to compare the optimum design results of sinusoidal web-expanded beams with steel castellated and cellular beams. Optimum design problems of all beams are formulated according to the design limitations stipulated by the Steel Construction Institute. The design methods adopted in these publications are consistent with BS 5950 specifications. The formulation of the design problem considering the above-mentioned limitations turns out to be a discrete programming problem. The design algorithms based on the technique select the optimum universal beam sections, dimensional properties of sinusoidal, hexagonal and circular holes, and the total number of openings along the beam as design variables. Furthermore, this selection is also carried out such that the behavioural limitations are satisfied. Numerical examples are presented, where the suggested algorithm is implemented to achieve the minimum weight design of these beams subjected to loading combinations.
Automatic protein structure solution from weak X-ray data
NASA Astrophysics Data System (ADS)
Skubák, Pavol; Pannu, Navraj S.
2013-11-01
Determining new protein structures from X-ray diffraction data at low resolution or with a weak anomalous signal is a difficult and often an impossible task. Here we propose a multivariate algorithm that simultaneously combines the structure determination steps. In tests on over 140 real data sets from the protein data bank, we show that this combined approach can automatically build models where current algorithms fail, including an anisotropically diffracting 3.88 Å RNA polymerase II data set. The method seamlessly automates the process, is ideal for non-specialists and provides a mathematical framework for successfully combining various sources of information in image processing.
An algorithm for the design and tuning of RF accelerating structures with variable cell lengths
NASA Astrophysics Data System (ADS)
Lal, Shankar; Pant, K. K.
2018-05-01
An algorithm is proposed for the design of a π mode standing wave buncher structure with variable cell lengths. It employs a two-parameter, multi-step approach for the design of the structure with desired resonant frequency and field flatness. The algorithm, along with analytical scaling laws for the design of the RF power coupling slot, makes it possible to accurately design the structure employing a freely available electromagnetic code like SUPERFISH. To compensate for machining errors, a tuning method has been devised to achieve desired RF parameters for the structure, which has been qualified by the successful tuning of a 7-cell buncher to π mode frequency of 2856 MHz with field flatness <3% and RF coupling coefficient close to unity. The proposed design algorithm and tuning method have demonstrated the feasibility of developing an S-band accelerating structure for desired RF parameters with a relatively relaxed machining tolerance of ∼ 25 μm. This paper discusses the algorithm for the design and tuning of an RF accelerating structure with variable cell lengths.
EMILiO: a fast algorithm for genome-scale strain design.
Yang, Laurence; Cluett, William R; Mahadevan, Radhakrishnan
2011-05-01
Systems-level design of cell metabolism is becoming increasingly important for renewable production of fuels, chemicals, and drugs. Computational models are improving in the accuracy and scope of predictions, but are also growing in complexity. Consequently, efficient and scalable algorithms are increasingly important for strain design. Previous algorithms helped to consolidate the utility of computational modeling in this field. To meet intensifying demands for high-performance strains, both the number and variety of genetic manipulations involved in strain construction are increasing. Existing algorithms have experienced combinatorial increases in computational complexity when applied toward the design of such complex strains. Here, we present EMILiO, a new algorithm that increases the scope of strain design to include reactions with individually optimized fluxes. Unlike existing approaches that would experience an explosion in complexity to solve this problem, we efficiently generated numerous alternate strain designs producing succinate, l-glutamate and l-serine. This was enabled by successive linear programming, a technique new to the area of computational strain design. Copyright © 2011 Elsevier Inc. All rights reserved.
Kazmier, Kelli; Alexander, Nathan S.; Meiler, Jens; Mchaourab, Hassane S.
2010-01-01
A hybrid protein structure determination approach combining sparse Electron Paramagnetic Resonance (EPR) distance restraints and Rosetta de novo protein folding has been previously demonstrated to yield high quality models (Alexander et al., 2008). However, widespread application of this methodology to proteins of unknown structures is hindered by the lack of a general strategy to place spin label pairs in the primary sequence. In this work, we report the development of an algorithm that optimally selects spin labeling positions for the purpose of distance measurements by EPR. For the α-helical subdomain of T4 lysozyme (T4L), simulated restraints that maximize sequence separation between the two spin labels while simultaneously ensuring pairwise connectivity of secondary structure elements yielded vastly improved models by Rosetta folding. 50% of all these models have the correct fold compared to only 21% and 8% correctly folded models when randomly placed restraints or no restraints are used, respectively. Moreover, the improvements in model quality require a limited number of optimized restraints, the number of which is determined by the pairwise connectivities of T4L α-helices. The predicted improvement in Rosetta model quality was verified by experimental determination of distances between spin labels pairs selected by the algorithm. Overall, our results reinforce the rationale for the combined use of sparse EPR distance restraints and de novo folding. By alleviating the experimental bottleneck associated with restraint selection, this algorithm sets the stage for extending computational structure determination to larger, traditionally elusive protein topologies of critical structural and biochemical importance. PMID:21074624
Parente, Daniel J; Ray, J Christian J; Swint-Kruse, Liskin
2015-12-01
As proteins evolve, amino acid positions key to protein structure or function are subject to mutational constraints. These positions can be detected by analyzing sequence families for amino acid conservation or for coevolution between pairs of positions. Coevolutionary scores are usually rank-ordered and thresholded to reveal the top pairwise scores, but they also can be treated as weighted networks. Here, we used network analyses to bypass a major complication of coevolution studies: For a given sequence alignment, alternative algorithms usually identify different, top pairwise scores. We reconciled results from five commonly-used, mathematically divergent algorithms (ELSC, McBASC, OMES, SCA, and ZNMI), using the LacI/GalR and 1,6-bisphosphate aldolase protein families as models. Calculations used unthresholded coevolution scores from which column-specific properties such as sequence entropy and random noise were subtracted; "central" positions were identified by calculating various network centrality scores. When compared among algorithms, network centrality methods, particularly eigenvector centrality, showed markedly better agreement than comparisons of the top pairwise scores. Positions with large centrality scores occurred at key structural locations and/or were functionally sensitive to mutations. Further, the top central positions often differed from those with top pairwise coevolution scores: instead of a few strong scores, central positions often had multiple, moderate scores. We conclude that eigenvector centrality calculations reveal a robust evolutionary pattern of constraints-detectable by divergent algorithms--that occur at key protein locations. Finally, we discuss the fact that multiple patterns coexist in evolutionary data that, together, give rise to emergent protein functions. © 2015 Wiley Periodicals, Inc.