Matt: local flexibility aids protein multiple structure alignment.
Menke, Matthew; Berger, Bonnie; Cowen, Lenore
2008-01-01
Even when there is agreement on what measure a protein multiple structure alignment should be optimizing, finding the optimal alignment is computationally prohibitive. One approach used by many previous methods is aligned fragment pair chaining, where short structural fragments from all the proteins are aligned against each other optimally, and the final alignment chains these together in geometrically consistent ways. Ye and Godzik have recently suggested that adding geometric flexibility may help better model protein structures in a variety of contexts. We introduce the program Matt (Multiple Alignment with Translations and Twists), an aligned fragment pair chaining algorithm that, in intermediate steps, allows local flexibility between fragments: small translations and rotations are temporarily allowed to bring sets of aligned fragments closer, even if they are physically impossible under rigid body transformations. After a dynamic programming assembly guided by these "bent" alignments, geometric consistency is restored in the final step before the alignment is output. Matt is tested against other recent multiple protein structure alignment programs on the popular Homstrad and SABmark benchmark datasets. Matt's global performance is competitive with the other programs on Homstrad, but outperforms the other programs on SABmark, a benchmark of multiple structure alignments of proteins with more distant homology. On both datasets, Matt demonstrates an ability to better align the ends of alpha-helices and beta-strands, an important characteristic of any structure alignment program intended to help construct a structural template library for threading approaches to the inverse protein-folding problem. The related question of whether Matt alignments can be used to distinguish distantly homologous structure pairs from pairs of proteins that are not homologous is also considered. 
For this purpose, a p-value score based on the length of the common core and average root mean squared deviation (RMSD) of Matt alignments is shown to largely separate decoys from homologous protein structures in the SABmark benchmark dataset. We postulate that Matt's strong performance comes from its ability to model proteins in different conformational states and, perhaps even more important, its ability to model backbone distortions in more distantly related proteins.
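As an illustration of the RMSD quantity mentioned in the abstract above, the following is a minimal sketch, not Matt's implementation. It assumes the two structures are already superposed and that the aligned core positions are supplied as paired (x, y, z) tuples; the function name `rmsd` is hypothetical, and the p-value model itself is not reproduced here.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean squared deviation between paired (x, y, z) points.

    Assumption: the structures are already superposed and the lists are
    paired position by position (e.g. the common core of an alignment).
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared Euclidean distance between the paired positions.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))
```

For example, two identical coordinate lists give an RMSD of 0, and a single pair of points 5 Å apart gives an RMSD of 5.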
Hagopian, Raffi; Davidson, John R; Datta, Ruchira S; Samad, Bushra; Jarvis, Glen R; Sjölander, Kimmen
2010-07-01
We present the jump-start simultaneous alignment and tree construction using hidden Markov models (SATCHMO-JS) web server for simultaneous estimation of protein multiple sequence alignments (MSAs) and phylogenetic trees. The server takes as input a set of sequences in FASTA format, and outputs a phylogenetic tree and MSA; these can be viewed online or downloaded from the website. SATCHMO-JS is an extension of the SATCHMO algorithm, and employs a divide-and-conquer strategy to jump-start SATCHMO at a higher point in the phylogenetic tree, reducing the computational complexity of the progressive all-versus-all HMM-HMM scoring and alignment. Results on a benchmark dataset of 983 structurally aligned pairs from the PREFAB benchmark dataset show that SATCHMO-JS provides a statistically significant improvement in alignment accuracy over MUSCLE, Multiple Alignment using Fast Fourier Transform (MAFFT), ClustalW and the original SATCHMO algorithm. The SATCHMO-JS webserver is available at http://phylogenomics.berkeley.edu/satchmo-js. The datasets used in these experiments are available for download at http://phylogenomics.berkeley.edu/satchmo-js/supplementary/.
QUASAR--scoring and ranking of sequence-structure alignments.
Birzele, Fabian; Gewehr, Jan E; Zimmer, Ralf
2005-12-15
Sequence-structure alignments are a common means for protein structure prediction in the fields of fold recognition and homology modeling, and there is a broad variety of programs that provide such alignments based on sequence similarity, secondary structure or contact potentials. Nevertheless, finding the best sequence-structure alignment in a pool of alignments remains a difficult problem. QUASAR (quality of sequence-structure alignments ranking) provides a unifying framework for scoring sequence-structure alignments that aids finding well-performing combinations of well-known and custom-made scoring schemes. Those scoring functions can be benchmarked against widely accepted quality scores like MaxSub, TMScore, Touch and APDB, thus enabling users to test their own alignment scores against 'standard-of-truth' structure-based scores. Furthermore, individual score combinations can be optimized with respect to benchmark sets based on known structural relationships using QUASAR's in-built optimization routines.
Protein Identification Using Top-Down Spectra
Liu, Xiaowen; Sirotkin, Yakov; Shen, Yufeng; Anderson, Gordon; Tsai, Yihsuan S.; Ting, Ying S.; Goodlett, David R.; Smith, Richard D.; Bafna, Vineet; Pevzner, Pavel A.
2012-01-01
In the last two years, because of advances in protein separation and mass spectrometry, top-down mass spectrometry moved from analyzing single proteins to analyzing complex samples and identifying hundreds and even thousands of proteins. However, computational tools for database search of top-down spectra against protein databases are still in their infancy. We describe MS-Align+, a fast algorithm for top-down protein identification based on spectral alignment that enables searches for unexpected post-translational modifications. We also propose a method for evaluating statistical significance of top-down protein identifications and further benchmark various software tools on two top-down data sets from Saccharomyces cerevisiae and Salmonella typhimurium. We demonstrate that MS-Align+ significantly increases the number of identified spectra as compared with MASCOT and OMSSA on both data sets. Although MS-Align+ and ProSightPC have similar performance on the Salmonella typhimurium data set, MS-Align+ outperforms ProSightPC on the (more complex) Saccharomyces cerevisiae data set. PMID:22027200
SVM-dependent pairwise HMM: an application to protein pairwise alignments.
Orlando, Gabriele; Raimondi, Daniele; Khan, Taushif; Lenaerts, Tom; Vranken, Wim F
2017-12-15
Methods able to provide reliable protein alignments are crucial for many bioinformatics applications. In recent years many different algorithms have been developed, and various kinds of information, from sequence conservation to secondary structure, have been used to improve alignment performance. This is especially relevant for proteins with highly divergent sequences. However, recent work suggests that different features may have different importance in diverse protein classes, and it would be an advantage to have more customizable approaches capable of dealing with different alignment definitions. Here we present Rigapollo, a highly flexible pairwise alignment method based on a pairwise HMM-SVM that can use any type of information to build alignments. Rigapollo lets the user decide the optimal features to align their protein class of interest. It outperforms current state-of-the-art methods on two well-known benchmark datasets when aligning highly divergent sequences. A Python implementation of the algorithm is available at http://ibsquare.be/rigapollo. Contact: wim.vranken@vub.be. Supplementary data are available at Bioinformatics online. © The Author (2017). Published by Oxford University Press.
Joseph, Agnel Praveen; Srinivasan, Narayanaswamy; de Brevern, Alexandre G
2012-09-01
Comparison of multiple protein structures has a broad range of applications in the analysis of protein structure, function and evolution. Multiple structure alignment tools (MSTAs) are necessary to obtain a simultaneous comparison of a family of related folds. In this study, we have developed a method for multiple structure comparison largely based on sequence alignment techniques. A widely used structural alphabet named Protein Blocks (PBs) was used to transform the information on 3D protein backbone conformation into a 1D sequence string. A progressive alignment strategy similar to CLUSTALW was adopted for multiple PB sequence alignment (mulPBA). Highly similar stretches identified by the pairwise alignments are given higher weights during the alignment. The residue equivalences from PB-based alignments are used to obtain a three-dimensional fit of the structures, followed by an iterative refinement of the structural superposition. Systematic comparisons using benchmark datasets of MSTAs underline that the alignment quality is better than that of MULTIPROT, MUSTANG and the alignments in HOMSTRAD in more than 85% of the cases. Comparison with other rigid-body and flexible MSTAs also indicates that mulPBA alignments are superior to most of the rigid-body MSTAs and highly comparable to the flexible alignment methods. Copyright © 2012 Elsevier Masson SAS.
Adaptive Local Realignment of Protein Sequences.
DeBlasio, Dan; Kececioglu, John
2018-06-11
While mutation rates can vary markedly over the residues of a protein, multiple sequence alignment tools typically use the same values for their scoring-function parameters across a protein's entire length. We present a new approach, called adaptive local realignment, that in contrast automatically adapts to the diversity of mutation rates along protein sequences. This builds upon a recent technique known as parameter advising, which finds global parameter settings for an aligner, to now adaptively find local settings. Our approach in essence identifies local regions with low estimated accuracy, constructs a set of candidate realignments using a carefully-chosen collection of parameter settings, and replaces the region if a realignment has higher estimated accuracy. This new method of local parameter advising, when combined with prior methods for global advising, boosts alignment accuracy as much as 26% over the best default setting on hard-to-align protein benchmarks, and by 6.4% over global advising alone. Adaptive local realignment has been implemented within the Opal aligner using the Facet accuracy estimator.
Dong, Runze; Pan, Shuo; Peng, Zhenling; Zhang, Yang; Yang, Jianyi
2018-05-21
With the rapid increase in the number of protein structures in the Protein Data Bank, it has become urgent to develop algorithms for efficient protein structure comparison. In this article, we present the mTM-align server, which consists of two closely related modules: one for structure database search and the other for multiple structure alignment. The database search is sped up by a heuristic algorithm and a hierarchical organization of the structures in the database. The multiple structure alignment is performed using the recently developed algorithm mTM-align. Benchmark tests demonstrate that our algorithms outperform peer methods for both modules in terms of speed and accuracy. One of the unique features of the server is the interplay between database search and multiple structure alignment. The server not only performs fast database searches but also builds accurate multiple structure alignments from the structures found by the search. A database search takes about 2-5 min for a structure of medium size (∼300 residues). A multiple structure alignment takes a few seconds for ∼10 structures of medium size. The server is freely available at: http://yanglab.nankai.edu.cn/mTM-align/.
Kinjo, Akira R.; Nakamura, Haruki
2012-01-01
Comparison and classification of protein structures are fundamental means to understand protein functions. Due to the computational difficulty and the ever-increasing amount of structural data, however, it is in general not feasible to perform exhaustive all-against-all structure comparisons necessary for comprehensive classifications. To efficiently handle such situations, we have previously proposed a method, now called GIRAF. We herein describe further improvements in the GIRAF protein structure search and alignment method. The GIRAF method achieves extremely efficient search of similar structures of ligand binding sites of proteins by exploiting database indexing of structural features of local coordinate frames. In addition, it produces refined atom-wise alignments by iterative applications of the Hungarian method to the bipartite graph defined for a pair of superimposed structures. By combining the refined alignments based on different local coordinate frames, it is made possible to align structures involving domain movements. We provide detailed accounts for the database design, the search and alignment algorithms as well as some benchmark results. PMID:27493524
CAB-Align: A Flexible Protein Structure Alignment Method Based on the Residue-Residue Contact Area.
Terashi, Genki; Takeda-Shitaka, Mayuko
2015-01-01
Proteins are flexible, and this flexibility has an essential functional role. Flexibility can be observed in loop regions, rearrangements between secondary structure elements, and conformational changes between entire domains. However, most protein structure alignment methods treat protein structures as rigid bodies. Thus, these methods fail to identify the equivalences of residue pairs in regions with flexibility. In this study, we considered that the evolutionary relationship between proteins corresponds directly to the residue-residue physical contacts rather than the three-dimensional (3D) coordinates of proteins. Thus, we developed a new protein structure alignment method, contact area-based alignment (CAB-align), which uses the residue-residue contact area to identify regions of similarity. The main purpose of CAB-align is to identify homologous relationships at the residue level between related protein structures. The CAB-align procedure comprises two main steps: First, a rigid-body alignment method based on local and global 3D structure superposition is employed to generate a sufficient number of initial alignments. Then, iterative dynamic programming is executed to find the optimal alignment. We evaluated the performance and advantages of CAB-align based on four main points: (1) agreement with the gold standard alignment, (2) alignment quality based on an evolutionary relationship without 3D coordinate superposition, (3) consistency of the multiple alignments, and (4) classification agreement with the gold standard classification. 
Comparisons of CAB-align with other state-of-the-art protein structure alignment methods (TM-align, FATCAT, and DaliLite) using our benchmark dataset showed that CAB-align performed robustly in obtaining high-quality alignments and generating consistent multiple alignments with high coverage and accuracy rates, and it performed extremely well when discriminating between homologous and nonhomologous pairs of proteins in both single and multi-domain comparisons. The CAB-align software is freely available to academic users as stand-alone software at http://www.pharm.kitasato-u.ac.jp/bmd/bmd/Publications.html.
AlignNemo: a local network alignment method to integrate homology and topology.
Ciriello, Giovanni; Mina, Marco; Guzzi, Pietro H; Cannataro, Mario; Guerra, Concettina
2012-01-01
Local network alignment is an important component of the analysis of protein-protein interaction networks that may lead to the identification of evolutionarily related complexes. We present AlignNemo, a new algorithm that, given the networks of two organisms, uncovers subnetworks of proteins that are related in biological function and topology of interactions. The discovered conserved subnetworks have a general topology and need not correspond to specific interaction patterns, so that they more closely fit the models of functional complexes proposed in the literature. The algorithm is able to handle sparse interaction data with an expansion process that at each step explores the local topology of the networks beyond the proteins directly interacting with the current solution. To assess the performance of AlignNemo, we ran a series of benchmarks using statistical measures as well as biological knowledge. Based on reference datasets of protein complexes, AlignNemo shows better performance than other methods in terms of both precision and recall. We show our solutions to be biologically sound using the concept of semantic similarity applied to Gene Ontology vocabularies. The binaries of AlignNemo and supplementary details about the algorithms and the experiments are available at: sourceforge.net/p/alignnemo.
Simple chained guide trees give high-quality protein multiple sequence alignments
Boyce, Kieran; Sievers, Fabian; Higgins, Desmond G.
2014-01-01
Guide trees are used to decide the order of sequence alignment in the progressive multiple sequence alignment heuristic. These guide trees are often the limiting factor in making large alignments, and considerable effort has been expended over the years in making these quickly or accurately. In this article we show that, at least for protein families with large numbers of sequences that can be benchmarked with known structures, simple chained guide trees give the most accurate alignments. These also happen to be the fastest and simplest guide trees to construct, computationally. Such guide trees have a striking effect on the accuracy of alignments produced by some of the most widely used alignment packages. There is a marked increase in accuracy and a marked decrease in computational time, once the number of sequences goes much above a few hundred. This is true, even if the order of sequences in the guide tree is random. PMID:25002495
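To make the notion of a chained guide tree concrete, here is a minimal sketch under stated assumptions: it builds a fully chained (caterpillar) tree in Newick format from a list of sequence names, so that each sequence is added to the growing alignment one at a time. The function name `chained_guide_tree` is hypothetical and this is not the code used in the study; shuffling `names` beforehand would give the random-order variant the abstract mentions.

```python
def chained_guide_tree(names):
    """Return a Newick string for a fully chained (caterpillar) guide tree.

    The first two names form the innermost pair; every further name is
    joined to the whole tree built so far, one sequence at a time.
    """
    if not names:
        raise ValueError("need at least one sequence name")
    tree = names[0]
    for name in names[1:]:
        # Wrap the tree so far and attach the next sequence as a sibling.
        tree = f"({tree},{name})"
    return tree + ";"
```

For example, the names A, B, C, D yield `(((A,B),C),D);`, which a progressive aligner would read as: align A with B, then add C, then add D.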
Brown, Peter; Pullan, Wayne; Yang, Yuedong; Zhou, Yaoqi
2016-02-01
The three-dimensional tertiary structure of a protein at near-atomic resolution provides insight into its function and evolution. Because a protein's structure determines its function, similarity in structure usually implies similarity in function. As such, structure alignment techniques are often useful in the classification of protein function. Given the rapidly growing number of new, experimentally determined structures being made available from repositories such as the Protein Data Bank, fast and accurate computational structure comparison tools are required. This paper presents SPalignNS, a non-sequential protein structure alignment tool using a novel asymmetrical greedy search technique. The performance of SPalignNS was evaluated against existing sequential and non-sequential structure alignment methods by performing trials with commonly used datasets. The benchmark datasets used to gauge alignment accuracy include (i) 9538 pairwise alignments implied by the HOMSTRAD database of homologous proteins; (ii) a subset of 64 difficult alignments from set (i) that have low structure similarity; (iii) 199 pairwise alignments of proteins with similar structure but different topology; and (iv) a subset of 20 pairwise alignments from the RIPC set. SPalignNS is shown to achieve greater alignment accuracy (lower or comparable root-mean-squared distance with increased structure overlap coverage) for all datasets, and the highest agreement with reference alignments from the challenging dataset (iv) above, when compared with both sequentially constrained alignments and other non-sequential alignments. SPalignNS was implemented in C++. The source code, binary executable, and a web server version are freely available at: http://sparks-lab.org. Contact: yaoqi.zhou@griffith.edu.au. © The Author 2015. Published by Oxford University Press.
The twilight zone of cis element alignments.
Sebastian, Alvaro; Contreras-Moreira, Bruno
2013-02-01
Sequence alignment of proteins and nucleic acids is a routine task in bioinformatics. Although the comparison of complete peptides, genes or genomes can be undertaken with a great variety of tools, the alignment of short DNA sequences and motifs entails pitfalls that have not been fully addressed yet. Here we confront the structural superposition of transcription factors with the sequence alignment of their recognized cis elements. Our goals are (i) to test TFcompare (http://floresta.eead.csic.es/tfcompare), a structural alignment method for protein-DNA complexes; (ii) to benchmark the pairwise alignment of regulatory elements; (iii) to define the confidence limits and the twilight zone of such alignments; and (iv) to evaluate the relevance of these thresholds with elements obtained experimentally. We find that the structure of cis elements and protein-DNA interfaces is significantly more conserved than their sequence, and we measure how this correlates with alignment errors when only sequence information is considered. Our results confirm that DNA motifs in the form of matrices produce better alignments than individual sequences. Finally, we report that empirical and theoretically derived twilight thresholds are useful for estimating the natural plasticity of regulatory sequences, and hence for filtering out unreliable alignments.
Li, Yang; Yang, Jianyi
2017-04-24
The prediction of protein-ligand binding affinity has recently been improved remarkably by machine-learning-based scoring functions. For example, using a set of simple descriptors representing the atomic distance counts, the RF-Score improves the Pearson correlation coefficient to about 0.8 on the core set of the PDBbind 2007 database, which is significantly higher than the performance of any conventional scoring function on the same benchmark. A few studies have discussed the performance of machine-learning-based methods, but the reason for this improvement remains unclear. In this study, by systematically controlling the structural and sequence similarity between the training and test proteins of the PDBbind benchmark, we demonstrate that protein structural and sequence similarity has a significant impact on machine-learning-based methods. After removal of training proteins that are highly similar to the test proteins, as identified by structure alignment and sequence alignment, machine-learning-based methods trained on the new training sets no longer outperform the conventional scoring functions. In contrast, the performance of conventional functions like X-Score is relatively stable no matter what training data are used to fit the weights of its energy terms.
Protein docking by the interface structure similarity: how much structure is needed?
Sinha, Rohita; Kundrotas, Petras J; Vakser, Ilya A
2012-01-01
The increasing availability of co-crystallized protein-protein complexes provides an opportunity to use template-based modeling for protein-protein docking. Structure alignment techniques are useful in the detection of remote target-template similarities. The size of the structure involved in the alignment is important for success in modeling. This paper describes a systematic large-scale study to find the optimal definition and size of the interfaces for structure alignment-based docking applications. The results showed that structural areas corresponding to cutoff values <12 Å across the interface inadequately represent structural details of the interfaces. With the increase of the cutoff beyond 12 Å, the success rate for the benchmark set of 99 protein complexes did not increase significantly for higher-accuracy models, and decreased for lower-accuracy models. The 12 Å cutoff was optimal in our interface alignment-based docking, and is a likely best choice for large-scale (e.g., on the scale of the entire genome) applications to protein interaction networks. The results provide guidelines for docking approaches, including high-throughput applications to modeled structures.
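The idea of defining an interface by a distance cutoff can be sketched as follows. This is an illustration under assumptions, not the authors' procedure: the abstract does not say which atoms are measured, so the sketch takes one representative point per residue, and the function name `interface_residues` and the input layout are hypothetical.

```python
import math

def interface_residues(chain_a, chain_b, cutoff=12.0):
    """Return residues of each chain lying within `cutoff` Å of the other chain.

    chain_a, chain_b: dicts mapping residue id -> (x, y, z) of one
    representative atom per residue (an assumption of this sketch).
    """
    near_a, near_b = set(), set()
    for id_a, pos_a in chain_a.items():
        for id_b, pos_b in chain_b.items():
            # A residue pair spanning the interface within the cutoff
            # marks both residues as interface residues.
            if math.dist(pos_a, pos_b) <= cutoff:
                near_a.add(id_a)
                near_b.add(id_b)
    return near_a, near_b
```

Raising `cutoff` enlarges the structural area used for alignment, which is the parameter the study varies around the 12 Å value.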
CORAL: aligning conserved core regions across domain families.
Fong, Jessica H; Marchler-Bauer, Aron
2009-08-01
Homologous protein families share highly conserved sequence and structure regions that are frequent targets for comparative analysis of related proteins and families. Many protein families, such as the curated domain families in the Conserved Domain Database (CDD), exhibit similar structural cores. To improve accuracy in aligning such protein families, we propose a profile-profile method CORAL that aligns individual core regions as gap-free units. CORAL computes optimal local alignment of two profiles with heuristics to preserve continuity within core regions. We benchmarked its performance on curated domains in CDD, which have pre-defined core regions, against COMPASS, HHalign and PSI-BLAST, using structure superpositions and comprehensive curator-optimized alignments as standards of truth. CORAL improves alignment accuracy on core regions over general profile methods, returning a balanced score of 0.57 for over 80% of all domain families in CDD, compared with the highest balanced score of 0.45 from other methods. Further, CORAL provides E-values to aid in detecting homologous protein families and, by respecting block boundaries, produces alignments with improved 'readability' that facilitate manual refinement. CORAL will be included in future versions of the NCBI Cn3D/CDTree software, which can be downloaded at http://www.ncbi.nlm.nih.gov/Structure/cdtree/cdtree.shtml. Supplementary data are available at Bioinformatics online.
Floden, Evan W; Tommaso, Paolo D; Chatzou, Maria; Magis, Cedrik; Notredame, Cedric; Chang, Jia-Ming
2016-07-08
The PSI/TM-Coffee web server performs multiple sequence alignment (MSA) of proteins by combining homology extension with a consistency-based alignment approach. Homology extension is performed with Position Specific Iterative (PSI) BLAST searches against a choice of redundant and non-redundant databases. The main novelty of this server is that it allows databases of reduced complexity to be used for rapid homology extension. The server also makes it possible to use transmembrane protein (TMP) reference databases for even faster homology extension on this important category of proteins. Aside from an MSA, the server also outputs a topological prediction of TMPs using the HMMTOP algorithm. Previous benchmarking of the method has shown that this approach outperforms the most accurate alignment methods, such as MSAProbs, Kalign, PROMALS, MAFFT, ProbCons and PRALINE™. The web server is available at http://tcoffee.crg.cat/tmcoffee. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
Standley, Daron M; Toh, Hiroyuki; Nakamura, Haruki
2008-09-01
A method to functionally annotate structural genomics targets, based on a novel structural alignment scoring function, is proposed. In the proposed score, position-specific scoring matrices are used to weight structurally aligned residue pairs to highlight evolutionarily conserved motifs. The functional form of the score is first optimized for discriminating domains belonging to the same Pfam family from domains belonging to different families but the same CATH or SCOP superfamily. In the optimization stage, we consider four standard weighting functions as well as our own, the "maximum substitution probability," and combinations of these functions. The optimized score achieves an area of 0.87 under the receiver-operating characteristic curve with respect to identifying Pfam families within a sequence-unique benchmark set of domain pairs. Confidence measures are then derived from the benchmark distribution of true-positive scores. The alignment method is next applied to the task of functionally annotating 230 query proteins released to the public as part of the Protein 3000 structural genomics project in Japan. Of these queries, 78 were found to align to templates with the same Pfam family as the query or had sequence identities ≥ 30%. Another 49 queries were found to match more distantly related templates. Within this group, the template predicted by our method to be the closest functional relative was often not the most structurally similar. Several nontrivial cases are discussed in detail. Finally, 103 queries matched templates at the fold level, but not the family or superfamily level, and remain functionally uncharacterized. © 2008 Wiley-Liss, Inc.
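The area under the receiver-operating characteristic curve quoted above can be computed directly from two sets of scores. The sketch below is a generic illustration, not the authors' evaluation code: it uses the standard rank interpretation of the AUC (the probability that a randomly chosen positive pair outscores a randomly chosen negative pair, with ties counted as one half), and the function name `roc_auc` is hypothetical.

```python
def roc_auc(pos_scores, neg_scores):
    """Area under the ROC curve from scores of positive and negative pairs.

    Computed as the fraction of (positive, negative) score pairs in which
    the positive scores higher; ties contribute one half.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

In the setting of the abstract, positives would be domain pairs from the same Pfam family and negatives would be pairs from different families within the same superfamily; a value of 1.0 means perfect separation and 0.5 means no better than chance.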
A benchmark testing ground for integrating homology modeling and protein docking.
Bohnuud, Tanggis; Luo, Lingqi; Wodak, Shoshana J; Bonvin, Alexandre M J J; Weng, Zhiping; Vajda, Sandor; Schueler-Furman, Ora; Kozakov, Dima
2017-01-01
Protein docking procedures carry out the task of predicting the structure of a protein-protein complex starting from the known structures of the individual protein components. More often than not, however, the structure of one or both components is not known, but can be derived by homology modeling on the basis of known structures of related proteins deposited in the Protein Data Bank (PDB). Thus, the problem is to develop methods that optimally integrate homology modeling and docking with the goal of predicting the structure of a complex directly from the amino acid sequences of its component proteins. One possibility is to use the best available homology modeling and docking methods. However, the models built for the individual subunits often differ to a significant degree from the bound conformation in the complex, often much more so than the differences observed between free and bound structures of the same protein, and therefore additional conformational adjustments, both at the backbone and side chain levels need to be modeled to achieve an accurate docking prediction. In particular, even homology models of overall good accuracy frequently include localized errors that unfavorably impact docking results. The predicted reliability of the different regions in the model can also serve as a useful input for the docking calculations. Here we present a benchmark dataset that should help to explore and solve combined modeling and docking problems. This dataset comprises a subset of the experimentally solved 'target' complexes from the widely used Docking Benchmark from the Weng Lab (excluding antibody-antigen complexes). This subset is extended to include the structures from the PDB related to those of the individual components of each complex, and hence represent potential templates for investigating and benchmarking integrated homology modeling and docking approaches. 
Template sets can be dynamically customized by specifying ranges in sequence similarity and in PDB release dates, or using other filtering options, such as excluding sets of specific structures from the template list. Multiple sequence alignments, as well as structural alignments of the templates to their corresponding subunits in the target are also provided. The resource is accessible online or can be downloaded at http://cluspro.org/benchmark, and is updated on a weekly basis in synchrony with new PDB releases. Proteins 2016; 85:10-16. © 2016 Wiley Periodicals, Inc.
Ligand Binding Site Detection by Local Structure Alignment and Its Performance Complementarity
Lee, Hui Sun; Im, Wonpil
2013-01-01
Accurate determination of potential ligand binding sites (BS) is a key step for protein function characterization and structure-based drug design. Despite promising results of template-based BS prediction methods using global structure alignment (GSA), there is room to improve the performance by properly incorporating local structure alignment (LSA), because BS are local structures and are often similar for proteins with dissimilar global folds. We present a template-based ligand BS prediction method using G-LoSA, our LSA tool. A large benchmark set validation shows that G-LoSA predicts drug-like ligands’ positions in single-chain protein targets more precisely than TM-align, a GSA-based method, while the overall success rate of TM-align is better. G-LoSA is particularly efficient for accurate detection of local structures conserved across proteins with diverse global topologies. Recognizing the performance complementarity of G-LoSA to TM-align and a non-template geometry-based method, fpocket, we developed a robust consensus scoring method, CMCS-BSP (Complementary Methods and Consensus Scoring for ligand Binding Site Prediction), which shows improved prediction accuracy. The G-LoSA source code is freely available at http://im.bioinformatics.ku.edu/GLoSA. PMID:23957286
Tan, Yen Hock; Huang, He; Kihara, Daisuke
2006-08-15
Aligning distantly related protein sequences is a long-standing problem in bioinformatics, and a key for successful protein structure prediction. Its importance has been increasing recently in the context of structural genomics projects because more and more experimentally solved structures are available as templates for protein structure modeling. Toward this end, recent structure prediction methods employ profile-profile alignments, and various ways of aligning two profiles have been developed. More fundamentally, a better amino acid similarity matrix can improve a profile itself, thereby resulting in more accurate profile-profile alignments. Here we have developed novel amino acid similarity matrices from knowledge-based amino acid contact potentials. Contact potentials are used because the contact propensity to the other amino acids would be one of the most conserved features of each position of a protein structure. The derived amino acid similarity matrices are tested on benchmark alignments at three different levels, namely, the family, the superfamily, and the fold level. Compared to BLOSUM45 and the other existing matrices, the contact potential-based matrices perform comparably in the family level alignments, but clearly outperform them in the fold level alignments. The contact potential-based matrices perform even better when suboptimal alignments are considered. Comparing the matrices themselves with each other revealed that the contact potential-based matrices are very different from BLOSUM45 and the other matrices, indicating that they are located in a different basin in the amino acid similarity matrix space.
Representing and comparing protein structures as paths in three-dimensional space
Zhi, Degui; Krishna, S Sri; Cao, Haibo; Pevzner, Pavel; Godzik, Adam
2006-01-01
Background: Most existing formulations of protein structure comparison are based on detailed atomic level descriptions of protein structures and bypass potential insights that arise from a higher-level abstraction. Results: We propose a structure comparison approach based on a simplified representation of proteins that describes their three-dimensional path by local curvature along the generalized backbone of the polypeptide. We have implemented a dynamic programming procedure that aligns curvatures of proteins by optimizing a defined sum turning angle deviation measure. Conclusion: Although our procedure does not directly optimize global structural similarity as measured by RMSD, our benchmarking results indicate that it recovers surprisingly well the structural similarity defined by structure classification databases and traditional structure alignment programs. In addition, our program can recognize similarities between structures with extensive conformation changes that are beyond the ability of traditional structure alignment programs. We demonstrate the application of our procedure in several contexts of structure comparison. An implementation of our procedure, CURVE, is available as a public webserver. PMID:17052359
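The abstract above describes aligning backbone curvature profiles by dynamic programming. The following is a minimal sketch of that general idea, not the CURVE implementation: it assumes the turning angle at a residue is the angle between consecutive backbone segment vectors, scores a matched pair by the absolute difference of their angles, and uses a hypothetical constant gap penalty. The function names and the `gap` parameter are illustrative only.

```python
import math


def turning_angles(coords):
    """Angle (radians) between consecutive segment vectors of a 3D path.

    Assumes consecutive points are distinct (no zero-length segments).
    """
    angles = []
    for p, q, r in zip(coords, coords[1:], coords[2:]):
        u = [b - a for a, b in zip(p, q)]
        v = [b - a for a, b in zip(q, r)]
        norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
        cos_t = sum(x * y for x, y in zip(u, v)) / norm
        # Clamp to [-1, 1] to guard against rounding error before acos.
        angles.append(math.acos(max(-1.0, min(1.0, cos_t))))
    return angles


def angle_alignment_cost(a, b, gap=1.0):
    """Global DP alignment of two angle sequences.

    Returns the minimum total cost, where a matched pair costs
    |a_i - b_j| and each unmatched position costs `gap` (hypothetical).
    """
    n, m = len(a), len(b)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + abs(a[i - 1] - b[j - 1]),  # match
                dp[i - 1][j] + gap,                           # gap in b
                dp[i][j - 1] + gap,                           # gap in a
            )
    return dp[n][m]
```

Because the cost depends only on local angles, two paths that differ by a hinge motion can still score well, which is consistent with the abstract's remark about structures with extensive conformational changes.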
Template-Based Modeling of Protein-RNA Interactions.
Zheng, Jinfang; Kundrotas, Petras J; Vakser, Ilya A; Liu, Shiyong
2016-09-01
Protein-RNA complexes formed by specific recognition between RNA and RNA-binding proteins play an important role in biological processes. More than a thousand such proteins in humans have been curated, and many novel RNA-binding proteins remain to be discovered. Due to limitations of experimental approaches, computational techniques are needed for characterization of protein-RNA interactions. Although much progress has been made, adequate methodologies reliably providing atomic resolution structural details are still lacking. Although protein-RNA free docking approaches proved to be useful, in general, the template-based approaches provide higher quality of predictions. Templates are key to building a high quality model. Sequence/structure relationships were studied based on a representative set of binary protein-RNA complexes from PDB. Several approaches were tested for pairwise target/template alignment. The analysis revealed a transition point between random and correct binding modes. The results showed that structural alignment is better than sequence alignment in identifying good templates, suitable for generating protein-RNA complexes close to the native structure, and outperforms free docking, successfully predicting complexes where the free docking fails, including cases of significant conformational change upon binding. A template-based protein-RNA interaction modeling protocol PRIME was developed and benchmarked on a representative set of complexes.
SANSparallel: interactive homology search against Uniprot
Somervuo, Panu; Holm, Liisa
2015-01-01
Proteins evolve by mutations and natural selection. The network of sequence similarities is a rich source for mining homologous relationships that inform on protein structure and function. There are many servers available to browse the network of homology relationships but one has to wait up to a minute for results. The SANSparallel webserver provides protein sequence database searches with immediate response and professional alignment visualization by third-party software. The output is a list, pairwise alignment or stacked alignment of sequence-similar proteins from Uniprot, UniRef90/50, Swissprot or Protein Data Bank. The stacked alignments are viewed in Jalview or as sequence logos. The database search uses the suffix array neighborhood search (SANS) method, which has been re-implemented as a client-server, improved and parallelized. The method is extremely fast and as sensitive as BLAST above 50% sequence identity. Benchmarks show that the method is highly competitive compared to previously published fast database search programs: UBLAST, DIAMOND, LAST, LAMBDA, RAPSEARCH2 and BLAT. The web server can be accessed interactively or programmatically at http://ekhidna2.biocenter.helsinki.fi/cgi-bin/sans/sans.cgi. It can be used to make protein functional annotation pipelines more efficient, and it is useful in interactive exploration of the detailed evidence supporting the annotation of particular proteins of interest. PMID:25855811
Simulation-based comprehensive benchmarking of RNA-seq aligners
Baruzzo, Giacomo; Hayer, Katharina E; Kim, Eun Ji; Di Camillo, Barbara; FitzGerald, Garret A; Grant, Gregory R
2018-01-01
Alignment is the first step in most RNA-seq analysis pipelines, and the accuracy of downstream analyses depends heavily on it. Unlike most steps in the pipeline, alignment is particularly amenable to benchmarking with simulated data. We performed a comprehensive benchmarking of 14 common splice-aware aligners for base, read, and exon junction-level accuracy and compared default with optimized parameters. We found that performance varied by genome complexity, and accuracy and popularity were poorly correlated. The most widely cited tool underperforms for most metrics, particularly when using default settings. PMID:27941783
Wang, Lei; You, Zhu-Hong; Chen, Xing; Li, Jian-Qiang; Yan, Xin; Zhang, Wei; Huang, Yu-An
2017-01-01
Protein–protein interactions (PPIs) are not only a critical component of various biological processes in cells, but also the key to understanding the mechanisms leading to healthy and diseased states in organisms. However, it is time-consuming and cost-intensive to identify the interactions among proteins using biological experiments. Hence, developing more efficient computational methods rapidly became an attractive topic in the post-genomic era. In this paper, we propose a novel method for inferring protein-protein interactions from protein amino acid sequences only. Specifically, each protein amino acid sequence is first transformed into a Position-Specific Scoring Matrix (PSSM) generated by multiple sequence alignments; then the Pseudo PSSM is used to extract feature descriptors. Finally, an ensemble Rotation Forest (RF) learning system is trained to predict and recognize PPIs based solely on protein sequence features. When the proposed method was applied to three benchmark data sets (Yeast, H. pylori, and an independent dataset) for predicting PPIs, it achieved good average accuracies of 98.38%, 89.75%, and 96.25%, respectively. To further evaluate the prediction performance, we also compare the proposed method with other methods using the same benchmark data sets. The experimental results demonstrate that the proposed method consistently outperforms other state-of-the-art methods. Therefore, our method is effective and robust and can be taken as a useful tool in exploring and discovering new relationships between proteins. A web server is made publicly available at the URL http://202.119.201.126:8888/PsePSSM/ for academic use. PMID:28029645
Automatic Classification of Protein Structure Using the Maximum Contact Map Overlap Metric
Andonov, Rumen; Djidjev, Hristo Nikolov; Klau, Gunnar W.; ...
2015-10-09
In this paper, we propose a new distance measure for comparing two protein structures based on their contact map representations. We show that our novel measure, which we refer to as the maximum contact map overlap (max-CMO) metric, satisfies all properties of a metric on the space of protein representations. Having a metric in that space allows one to avoid pairwise comparisons on the entire database and, thus, to significantly accelerate exploring the protein space compared to no-metric spaces. We show on a gold standard superfamily classification benchmark set of 6759 proteins that our exact k-nearest neighbor (k-NN) scheme classifies up to 224 out of 236 queries correctly and on a larger, extended version of the benchmark with 60,850 additional structures, up to 1361 out of 1369 queries. Finally, our k-NN classification thus provides a promising approach for the automatic classification of protein structures based on flexible contact map overlap alignments.
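To make the notion of contact map overlap concrete, the sketch below builds a contact map from residue coordinates and counts the contacts shared by two structures under a given residue alignment. This is only the inner quantity: the max-CMO problem additionally maximizes this count over all order-preserving alignments, which is not attempted here. The distance threshold, the function names, and the dictionary representation of the alignment are assumptions for illustration, and `math.dist` requires Python 3.8 or later.

```python
from itertools import combinations
from math import dist


def contact_map(coords, threshold=7.5):
    """Set of residue index pairs (i, j), i < j, closer than `threshold`.

    `threshold` is a hypothetical cutoff in the same units as `coords`.
    """
    return {
        (i, j)
        for (i, p), (j, q) in combinations(enumerate(coords), 2)
        if dist(p, q) <= threshold
    }


def contact_overlap(contacts_a, contacts_b, alignment):
    """Count contacts of A that map onto contacts of B.

    `alignment` maps residue indices of A to residue indices of B;
    unaligned residues are simply absent from the dictionary.
    """
    shared = 0
    for i, j in contacts_a:
        if i in alignment and j in alignment:
            u, v = alignment[i], alignment[j]
            if (min(u, v), max(u, v)) in contacts_b:
                shared += 1
    return shared
```

For example, two identical four-residue chains with two contacts each share both contacts under the identity alignment, and only one if a single residue pair is left unaligned.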
The protein structure prediction problem could be solved using the current PDB library
Zhang, Yang; Skolnick, Jeffrey
2005-01-01
For single-domain proteins, we examine the completeness of the structures in the current Protein Data Bank (PDB) library for use in full-length model construction of unknown sequences. To address this issue, we employ a comprehensive benchmark set of 1,489 medium-size proteins that cover the PDB at the level of 35% sequence identity and identify templates by structure alignment. With homologous proteins excluded, we can always find similar folds to native with an average rms deviation (RMSD) from native of 2.5 Å with ≈82% alignment coverage. These template structures often contain a significant number of insertions/deletions. The TASSER algorithm was applied to build full-length models, where continuous fragments are excised from the top-scoring templates and reassembled under the guidance of an optimized force field, which includes consensus restraints taken from the templates and knowledge-based statistical potentials. For almost all targets (except for 2/1,489), the resultant full-length models have an RMSD to native below 6 Å (97% of them below 4 Å). On average, the RMSD of full-length models is 2.25 Å, with aligned regions improved from 2.5 Å to 1.88 Å, comparable with the accuracy of low-resolution experimental structures. Furthermore, starting from state-of-the-art structural alignments, we demonstrate a methodology that can consistently bring template-based alignments closer to native. These results are highly suggestive that the protein-folding problem can in principle be solved based on the current PDB library by developing efficient fold recognition algorithms that can recover such initial alignments. PMID:15653774
SANSparallel: interactive homology search against Uniprot.
Somervuo, Panu; Holm, Liisa
2015-07-01
Proteins evolve by mutations and natural selection. The network of sequence similarities is a rich source for mining homologous relationships that inform on protein structure and function. There are many servers available to browse the network of homology relationships but one has to wait up to a minute for results. The SANSparallel webserver provides protein sequence database searches with immediate response and professional alignment visualization by third-party software. The output is a list, pairwise alignment or stacked alignment of sequence-similar proteins from Uniprot, UniRef90/50, Swissprot or Protein Data Bank. The stacked alignments are viewed in Jalview or as sequence logos. The database search uses the suffix array neighborhood search (SANS) method, which has been re-implemented as a client-server, improved and parallelized. The method is extremely fast and as sensitive as BLAST above 50% sequence identity. Benchmarks show that the method is highly competitive compared to previously published fast database search programs: UBLAST, DIAMOND, LAST, LAMBDA, RAPSEARCH2 and BLAT. The web server can be accessed interactively or programmatically at http://ekhidna2.biocenter.helsinki.fi/cgi-bin/sans/sans.cgi. It can be used to make protein functional annotation pipelines more efficient, and it is useful in interactive exploration of the detailed evidence supporting the annotation of particular proteins of interest. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
Template-Based Modeling of Protein-RNA Interactions
Zheng, Jinfang; Kundrotas, Petras J.; Vakser, Ilya A.
2016-01-01
Protein-RNA complexes formed by specific recognition between RNA and RNA-binding proteins play an important role in biological processes. More than a thousand such proteins in humans have been curated, and many novel RNA-binding proteins remain to be discovered. Due to limitations of experimental approaches, computational techniques are needed for characterization of protein-RNA interactions. Although much progress has been made, adequate methodologies reliably providing atomic resolution structural details are still lacking. Although protein-RNA free docking approaches proved to be useful, in general, the template-based approaches provide higher quality of predictions. Templates are key to building a high quality model. Sequence/structure relationships were studied based on a representative set of binary protein-RNA complexes from PDB. Several approaches were tested for pairwise target/template alignment. The analysis revealed a transition point between random and correct binding modes. The results showed that structural alignment is better than sequence alignment in identifying good templates, suitable for generating protein-RNA complexes close to the native structure, and outperforms free docking, successfully predicting complexes where the free docking fails, including cases of significant conformational change upon binding. A template-based protein-RNA interaction modeling protocol PRIME was developed and benchmarked on a representative set of complexes. PMID:27662342
Accuracy Estimation and Parameter Advising for Protein Multiple Sequence Alignment
DeBlasio, Dan
2013-01-01
We develop a novel and general approach to estimating the accuracy of multiple sequence alignments without knowledge of a reference alignment, and use our approach to address a new task that we call parameter advising: the problem of choosing values for alignment scoring function parameters from a given set of choices to maximize the accuracy of a computed alignment. For protein alignments, we consider twelve independent features that contribute to a quality alignment. An accuracy estimator is learned that is a polynomial function of these features; its coefficients are determined by minimizing its error with respect to true accuracy using mathematical optimization. Compared to prior approaches for estimating accuracy, our new approach (a) introduces novel feature functions that measure nonlocal properties of an alignment yet are fast to evaluate, (b) considers more general classes of estimators beyond linear combinations of features, and (c) develops new regression formulations for learning an estimator from examples; in addition, for parameter advising, we (d) determine the optimal parameter set of a given cardinality, which specifies the best parameter values from which to choose. Our estimator, which we call Facet (for “feature-based accuracy estimator”), yields a parameter advisor that on the hardest benchmarks provides more than a 27% improvement in accuracy over the best default parameter choice, and for parameter advising significantly outperforms the best prior approaches to assessing alignment quality. PMID:23489379
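The parameter-advising loop described in this abstract can be sketched in a few lines. This is an illustrative outline under stated assumptions, not the Facet code: the estimator here is the simplest polynomial (degree one, i.e. a weighted sum of features), and `align`, `extract_features`, the feature names, and the coefficient values are all hypothetical placeholders that a caller would supply.

```python
def estimate_accuracy(features, coeffs, intercept=0.0):
    """Degree-one polynomial estimator: intercept + sum of coeff * feature.

    `features` and `coeffs` are dicts keyed by (hypothetical) feature name.
    """
    return intercept + sum(coeffs[name] * value for name, value in features.items())


def advise(parameter_choices, align, extract_features, coeffs):
    """Return (best_choice, estimated_accuracy) over a set of parameter choices.

    `align(choice)` computes an alignment under one parameter choice and
    `extract_features(alignment)` returns its feature dict; both are
    caller-supplied stand-ins for a real aligner and feature functions.
    """
    best_choice, best_score = None, float("-inf")
    for choice in parameter_choices:
        alignment = align(choice)
        score = estimate_accuracy(extract_features(alignment), coeffs)
        if score > best_score:
            best_choice, best_score = choice, score
    return best_choice, best_score
```

The key design point, following the abstract, is that no reference alignment is needed at advising time: each candidate alignment is ranked only by its estimated accuracy.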
MUSCLE: multiple sequence alignment with high accuracy and high throughput.
Edgar, Robert C
2004-01-01
We describe MUSCLE, a new computer program for creating multiple alignments of protein sequences. Elements of the algorithm include fast distance estimation using kmer counting, progressive alignment using a new profile function we call the log-expectation score, and refinement using tree-dependent restricted partitioning. The speed and accuracy of MUSCLE are compared with T-Coffee, MAFFT and CLUSTALW on four test sets of reference alignments: BAliBASE, SABmark, SMART and a new benchmark, PREFAB. MUSCLE achieves the highest, or joint highest, rank in accuracy on each of these sets. Without refinement, MUSCLE achieves average accuracy statistically indistinguishable from T-Coffee and MAFFT, and is the fastest of the tested methods for large numbers of sequences, aligning 5000 sequences of average length 350 in 7 min on a current desktop computer. The MUSCLE program, source code and PREFAB test data are freely available at http://www.drive5.com/muscle.
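The "fast distance estimation using kmer counting" mentioned above avoids computing any alignment. A minimal sketch of one such alignment-free distance follows; it assumes a simple definition (one minus the fraction of k-mers shared, counted with multiplicity and normalized by the shorter sequence), which is a common formulation but is not claimed to be MUSCLE's exact formula. The function names and the default `k=3` are illustrative.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Multiset of all overlapping k-mers in `seq`."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a, b, k=3):
    """Alignment-free distance in [0, 1] based on shared k-mer counts."""
    shortest = min(len(a), len(b))
    if shortest < k:
        raise ValueError("both sequences must contain at least one k-mer")
    # Counter & Counter keeps the minimum count of each k-mer.
    shared = sum((kmer_counts(a, k) & kmer_counts(b, k)).values())
    return 1.0 - shared / (shortest - k + 1)
```

Such distances take time roughly linear in sequence length, which is why they are attractive for building an initial guide tree over thousands of sequences before any pairwise alignment is computed.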
GOSSIP: a method for fast and accurate global alignment of protein structures.
Kifer, I; Nussinov, R; Wolfson, H J
2011-04-01
The database of known protein structures (PDB) is increasing rapidly. This results in a growing need for methods that can cope with the vast amount of structural data. To analyze the accumulating data, it is important to have a fast tool for identifying similar structures and clustering them by structural resemblance. Several excellent tools have been developed for the comparison of protein structures. These usually address the task of local structure alignment, an important yet computationally intensive problem due to its complexity. It is difficult to use such tools for comparing a large number of structures to each other in a reasonable time. Here we present GOSSIP, a novel method for a global all-against-all alignment of any set of protein structures. The method detects similarities between structures down to a certain cutoff (a parameter of the program), hence allowing it to detect similar structures at a much higher speed than local structure alignment methods. GOSSIP compares many structures several orders of magnitude faster than well-known available structure alignment servers, and it is also faster than a database scanning method. We evaluate GOSSIP both on a dataset of short structural fragments and on two large sequence-diverse structural benchmarks. Our conclusions are that for a threshold of 0.6 and above, the speed of GOSSIP is obtained with no compromise of the accuracy of the alignments or of the number of detected global similarities. A server, as well as an executable for download, are available at http://bioinfo3d.cs.tau.ac.il/gossip/.
Galpert, Deborah; Fernández, Alberto; Herrera, Francisco; Antunes, Agostinho; Molina-Ruiz, Reinaldo; Agüero-Chapin, Guillermin
2018-05-03
The development of new ortholog detection algorithms and the improvement of existing ones are of major importance in functional genomics. We have previously introduced a successful supervised pairwise ortholog classification approach implemented in a big data platform that considered several pairwise protein features and the low ortholog pair ratios found between two annotated proteomes (Galpert, D et al., BioMed Research International, 2015). The supervised models were built and tested using a Saccharomycete yeast benchmark dataset proposed by Salichos and Rokas (2011). Although several pairwise protein features were combined in a supervised big data approach, they were all, to some extent, alignment-based features, and the proposed algorithms were evaluated on a single test set. Here, we aim to evaluate the impact of alignment-free features on the performance of supervised models implemented in the Spark big data platform for pairwise ortholog detection in several related yeast proteomes. The Spark Random Forest and Decision Trees with oversampling and undersampling techniques, built with only alignment-based similarity measures or combined with several alignment-free pairwise protein features, showed the highest classification performance for ortholog detection in three yeast proteome pairs. Although such supervised approaches outperformed traditional methods, there were no significant differences between the exclusive use of alignment-based similarity measures and their combination with alignment-free features, even within the twilight zone of the studied proteomes. Only when alignment-based and alignment-free features were combined in Spark Decision Trees with imbalance management could a higher success rate (98.71%) within the twilight zone be achieved for a yeast proteome pair that underwent a whole genome duplication.
The feature selection study showed that alignment-based features were top-ranked for the best classifiers while the runners-up were alignment-free features related to amino acid composition. The incorporation of alignment-free features in supervised big data models did not significantly improve ortholog detection in yeast proteomes regarding the classification qualities achieved with just alignment-based similarity measures. However, the similarity of their classification performance to that of traditional ortholog detection methods encourages the evaluation of other alignment-free protein pair descriptors in future research.
Yamada, Kazunori D.; Tomii, Kentaro; Katoh, Kazutaka
2016-01-01
Motivation: Large multiple sequence alignments (MSAs), consisting of thousands of sequences, are becoming more and more common, due to advances in sequencing technologies. The MAFFT MSA program has several options for building large MSAs, but their performance has not been sufficiently assessed yet, because realistic benchmarking of large MSAs has been difficult. Recently, such assessments have been made possible through the HomFam and ContTest benchmark protein datasets. Along with the development of these datasets, an interesting theory was proposed: chained guide trees increase the accuracy of MSAs of structurally conserved regions. This theory challenges the basis of progressive alignment methods and needs to be examined by being compared with other known methods including computationally intensive ones. Results: We used HomFam, ContTest and OXFam (an extended version of OXBench) to evaluate several methods enabled in MAFFT: (1) a progressive method with approximate guide trees, (2) a progressive method with chained guide trees, (3) a combination of an iterative refinement method and a progressive method and (4) a less approximate progressive method that uses a rigorous guide tree and consistency score. Other programs, Clustal Omega and UPP, available for large MSAs, were also included in the comparison. The effect of method 2 (chained guide trees) was positive in ContTest but negative in HomFam and OXFam. Methods 3 and 4 increased the benchmark scores more consistently than method 2 for the three datasets, suggesting that they are safer to use. Availability and Implementation: http://mafft.cbrc.jp/alignment/software/ Contact: katoh@ifrec.osaka-u.ac.jp Supplementary information: Supplementary data are available at Bioinformatics online. PMID:27378296
The twilight zone of cis element alignments
Sebastian, Alvaro; Contreras-Moreira, Bruno
2013-01-01
Sequence alignment of proteins and nucleic acids is a routine task in bioinformatics. Although the comparison of complete peptides, genes or genomes can be undertaken with a great variety of tools, the alignment of short DNA sequences and motifs entails pitfalls that have not been fully addressed yet. Here we confront the structural superposition of transcription factors with the sequence alignment of their recognized cis elements. Our goals are (i) to test TFcompare (http://floresta.eead.csic.es/tfcompare), a structural alignment method for protein–DNA complexes; (ii) to benchmark the pairwise alignment of regulatory elements; (iii) to define the confidence limits and the twilight zone of such alignments and (iv) to evaluate the relevance of these thresholds with elements obtained experimentally. We find that the structure of cis elements and protein–DNA interfaces is significantly more conserved than their sequence and measure how this correlates with alignment errors when only sequence information is considered. Our results confirm that DNA motifs in the form of matrices produce better alignments than individual sequences. Finally, we report that empirical and theoretically derived twilight thresholds are useful for estimating the natural plasticity of regulatory sequences, and hence for filtering out unreliable alignments. PMID:23268451
Nair, Pradeep S; John, Eugene B
2007-01-01
Aligning specific sequences against a very large number of other sequences is a central aspect of bioinformatics. With the widespread availability of personal computers in biology laboratories, sequence alignment is now often performed locally. This makes it necessary to analyse the performance of personal computers for sequence aligning bioinformatics benchmarks. In this paper, we analyse the performance of a personal computer for the popular BLAST and FASTA sequence alignment suites. Results indicate that these benchmarks have a large number of recurring operations and use memory operations extensively. It seems that the performance can be improved with a bigger L1-cache.
The Zoo, Benchmarks & You: How To Reach the Oregon State Benchmarks with Zoo Resources.
ERIC Educational Resources Information Center
2002
This document aligns Oregon state educational benchmarks and standards with Oregon Zoo resources. Benchmark areas examined include English, mathematics, science, social studies, and career and life roles. Brief descriptions of the programs offered by the zoo are presented. (SOE)
Data processing has major impact on the outcome of quantitative label-free LC-MS analysis.
Chawade, Aakash; Sandin, Marianne; Teleman, Johan; Malmström, Johan; Levander, Fredrik
2015-02-06
High-throughput multiplexed protein quantification using mass spectrometry is steadily increasing in popularity, with the two major techniques being data-dependent acquisition (DDA) and targeted acquisition using selected reaction monitoring (SRM). However, both techniques involve extensive data processing, which can be performed by a multitude of different software solutions. Analysis of quantitative LC-MS/MS data is mainly performed in three major steps: processing of raw data, normalization, and statistical analysis. To evaluate the impact of data processing steps, we developed two new benchmark data sets, one each for DDA and SRM, with samples consisting of a long-range dilution series of synthetic peptides spiked in a total cell protein digest. The generated data were processed by eight different software workflows and three postprocessing steps. The results show that the choice of the raw data processing software and the postprocessing steps plays an important role in the final outcome. Also, the linear dynamic range of the DDA data could be extended by an order of magnitude through feature alignment and a charge state merging algorithm proposed here. Furthermore, the benchmark data sets are made publicly available for further benchmarking and software developments.
Puton, Tomasz; Kozlowski, Lukasz P.; Rother, Kristian M.; Bujnicki, Janusz M.
2013-01-01
We present a continuous benchmarking approach for the assessment of RNA secondary structure prediction methods implemented in the CompaRNA web server. As of 3 October 2012, the performance of 28 single-sequence and 13 comparative methods has been evaluated on RNA sequences/structures released weekly by the Protein Data Bank. We also provide a static benchmark generated on RNA 2D structures derived from the RNAstrand database. Benchmarks on both data sets offer insight into the relative performance of RNA secondary structure prediction methods on RNAs of different size and with respect to different types of structure. According to our tests, on the average, the most accurate predictions obtained by a comparative approach are generated by CentroidAlifold, MXScarna, RNAalifold and TurboFold. On the average, the most accurate predictions obtained by single-sequence analyses are generated by CentroidFold, ContextFold and IPknot. The best comparative methods typically outperform the best single-sequence methods if an alignment of homologous RNA sequences is available. This article presents the results of our benchmarks as of 3 October 2012, whereas the rankings presented online are continuously updated. We will gladly include new prediction methods and new measures of accuracy in the new editions of CompaRNA benchmarks. PMID:23435231
Lee, Hui Sun; Im, Wonpil
2016-04-01
Molecular recognition by protein mostly occurs in a local region on the protein surface. Thus, an efficient computational method for accurate characterization of protein local structural conservation is necessary to better understand biology and drug design. We present a novel local structure alignment tool, G-LoSA. G-LoSA aligns protein local structures in a sequence order independent way and provides a GA-score, a chemical feature-based and size-independent structure similarity score. Our benchmark validation shows the robust performance of G-LoSA to the local structures of diverse sizes and characteristics, demonstrating its universal applicability to local structure-centric comparative biology studies. In particular, G-LoSA is highly effective in detecting conserved local regions on the entire surface of a given protein. In addition, the applications of G-LoSA to identifying template ligands and predicting ligand and protein binding sites illustrate its strong potential for computer-aided drug design. We hope that G-LoSA can be a useful computational method for exploring interesting biological problems through large-scale comparison of protein local structures and facilitating drug discovery research and development. G-LoSA is freely available to academic users at http://im.compbio.ku.edu/GLoSA/. © 2016 The Protein Society.
Do Plants Contain G Protein-Coupled Receptors?
Taddese, Bruck; Upton, Graham J.G.; Bailey, Gregory R.; Jordan, Siân R.D.; Abdulla, Nuradin Y.; Reeves, Philip J.; Reynolds, Christopher A.
2014-01-01
Whether G protein-coupled receptors (GPCRs) exist in plants is a fundamental biological question. Interest in deorphanizing new GPCRs arises because of their importance in signaling. Within plants, this is controversial, as genome analysis has identified 56 putative GPCRs, including G protein-coupled receptor1 (GCR1), which is reportedly a remote homolog to class A, B, and E GPCRs. Of these, GCR2 is not a GPCR; more recently, it has been proposed that none are, not even GCR1. We have addressed this disparity between genome analysis and biological evidence through a structural bioinformatics study, involving fold recognition methods, from which only GCR1 emerges as a strong candidate. To further probe GCR1, we have developed a novel helix-alignment method, which has been benchmarked against the class A-class B-class F GPCR alignments. In addition, we have presented a mutually consistent set of alignments of GCR1 homologs to class A, class B, and class F GPCRs and shown that GCR1 is closer to class A and/or class B GPCRs than class A, class B, or class F GPCRs are to each other. To further probe GCR1, we have aligned transmembrane helix 3 of GCR1 to each of the six GPCR classes. Variability comparisons provide additional evidence that GCR1 homologs have the GPCR fold. From the alignments and a GCR1 comparative model, we have identified motifs that are common to GCR1, class A, B, and E GPCRs. We discuss the possibilities that emerge from this controversial evidence that GCR1 has a GPCR fold. PMID:24246381
2016-01-01
Molecular recognition by protein mostly occurs in a local region on the protein surface. Thus, an efficient computational method for accurate characterization of protein local structural conservation is necessary to better understand biology and drug design. We present a novel local structure alignment tool, G‐LoSA. G‐LoSA aligns protein local structures in a sequence order independent way and provides a GA‐score, a chemical feature‐based and size‐independent structure similarity score. Our benchmark validation shows the robust performance of G‐LoSA to the local structures of diverse sizes and characteristics, demonstrating its universal applicability to local structure‐centric comparative biology studies. In particular, G‐LoSA is highly effective in detecting conserved local regions on the entire surface of a given protein. In addition, the applications of G‐LoSA to identifying template ligands and predicting ligand and protein binding sites illustrate its strong potential for computer‐aided drug design. We hope that G‐LoSA can be a useful computational method for exploring interesting biological problems through large‐scale comparison of protein local structures and facilitating drug discovery research and development. G‐LoSA is freely available to academic users at http://im.compbio.ku.edu/GLoSA/. PMID:26813336
SINA: accurate high-throughput multiple sequence alignment of ribosomal RNA genes.
Pruesse, Elmar; Peplies, Jörg; Glöckner, Frank Oliver
2012-07-15
In the analysis of homologous sequences, computation of multiple sequence alignments (MSAs) has become a bottleneck. This is especially troublesome for marker genes like the ribosomal RNA (rRNA) where already millions of sequences are publicly available and individual studies can easily produce hundreds of thousands of new sequences. Methods have been developed to cope with such numbers, but further improvements are needed to meet accuracy requirements. In this study, we present the SILVA Incremental Aligner (SINA) used to align the rRNA gene databases provided by the SILVA ribosomal RNA project. SINA uses a combination of k-mer searching and partial order alignment (POA) to maintain very high alignment accuracy while satisfying high throughput performance demands. SINA was evaluated in comparison with the commonly used high throughput MSA programs PyNAST and mothur. The three BRAliBase III benchmark MSAs could be reproduced with 99.3, 97.6 and 96.1% accuracy. A larger benchmark MSA comprising 38 772 sequences could be reproduced with 98.9 and 99.3% accuracy using reference MSAs comprising 1000 and 5000 sequences. SINA was able to achieve higher accuracy than PyNAST and mothur in all performed benchmarks. Alignment of up to 500 sequences using the latest SILVA SSU/LSU Ref datasets as reference MSA is offered at http://www.arb-silva.de/aligner. This page also links to Linux binaries, user manual and tutorial. SINA is made available under a personal use license.
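The k-mer searching step mentioned in this abstract can be illustrated with a minimal sketch. It is not SINA's implementation: the function names, the choice of k, and the toy sequences are all assumptions made for illustration, and the partial order alignment stage is not shown. The sketch only ranks candidate reference sequences by how many k-mers they share with a query.

```python
def kmers(seq, k):
    """Return the set of all overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def rank_references(query, references, k=4):
    """Rank reference sequences by the number of k-mers shared with the query.

    `references` maps a (hypothetical) sequence name to its unaligned sequence.
    """
    query_kmers = kmers(query, k)
    scores = {name: len(query_kmers & kmers(seq, k))
              for name, seq in references.items()}
    # Highest shared k-mer count first; ties broken by name for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy data, invented for this example.
refs = {
    "ref_a": "ACGUACGUAGCUAGCU",
    "ref_b": "UUUUGGGGCCCCAAAA",
}
ranking = rank_references("ACGUACGUAGC", refs, k=4)
print(ranking)
```

In a pipeline of this kind, the top-ranked references would then be handed to the more expensive alignment stage.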
Jacquin, Hugo; Gilson, Amy; Shakhnovich, Eugene; Cocco, Simona; Monasson, Rémi
2016-05-01
Inverse statistical approaches to determine protein structure and function from Multiple Sequence Alignments (MSA) are emerging as powerful tools in computational biology. However, the underlying assumptions of the relationship between the inferred effective Potts Hamiltonian and real protein structure and energetics remain untested so far. Here we use a lattice protein (LP) model to benchmark those inverse statistical approaches. We build MSA of highly stable sequences in target LP structures, and infer the effective pairwise Potts Hamiltonians from those MSA. We find that inferred Potts Hamiltonians reproduce many important aspects of 'true' LP structures and energetics. Careful analysis reveals that effective pairwise couplings in inferred Potts Hamiltonians depend not only on the energetics of the native structure but also on competing folds; in particular, the coupling values reflect both positive design (stabilization of native conformation) and negative design (destabilization of competing folds). In addition to providing detailed structural information, the inferred Potts models used as protein Hamiltonian for design of new sequences are able to generate with high probability completely new sequences with the desired folds, which is not possible using independent-site models. Those are remarkable results as the effective LP Hamiltonians used to generate MSA are not simple pairwise models due to the competition between the folds. Our findings elucidate the reasons for the success of inverse approaches to the modelling of proteins from sequence data, and their limitations.
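For orientation, a pairwise Potts Hamiltonian over a sequence of length L is commonly written in the form below. The symbols h_i (site fields) and J_ij (pairwise couplings) are standard notation assumed here, not taken from the abstract, and sign conventions differ between papers. An independent-site model corresponds to setting all J_ij to zero.

```latex
H(a_1,\dots,a_L) = -\sum_{i=1}^{L} h_i(a_i) \;-\; \sum_{1 \le i < j \le L} J_{ij}(a_i, a_j),
\qquad
P(a_1,\dots,a_L) = \frac{1}{Z}\, e^{-H(a_1,\dots,a_L)}
```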
Discovering Sequence Motifs with Arbitrary Insertions and Deletions
Frith, Martin C.; Saunders, Neil F. W.; Kobe, Bostjan; Bailey, Timothy L.
2008-01-01
Biology is encoded in molecular sequences: deciphering this encoding remains a grand scientific challenge. Functional regions of DNA, RNA, and protein sequences often exhibit characteristic but subtle motifs; thus, computational discovery of motifs in sequences is a fundamental and much-studied problem. However, most current algorithms do not allow for insertions or deletions (indels) within motifs, and the few that do have other limitations. We present a method, GLAM2 (Gapped Local Alignment of Motifs), for discovering motifs allowing indels in a fully general manner, and a companion method GLAM2SCAN for searching sequence databases using such motifs. GLAM2 is a generalization of the gapless Gibbs sampling algorithm. It re-discovers variable-width protein motifs from the PROSITE database significantly more accurately than the alternative methods PRATT and SAM-T2K. Furthermore, it usefully refines protein motifs from the ELM database: in some cases, the refined motifs make orders of magnitude fewer overpredictions than the original ELM regular expressions. GLAM2 performs respectably on the BAliBASE multiple alignment benchmark, and may be superior to leading multiple alignment methods for “motif-like” alignments with N- and C-terminal extensions. Finally, we demonstrate the use of GLAM2 to discover protein kinase substrate motifs and a gapped DNA motif for the LIM-only transcriptional regulatory complex: using GLAM2SCAN, we identify promising targets for the latter. GLAM2 is especially promising for short protein motifs, and it should improve our ability to identify the protein cleavage sites, interaction sites, post-translational modification attachment sites, etc., that underlie much of biology. It may be equally useful for arbitrarily gapped motifs in DNA and RNA, although fewer examples of such motifs are known at present. GLAM2 is public domain software, available for download at http://bioinformatics.org.au/glam2. PMID:18437229
Rclick: a web server for comparison of RNA 3D structures.
Nguyen, Minh N; Verma, Chandra
2015-03-15
RNA molecules play important roles in key biological processes in the cell and are becoming attractive for developing therapeutic applications. Since the function of RNA depends on its structure and dynamics, comparing and classifying the RNA 3D structures is of crucial importance to molecular biology. In this study, we have developed Rclick, a web server that is capable of superimposing RNA 3D structures by using clique matching and 3D least-squares fitting. Our server Rclick has been benchmarked and compared with other popular servers and methods for RNA structural alignments. In most cases, Rclick alignments were better in terms of structure overlap. Our server also recognizes conformational changes between structures. For this purpose, the server produces complementary alignments to maximize the extent of detectable similarity. Various examples showcase the utility of our web server for comparison of RNA, RNA-protein complexes and RNA-ligand structures. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
Sousa, Filipa L; Parente, Daniel J; Shis, David L; Hessman, Jacob A; Chazelle, Allen; Bennett, Matthew R; Teichmann, Sarah A; Swint-Kruse, Liskin
2016-02-22
Protein families evolve functional variation by accumulating point mutations at functionally important amino acid positions. Homologs in the LacI/GalR family of transcription regulators have evolved to bind diverse DNA sequences and allosteric regulatory molecules. In addition to playing key roles in bacterial metabolism, these proteins have been widely used as a model family for benchmarking structural and functional prediction algorithms. We have collected manually curated sequence alignments for >3000 sequences, in vivo phenotypic and biochemical data for >5750 LacI/GalR mutational variants, and noncovalent residue contact networks for 65 LacI/GalR homolog structures. Using this rich data resource, we compared the noncovalent residue contact networks of the LacI/GalR subfamilies to design and experimentally validate an allosteric mutant of a synthetic LacI/GalR repressor for use in biotechnology. The AlloRep database (freely available at www.AlloRep.org) is a key resource for future evolutionary studies of LacI/GalR homologs and for benchmarking computational predictions of functional change. Copyright © 2015 Elsevier Ltd. All rights reserved.
A Stochastic Point Cloud Sampling Method for Multi-Template Protein Comparative Modeling.
Li, Jilong; Cheng, Jianlin
2016-05-10
Generating tertiary structural models for a target protein from the known structure of its homologous template proteins and their pairwise sequence alignment is a key step in protein comparative modeling. Here, we developed a new stochastic point cloud sampling method, called MTMG, for multi-template protein model generation. The method first superposes the backbones of template structures, and the Cα atoms of the superposed templates form a point cloud for each position of a target protein, which are represented by a three-dimensional multivariate normal distribution. MTMG stochastically resamples the positions for Cα atoms of the residues whose positions are uncertain from the distribution, and accepts or rejects new position according to a simulated annealing protocol, which effectively removes atomic clashes commonly encountered in multi-template comparative modeling. We benchmarked MTMG on 1,033 sequence alignments generated for CASP9, CASP10 and CASP11 targets, respectively. Using multiple templates with MTMG improves the GDT-TS score and TM-score of structural models by 2.96-6.37% and 2.42-5.19% on the three datasets over using single templates. MTMG's performance was comparable to Modeller in terms of GDT-TS score, TM-score, and GDT-HA score, while the average RMSD was improved by a new sampling approach. The MTMG software is freely available at: http://sysbio.rnet.missouri.edu/multicom_toolbox/mtmg.html.
A Stochastic Point Cloud Sampling Method for Multi-Template Protein Comparative Modeling
Li, Jilong; Cheng, Jianlin
2016-01-01
Generating tertiary structural models for a target protein from the known structure of its homologous template proteins and their pairwise sequence alignment is a key step in protein comparative modeling. Here, we developed a new stochastic point cloud sampling method, called MTMG, for multi-template protein model generation. The method first superposes the backbones of template structures, and the Cα atoms of the superposed templates form a point cloud for each position of a target protein, which are represented by a three-dimensional multivariate normal distribution. MTMG stochastically resamples the positions for Cα atoms of the residues whose positions are uncertain from the distribution, and accepts or rejects new position according to a simulated annealing protocol, which effectively removes atomic clashes commonly encountered in multi-template comparative modeling. We benchmarked MTMG on 1,033 sequence alignments generated for CASP9, CASP10 and CASP11 targets, respectively. Using multiple templates with MTMG improves the GDT-TS score and TM-score of structural models by 2.96–6.37% and 2.42–5.19% on the three datasets over using single templates. MTMG’s performance was comparable to Modeller in terms of GDT-TS score, TM-score, and GDT-HA score, while the average RMSD was improved by a new sampling approach. The MTMG software is freely available at: http://sysbio.rnet.missouri.edu/multicom_toolbox/mtmg.html. PMID:27161489
ERIC Educational Resources Information Center
Barbour, Ross; Ostler, Catherine; Templeman, Elizabeth; West, Elizabeth
2007-01-01
The British Columbia (BC) English as a Second Language (ESL) Articulation Committee's Canadian Language Benchmarks project was precipitated by ESL instructors' desire to address transfer difficulties of ESL students within the BC transfer system and to respond to the recognition that the Canadian Language Benchmarks, a descriptive scale of ESL…
Lee, Seung Yup; Skolnick, Jeffrey
2007-07-01
To improve the accuracy of TASSER models especially in the limit where threading provided template alignments are of poor quality, we have developed the TASSER(iter) algorithm which uses the templates and contact restraints from TASSER generated models for iterative structure refinement. We apply TASSER(iter) to a large benchmark set of 2,773 nonhomologous single domain proteins that are ≤ 200 in length and that cover the PDB at the level of 35% pairwise sequence identity. Overall, TASSER(iter) models have a smaller global average RMSD of 5.48 Å compared to 5.81 Å RMSD of the original TASSER models. Classifying the targets by the level of prediction difficulty (where Easy targets have a good template with a corresponding good threading alignment, Medium targets have a good template but a poor alignment, and Hard targets have an incorrectly identified template), TASSER(iter) (TASSER) models have an average RMSD of 4.15 Å (4.35 Å) for the Easy set and 9.05 Å (9.52 Å) for the Hard set. The largest reduction of average RMSD is for the Medium set where the TASSER(iter) models have an average global RMSD of 5.67 Å compared to 6.72 Å of the TASSER models. Seventy percent of the Medium set TASSER(iter) models have a smaller RMSD than the TASSER models, while 63% of the Easy and 60% of the Hard TASSER models are improved by TASSER(iter). For the foldable cases, where the targets have a RMSD to the native <6.5 Å, TASSER(iter) shows obvious improvement over TASSER models: For the Medium set, it improves the success rate from 57.0 to 67.2%, followed by the Hard targets where the success rate improves from 32.0 to 34.8%, with the smallest improvement in the Easy targets from 82.6 to 84.0%. These results suggest that TASSER(iter) can provide more reliable predictions for targets of Medium difficulty, a range that had resisted improvement in the quality of protein structure predictions. 2007 Wiley-Liss, Inc.
LC-MSsim – a simulation software for liquid chromatography mass spectrometry data
Schulz-Trieglaff, Ole; Pfeifer, Nico; Gröpl, Clemens; Kohlbacher, Oliver; Reinert, Knut
2008-01-01
Background Mass Spectrometry coupled to Liquid Chromatography (LC-MS) is commonly used to analyze the protein content of biological samples in large scale studies. The data resulting from an LC-MS experiment is huge, highly complex and noisy. Accordingly, it has sparked new developments in Bioinformatics, especially in the fields of algorithm development, statistics and software engineering. In a quantitative label-free mass spectrometry experiment, crucial steps are the detection of peptide features in the mass spectra and the alignment of samples by correcting for shifts in retention time. At the moment, it is difficult to compare the plethora of algorithms for these tasks. So far, curated benchmark data exists only for peptide identification algorithms but no data that represents a ground truth for the evaluation of feature detection, alignment and filtering algorithms. Results We present LC-MSsim, a simulation software for LC-ESI-MS experiments. It simulates ESI spectra on the MS level. It reads a list of proteins from a FASTA file and digests the protein mixture using a user-defined enzyme. The software creates an LC-MS data set using a predictor for the retention time of the peptides and a model for peak shapes and elution profiles of the mass spectral peaks. Our software also offers the possibility to add contaminants, to change the background noise level and includes a model for the detectability of peptides in mass spectra. After the simulation, LC-MSsim writes the simulated data to mzData, a public XML format. The software also stores the positions (monoisotopic m/z and retention time) and ion counts of the simulated ions in separate files. Conclusion LC-MSsim generates simulated LC-MS data sets and incorporates models for peak shapes and contaminations. Algorithm developers can match the results of feature detection and alignment algorithms against the simulated ion lists and meaningful error rates can be computed. 
We anticipate that LC-MSsim will be useful to the wider community to perform benchmark studies and comparisons between computational tools. PMID:18842122
Designing and benchmarking the MULTICOM protein structure prediction system
2013-01-01
Background Predicting protein structure from sequence is one of the most significant and challenging problems in bioinformatics. Numerous bioinformatics techniques and tools have been developed to tackle almost every aspect of protein structure prediction ranging from structural feature prediction, template identification and query-template alignment to structure sampling, model quality assessment, and model refinement. How to synergistically select, integrate and improve the strengths of the complementary techniques at each prediction stage and build a high-performance system is becoming a critical issue for constructing a successful, competitive protein structure predictor. Results Over the past several years, we have constructed a standalone protein structure prediction system MULTICOM that combines multiple sources of information and complementary methods at all five stages of the protein structure prediction process including template identification, template combination, model generation, model assessment, and model refinement. The system was blindly tested during the ninth Critical Assessment of Techniques for Protein Structure Prediction (CASP9) in 2010 and yielded very good performance. In addition to studying the overall performance on the CASP9 benchmark, we thoroughly investigated the performance and contributions of each component at each stage of prediction. Conclusions Our comprehensive and comparative study not only provides useful and practical insights about how to select, improve, and integrate complementary methods to build a cutting-edge protein structure prediction system but also identifies a few new sources of information that may help improve the design of a protein structure prediction system. Several components used in the MULTICOM system are available at: http://sysbio.rnet.missouri.edu/multicom_toolbox/. PMID:23442819
Di Tommaso, Paolo; Orobitg, Miquel; Guirado, Fernando; Cores, Fernado; Espinosa, Toni; Notredame, Cedric
2010-08-01
We present the first parallel implementation of the T-Coffee consistency-based multiple aligner. We benchmark it on the Amazon Elastic Cloud (EC2) and show that the parallelization procedure is reasonably effective. We also conclude that for a web server with moderate usage (10K hits/month) the cloud provides a cost-effective alternative to in-house deployment. T-Coffee is a freeware open source package available from http://www.tcoffee.org/homepage.html
RBT-GA: a novel metaheuristic for solving the Multiple Sequence Alignment problem.
Taheri, Javid; Zomaya, Albert Y
2009-07-07
Multiple Sequence Alignment (MSA) has always been an active area of research in Bioinformatics. MSA is mainly focused on discovering biologically meaningful relationships among different sequences or proteins in order to investigate the underlying main characteristics/functions. This information is also used to generate phylogenetic trees. This paper presents a novel approach, namely RBT-GA, to solve the MSA problem using a hybrid solution methodology combining the Rubber Band Technique (RBT) and the Genetic Algorithm (GA) metaheuristic. RBT is inspired by the behavior of an elastic Rubber Band (RB) on a plate with several poles, which is analogous to locations in the input sequences that could potentially be biologically related. A GA attempts to mimic the evolutionary processes of life in order to locate optimal solutions in an often very complex landscape. RBT-GA is a population based optimization algorithm designed to find the optimal alignment for a set of input protein sequences. In this novel technique, each alignment answer is modeled as a chromosome consisting of several poles in the RBT framework. These poles resemble locations in the input sequences that are most likely to be correlated and/or biologically related. A GA-based optimization process improves these chromosomes gradually, yielding a set of mostly optimal answers for the MSA problem. RBT-GA is tested with one of the well-known benchmark suites (BAliBASE 2.0) in this area. The obtained results show the superiority of the proposed technique even in the case of formidable sequences.
Alignment methods: strategies, challenges, benchmarking, and comparative overview.
Löytynoja, Ari
2012-01-01
Comparative evolutionary analyses of molecular sequences are solely based on the identities and differences detected between homologous characters. Errors in this homology statement, that is errors in the alignment of the sequences, are likely to lead to errors in the downstream analyses. Sequence alignment and phylogenetic inference are tightly connected and many popular alignment programs use the phylogeny to divide the alignment problem into smaller tasks. They then neglect the phylogenetic tree, however, and produce alignments that are not evolutionarily meaningful. The use of phylogeny-aware methods reduces the error but the resulting alignments, with evolutionarily correct representation of homology, can challenge the existing practices and methods for viewing and visualising the sequences. The inter-dependency of alignment and phylogeny can be resolved by joint estimation of the two; methods based on statistical models allow for inferring the alignment parameters from the data and correctly take into account the uncertainty of the solution but remain computationally challenging. Widely used alignment methods are based on heuristic algorithms and unlikely to find globally optimal solutions. The whole concept of one correct alignment for the sequences is questionable, however, as there typically exist vast numbers of alternative, roughly equally good alignments that should also be considered. This uncertainty is hidden by many popular alignment programs and is rarely correctly taken into account in the downstream analyses. The quest for finding and improving the alignment solution is complicated by the lack of suitable measures of alignment goodness. The difficulty of comparing alternative solutions also affects benchmarks of alignment methods and the results strongly depend on the measure used. As the effects of alignment error cannot be predicted, comparing the alignments' performance in downstream analyses is recommended.
Ortuño, Francisco M; Valenzuela, Olga; Rojas, Fernando; Pomares, Hector; Florido, Javier P; Urquiza, Jose M; Rojas, Ignacio
2013-09-01
Multiple sequence alignments (MSAs) are widely used approaches in bioinformatics to carry out other tasks such as structure predictions, biological function analyses or phylogenetic modeling. However, current tools usually provide partially optimal alignments, as each one is focused on specific biological features. Thus, the same set of sequences can produce different alignments, above all when sequences are less similar. Consequently, researchers and biologists do not agree about which is the most suitable way to evaluate MSAs. Recent evaluations tend to use more complex scores including further biological features. Among them, 3D structures are increasingly being used to evaluate alignments. Because structures are more conserved in proteins than sequences, scores with structural information are better suited to evaluate more distant relationships between sequences. The proposed multiobjective algorithm, based on the non-dominated sorting genetic algorithm, aims to jointly optimize three objectives: STRIKE score, non-gaps percentage and totally conserved columns. It was assessed on the BAliBASE benchmark, with significance according to the Kruskal-Wallis test (P < 0.01). This algorithm also outperforms other aligners, such as ClustalW, Multiple Sequence Alignment Genetic Algorithm (MSA-GA), PRRP, DIALIGN, Hidden Markov Model Training (HMMT), Pattern-Induced Multi-sequence Alignment (PIMA), MULTIALIGN, Sequence Alignment Genetic Algorithm (SAGA), PILEUP, Rubber Band Technique Genetic Algorithm (RBT-GA) and Vertical Decomposition Genetic Algorithm (VDGA), according to the Wilcoxon signed-rank test (P < 0.05), whereas its results are not significantly different from those of 3D-COFFEE (P > 0.05), with the advantage of being able to use fewer structures. Structural information is included within the objective function to evaluate more accurately the obtained alignments. The source code is available at http://www.ugr.es/~fortuno/MOSAStrE/MO-SAStrE.zip.
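Two of the three objectives named in this abstract, non-gaps percentage and totally conserved columns, are simple enough to sketch. The definitions below are a plausible reading of those names rather than the authors' code, the function names and toy alignment are invented for illustration, and the STRIKE score is not shown because it requires structural data.

```python
def non_gap_percentage(alignment, gap="-"):
    """Percentage of alignment cells that hold a residue rather than a gap."""
    total = sum(len(row) for row in alignment)
    non_gaps = sum(1 for row in alignment for ch in row if ch != gap)
    return 100.0 * non_gaps / total


def totally_conserved_columns(alignment, gap="-"):
    """Number of columns in which every sequence has the same residue."""
    count = 0
    for column in zip(*alignment):
        if gap not in column and len(set(column)) == 1:
            count += 1
    return count


# Toy alignment (rows are aligned sequences of equal length).
msa = [
    "MK-LV",
    "MKALV",
    "MR-LV",
]
print(non_gap_percentage(msa), totally_conserved_columns(msa))
```

A multiobjective optimizer would treat such values as separate objectives to trade off against each other rather than collapsing them into one weighted score.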
SWPS3 - fast multi-threaded vectorized Smith-Waterman for IBM Cell/B.E. and x86/SSE2.
Szalkowski, Adam; Ledergerber, Christian; Krähenbühl, Philipp; Dessimoz, Christophe
2008-10-29
We present swps3, a vectorized implementation of the Smith-Waterman local alignment algorithm optimized for both the Cell/BE and x86 architectures. The paper describes swps3 and compares its performances with several other implementations. Our benchmarking results show that swps3 is currently the fastest implementation of a vectorized Smith-Waterman on the Cell/BE, outperforming the only other known implementation by a factor of at least 4: on a Playstation 3, it achieves up to 8.0 billion cell-updates per second (GCUPS). Using the SSE2 instruction set, a quad-core Intel Pentium can reach 15.7 GCUPS. We also show that swps3 on this CPU is faster than a recent GPU implementation. Finally, we note that under some circumstances, alignments are computed at roughly the same speed as BLAST, a heuristic method. The Cell/BE can be a powerful platform to align biological sequences. Besides, the performance gap between exact and heuristic methods has almost disappeared, especially for long protein sequences.
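As a point of reference for the figures quoted here, the following is a plain scalar Smith-Waterman scorer together with the GCUPS metric (billions of dynamic-programming cell updates per second). It is a sketch only: it uses a linear gap penalty and a simple match/mismatch score chosen for illustration, and it has none of the vectorization or multi-threading that swps3 is about.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def gcups(query_len, db_len, seconds):
    """Billions of cell updates per second for one query against a database."""
    return query_len * db_len / seconds / 1e9


print(smith_waterman_score("ACGT", "ACGT"))
print(gcups(1000, 8_000_000, 1.0))
```

Each cell depends only on its left, upper, and upper-left neighbours, which is what makes the recurrence amenable to SIMD implementations of the kind the abstract describes.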
SWPS3 – fast multi-threaded vectorized Smith-Waterman for IBM Cell/B.E. and x86/SSE2
Szalkowski, Adam; Ledergerber, Christian; Krähenbühl, Philipp; Dessimoz, Christophe
2008-01-01
Background We present swps3, a vectorized implementation of the Smith-Waterman local alignment algorithm optimized for both the Cell/BE and x86 architectures. The paper describes swps3 and compares its performances with several other implementations. Findings Our benchmarking results show that swps3 is currently the fastest implementation of a vectorized Smith-Waterman on the Cell/BE, outperforming the only other known implementation by a factor of at least 4: on a Playstation 3, it achieves up to 8.0 billion cell-updates per second (GCUPS). Using the SSE2 instruction set, a quad-core Intel Pentium can reach 15.7 GCUPS. We also show that swps3 on this CPU is faster than a recent GPU implementation. Finally, we note that under some circumstances, alignments are computed at roughly the same speed as BLAST, a heuristic method. Conclusion The Cell/BE can be a powerful platform to align biological sequences. Besides, the performance gap between exact and heuristic methods has almost disappeared, especially for long protein sequences. PMID:18959793
Hu, Jun; Liu, Zi; Yu, Dong-Jun; Zhang, Yang
2018-02-15
Sequence-order independent structural comparison, also called structural alignment, of small ligand molecules is often needed for computer-aided virtual drug screening. Although many ligand structure alignment programs have been proposed, most of them build the alignments based on rigid-body shape comparison, which can neither provide atom-specific alignment information nor allow structural variation; both abilities are critical to efficient high-throughput virtual screening. We propose a novel ligand comparison algorithm, LS-align, to generate fast and accurate atom-level structural alignments of ligand molecules, through an iterative heuristic search of the target function that combines inter-atom distance with mass and chemical bond comparisons. LS-align contains two modules, Rigid-LS-align and Flexi-LS-align, designed for rigid-body and flexible alignments, respectively, where a ligand-size independent, statistics-based scoring function is developed to evaluate the similarity of ligand molecules relative to random ligand pairs. Large-scale benchmark tests are performed on prioritizing chemical ligands of 102 protein targets involving 1,415,871 candidate compounds from the DUD-E (Database of Useful Decoys: Enhanced) database, where LS-align achieves an average enrichment factor (EF) of 22.0 at the 1% cutoff and an AUC score of 0.75, which are significantly higher than other state-of-the-art methods. Detailed data analyses show that the advanced performance is mainly attributed to the design of the target function, which combines structural and chemical information to enhance the sensitivity of recognizing subtle differences between ligand molecules, and to the introduction of structural flexibility, which helps capture the conformational changes induced by the ligand-receptor binding interactions. These data demonstrate a new avenue to improve the virtual screening efficiency through the development of sensitive ligand structural alignments. http://zhanglab.ccmb.med.umich.edu/LS-align/. 
njyudj@njust.edu.cn or zhng@umich.edu. Supplementary data are available at Bioinformatics online.
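The enrichment factor quoted in this abstract is conventionally the hit rate among the top-scored fraction of a screened library divided by the hit rate of the whole library. The sketch below assumes that standard definition; the function name, the rounding of the cutoff, and the toy library are illustrative assumptions, not details from the paper.

```python
def enrichment_factor(scores, labels, fraction=0.01):
    """EF at `fraction`: active rate in the top-scored subset over the overall active rate.

    `labels` holds 1 for an active compound and 0 for a decoy.
    """
    n = len(scores)
    n_top = max(1, int(round(n * fraction)))
    # Rank compounds from highest to lowest similarity score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    actives_top = sum(label for _, label in ranked[:n_top])
    actives_total = sum(labels)
    return (actives_top / n_top) / (actives_total / n)


# Toy library: 200 compounds, 10 actives, with the actives scored highest.
labels = [1] * 10 + [0] * 190
scores = [200 - i for i in range(200)]
ef = enrichment_factor(scores, labels, fraction=0.01)
print(ef)
```

With this definition, an EF of 1 means the ranking is no better than random selection, and the maximum attainable value depends on the ratio of actives to decoys in the library.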
The Alignment of easyCBM[R] Math Measures to Curriculum Standards. Technical Report #1002
ERIC Educational Resources Information Center
Nese, Joseph F. T.; Lai, Cheng-Fei; Anderson, Daniel; Park, Bitnara Jasmine; Tindal, Gerald; Alonzo, Julie
2010-01-01
The purpose of this study was to examine the alignment of the easyCBM[R] mathematics benchmark and progress monitoring measures to the National Council of Teachers of Mathematics "Curriculum Focal Points" (NCTM, 2006). Based on Webb's alignment model (1997, 2002), we collected expert judgments on individual math items across a sampling of forms…
Organizational Alignment Supporting Distance Education in Post-Secondary Institutions.
ERIC Educational Resources Information Center
Prestera, Gustavo E.; Moller, Leslie A.
2001-01-01
Applies an established model of organizational alignment to distance education in postsecondary institutions and recommends performance-oriented approaches to support growth by analyzing goals, structure, and management practices across the organization. Presents performance improvement strategies such as benchmarking and documenting workflows,…
Liu, Bin; Wang, Xiaolong; Lin, Lei; Dong, Qiwen; Wang, Xuan
2008-12-01
Protein remote homology detection and fold recognition are central problems in bioinformatics. Currently, discriminative methods based on support vector machine (SVM) are the most effective and accurate methods for solving these problems. A key step to improve the performance of the SVM-based methods is to find a suitable representation of protein sequences. In this paper, a novel building block of proteins called Top-n-grams is presented, which contains the evolutionary information extracted from the protein sequence frequency profiles. The protein sequence frequency profiles are calculated from the multiple sequence alignments outputted by PSI-BLAST and converted into Top-n-grams. The protein sequences are transformed into fixed-dimension feature vectors by the occurrence times of each Top-n-gram. The training vectors are evaluated by SVM to train classifiers which are then used to classify the test protein sequences. We demonstrate that the prediction performance of remote homology detection and fold recognition can be improved by combining Top-n-grams and latent semantic analysis (LSA), which is an efficient feature extraction technique from natural language processing. When tested on superfamily and fold benchmarks, the method combining Top-n-grams and LSA gives significantly better results compared to related methods. The method based on Top-n-grams significantly outperforms the methods based on many other building blocks including N-grams, patterns, motifs and binary profiles. Therefore, Top-n-gram is a good building block of the protein sequences and can be widely used in many tasks of the computational biology, such as the sequence alignment, the prediction of domain boundary, the designation of knowledge-based potentials and the prediction of protein binding sites.
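The conversion from a frequency profile to a fixed-dimension count vector can be sketched as follows. This is one plausible reading of the abstract, not the authors' code: it assumes a Top-n-gram is formed at each profile position by concatenating the n most frequent amino acids in descending frequency order, and that the vector is indexed over all 20^n residue strings. The function name, the dict-based profile format, and the alphabetical tie-break are illustrative assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def top_n_gram_vector(profile, n=1):
    """Count Top-n-gram occurrences over a frequency profile.

    profile: list of dicts (one per sequence position) mapping residue -> frequency.
    Returns a list of length 20**n with the occurrence count of each Top-n-gram.
    """
    # Fixed vocabulary so every sequence maps to the same dimensions.
    vocab = {"".join(p): i for i, p in enumerate(product(AMINO_ACIDS, repeat=n))}
    vector = [0] * len(vocab)
    for column in profile:
        # Most frequent residues first; ties broken alphabetically (assumption).
        ranked = sorted(AMINO_ACIDS, key=lambda aa: (-column.get(aa, 0.0), aa))
        token = "".join(ranked[:n])
        vector[vocab[token]] += 1
    return vector


# Three-position toy profile (frequencies would come from a PSI-BLAST alignment).
toy_profile = [
    {"A": 0.6, "G": 0.3, "L": 0.1},
    {"G": 0.7, "A": 0.2},
    {"A": 0.5, "L": 0.4},
]
v1 = top_n_gram_vector(toy_profile, n=1)
v2 = top_n_gram_vector(toy_profile, n=2)
```

Vectors of this kind could then be passed to any SVM implementation; the latent semantic analysis step mentioned in the abstract is not shown here.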
RBT-GA: a novel metaheuristic for solving the multiple sequence alignment problem
Taheri, Javid; Zomaya, Albert Y
2009-01-01
Background Multiple Sequence Alignment (MSA) has always been an active area of research in Bioinformatics. MSA is mainly focused on discovering biologically meaningful relationships among different sequences or proteins in order to investigate the underlying main characteristics/functions. This information is also used to generate phylogenetic trees. Results This paper presents a novel approach, namely RBT-GA, to solve the MSA problem using a hybrid solution methodology combining the Rubber Band Technique (RBT) and the Genetic Algorithm (GA) metaheuristic. RBT is inspired by the behavior of an elastic Rubber Band (RB) on a plate with several poles, which is analogous to locations in the input sequences that could potentially be biologically related. A GA attempts to mimic the evolutionary processes of life in order to locate optimal solutions in an often very complex landscape. RBT-GA is a population-based optimization algorithm designed to find the optimal alignment for a set of input protein sequences. In this novel technique, each alignment answer is modeled as a chromosome consisting of several poles in the RBT framework. These poles resemble locations in the input sequences that are most likely to be correlated and/or biologically related. A GA-based optimization process improves these chromosomes gradually, yielding a set of mostly optimal answers for the MSA problem. Conclusion RBT-GA is tested with one of the well-known benchmark suites (BAliBASE 2.0) in this area. The obtained results show the superiority of the proposed technique even in the case of formidable sequences. PMID:19594869
Acceleration of the Smith-Waterman algorithm using single and multiple graphics processors
NASA Astrophysics Data System (ADS)
Khajeh-Saeed, Ali; Poole, Stephen; Blair Perot, J.
2010-06-01
Finding regions of similarity between two very long data streams is a computationally intensive problem referred to as sequence alignment. Alignment algorithms must allow for imperfect sequence matching with different starting locations and some gaps and errors between the two data sequences. Perhaps the most well known application of sequence matching is the testing of DNA or protein sequences against genome databases. The Smith-Waterman algorithm is a method for precisely characterizing how well two sequences can be aligned and for determining the optimal alignment of those two sequences. Like many applications in computational science, the Smith-Waterman algorithm is constrained by the memory access speed and can be accelerated significantly by using graphics processors (GPUs) as the compute engine. In this work we show that effective use of the GPU requires a novel reformulation of the Smith-Waterman algorithm. The performance of this new version of the algorithm is demonstrated using the SSCA#1 (Bioinformatics) benchmark running on one GPU and on up to four GPUs executing in parallel. The results indicate that for large problems a single GPU is up to 45 times faster than a CPU for this application, and the parallel implementation shows linear speed up on up to 4 GPUs.
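For reference, the scoring recurrence that the Smith-Waterman algorithm evaluates can be sketched in a few lines. This is the textbook serial formulation with a linear gap penalty, not the GPU reformulation described in the abstract; the scoring values (`match=2`, `mismatch=-1`, `gap=-1`) are illustrative assumptions.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


score = smith_waterman_score("GGACGTCC", "TTACGTAA")
```

Each cell depends on its left, upper, and upper-left neighbours, which is the data dependency that makes a naive port to parallel hardware difficult. Only two rows are kept in memory because the sketch returns the score alone; recovering the alignment itself would require a traceback over the full table.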
Gatto, Alberto; Torroja-Fungairiño, Carlos; Mazzarotto, Francesco; Cook, Stuart A; Barton, Paul J R; Sánchez-Cabo, Fátima; Lara-Pezzi, Enrique
2014-04-01
Alternative splicing is the main mechanism governing protein diversity. The recent developments in RNA-Seq technology have enabled the study of the global impact and regulation of this biological process. However, the lack of standardized protocols constitutes a major bottleneck in the analysis of alternative splicing. This is particularly important for the identification of exon-exon junctions, which is a critical step in any analysis workflow. Here we performed a systematic benchmarking of alignment tools to dissect the impact of design and method on the mapping, detection and quantification of splice junctions from multi-exon reads. Accordingly, we devised a novel pipeline based on TopHat2 combined with a splice junction detection algorithm, which we have named FineSplice. FineSplice allows effective elimination of spurious junction hits arising from artefactual alignments, achieving up to 99% precision in both real and simulated data sets and yielding superior F1 scores under most tested conditions. The proposed strategy conjugates an efficient mapping solution with a semi-supervised anomaly detection scheme to filter out false positives and allows reliable estimation of expressed junctions from the alignment output. Ultimately this provides more accurate information to identify meaningful splicing patterns. FineSplice is freely available at https://sourceforge.net/p/finesplice/.
ERIC Educational Resources Information Center
Henderson, Susan; Petrosino, Anthony; Guckenburg, Sarah; Hamilton, Stephen
2008-01-01
This technical brief examines whether, after two years of implementation, schools in Massachusetts using quarterly benchmark exams aligned with state standards in middle school mathematics showed greater gains in student achievement than those not doing so. A quasi-experimental design, using covariate matching and comparative interrupted…
Shape-Based Virtual Screening with Volumetric Aligned Molecular Shapes
Koes, David Ryan; Camacho, Carlos J.
2014-01-01
Shape-based virtual screening is an established and effective method for identifying small molecules that are similar in shape and function to a reference ligand. We describe a new method of shape-based virtual screening, volumetric aligned molecular shapes (VAMS). VAMS uses efficient data structures to encode and search molecular shapes. We demonstrate that VAMS is an effective method for shape-based virtual screening and that it can be successfully used as a pre-filter to accelerate more computationally demanding search algorithms. Unique to VAMS is a novel minimum/maximum shape constraint query for precisely specifying the desired molecular shape. Shape constraint searches in VAMS are particularly efficient and millions of shapes can be searched in a fraction of a second. We compare the performance of VAMS with two other shape-based virtual screening algorithms on a benchmark of 102 protein targets consisting of more than 32 million molecular shapes and find that VAMS provides a competitive trade-off between run-time performance and virtual screening performance. PMID:25049193
Accelerating the Pace of Protein Functional Annotation With Intel Xeon Phi Coprocessors.
Feinstein, Wei P; Moreno, Juana; Jarrell, Mark; Brylinski, Michal
2015-06-01
Intel Xeon Phi is a new addition to the family of powerful parallel accelerators. The range of its potential applications in computationally driven research is broad; however, at present, the repository of scientific codes is still relatively limited. In this study, we describe the development and benchmarking of a parallel version of eFindSite, a structural bioinformatics algorithm for the prediction of ligand-binding sites in proteins. Implemented for the Intel Xeon Phi platform, the parallelization of the structure alignment portion of eFindSite using pragma-based OpenMP brings about the desired performance improvements, which scale well with the number of computing cores. Compared to a serial version, the parallel code runs 11.8 and 10.1 times faster on the CPU and the coprocessor, respectively; when both resources are utilized simultaneously, the speedup is 17.6. For example, ligand-binding predictions for 501 benchmarking proteins are completed in 2.1 hours on a single Stampede node equipped with the Intel Xeon Phi card compared to 3.1 hours without the accelerator and 36.8 hours required by a serial version. In addition to the satisfactory parallel performance, porting existing scientific codes to the Intel Xeon Phi architecture is relatively straightforward with a short development time due to the support of common parallel programming models by the coprocessor. The parallel version of eFindSite is freely available to the academic community at www.brylinski.org/efindsite.
NASA Astrophysics Data System (ADS)
Stefan Devlin, Benjamin; Nakura, Toru; Ikeda, Makoto; Asada, Kunihiro
We detail a self synchronous field programmable gate array (SSFPGA) with dual-pipeline (DP) architecture to conceal pre-charge time for dynamic logic, and its throughput optimization by using pipeline alignment implemented on benchmark circuits. A self synchronous LUT (SSLUT) consists of a three input tree-type structure with 8bits of SRAM for programming. A self synchronous switch box (SSSB) consists of both pass transistors and buffers to route signals, with 12bits of SRAM. One common block with one SSLUT and one SSSB occupies 2.2Mλ² area with 35bits of SRAM, and the prototype SSFPGA with 34 × 30 (1020) blocks is designed and fabricated using 65nm CMOS. Measured results show at 1.2V 430MHz and 647MHz operation for a 3bit ripple carry adder, without and with throughput optimization, respectively. We find that using the proposed pipeline alignment techniques we can perform at maximum throughput of 647MHz in various benchmarks on the SSFPGA. We demonstrate up to 56.1 times throughput improvement with our pipeline alignment techniques. The pipeline alignment is carried out within the number of logic elements in the array and pipeline buffers in the switching matrix.
Nielsen, Morten; Lundegaard, Claus; Lund, Ole
2007-01-01
Background Antigen presenting cells (APCs) sample the extra cellular space and present peptides from here to T helper cells, which can be activated if the peptides are of foreign origin. The peptides are presented on the surface of the cells in complex with major histocompatibility class II (MHC II) molecules. Identification of peptides that bind MHC II molecules is thus a key step in rational vaccine design, and developing methods for accurate prediction of the peptide:MHC interactions plays a central role in epitope discovery. The MHC class II binding groove is open at both ends, making the correct alignment of a peptide in the binding groove a crucial part of identifying the core of an MHC class II binding motif. Here, we present a novel stabilization matrix alignment method, SMM-align, that allows for direct prediction of peptide:MHC binding affinities. The predictive performance of the method is validated on a large MHC class II benchmark data set covering 14 HLA-DR (human MHC) and three mouse H2-IA alleles. Results The predictive performance of the SMM-align method was demonstrated to be superior to that of the Gibbs sampler, TEPITOPE, SVRMHC, and MHCpred methods. Cross validation between peptide data sets obtained from different sources demonstrated that direct incorporation of peptide length potentially results in over-fitting of the binding prediction method. Focusing on amino terminal peptide flanking residues (PFR), we demonstrate a consistent gain in predictive performance by favoring binding registers with a minimum PFR length of two amino acids. Visualizing the binding motif as obtained by the SMM-align and TEPITOPE methods highlights a series of fundamental discrepancies between the two predicted motifs. For the DRB1*1302 allele for instance, the TEPITOPE method favors basic amino acids at most anchor positions, whereas the SMM-align method identifies a preference for hydrophobic or neutral amino acids at the anchors.
Conclusion The SMM-align method was shown to outperform other state of the art MHC class II prediction methods. The method predicts quantitative peptide:MHC binding affinity values, making it ideally suited for rational epitope discovery. The method has been trained and evaluated on the, to our knowledge, largest benchmark data set publicly available and covers the nine HLA-DR supertypes suggested as well as three mouse H2-IA alleles. Both the peptide benchmark data set and the SMM-align prediction method (NetMHCII) are made publicly available. PMID:17608956
Nielsen, Morten; Lundegaard, Claus; Lund, Ole
2007-07-04
Antigen presenting cells (APCs) sample the extra cellular space and present peptides from here to T helper cells, which can be activated if the peptides are of foreign origin. The peptides are presented on the surface of the cells in complex with major histocompatibility class II (MHC II) molecules. Identification of peptides that bind MHC II molecules is thus a key step in rational vaccine design, and developing methods for accurate prediction of the peptide:MHC interactions plays a central role in epitope discovery. The MHC class II binding groove is open at both ends, making the correct alignment of a peptide in the binding groove a crucial part of identifying the core of an MHC class II binding motif. Here, we present a novel stabilization matrix alignment method, SMM-align, that allows for direct prediction of peptide:MHC binding affinities. The predictive performance of the method is validated on a large MHC class II benchmark data set covering 14 HLA-DR (human MHC) and three mouse H2-IA alleles. The predictive performance of the SMM-align method was demonstrated to be superior to that of the Gibbs sampler, TEPITOPE, SVRMHC, and MHCpred methods. Cross validation between peptide data sets obtained from different sources demonstrated that direct incorporation of peptide length potentially results in over-fitting of the binding prediction method. Focusing on amino terminal peptide flanking residues (PFR), we demonstrate a consistent gain in predictive performance by favoring binding registers with a minimum PFR length of two amino acids. Visualizing the binding motif as obtained by the SMM-align and TEPITOPE methods highlights a series of fundamental discrepancies between the two predicted motifs. For the DRB1*1302 allele for instance, the TEPITOPE method favors basic amino acids at most anchor positions, whereas the SMM-align method identifies a preference for hydrophobic or neutral amino acids at the anchors.
The SMM-align method was shown to outperform other state of the art MHC class II prediction methods. The method predicts quantitative peptide:MHC binding affinity values, making it ideally suited for rational epitope discovery. The method has been trained and evaluated on the, to our knowledge, largest benchmark data set publicly available and covers the nine HLA-DR supertypes suggested as well as three mouse H2-IA alleles. Both the peptide benchmark data set and the SMM-align prediction method (NetMHCII) are made publicly available.
ERIC Educational Resources Information Center
Waters, Louise Bay; Vargo, Merrill
2008-01-01
Urban district reform has been hampered by the challenge of understanding and supporting the tremendous complexity of district change. Improving this understanding through actionable, practice-based research is the purpose of this study. The authors began the study with the hypothesis that achieving districts both align their instructional systems…
QuickProbs—A Fast Multiple Sequence Alignment Algorithm Designed for Graphics Processors
Gudyś, Adam; Deorowicz, Sebastian
2014-01-01
Multiple sequence alignment is a crucial task in a number of biological analyses like secondary structure prediction, domain searching, phylogeny, etc. MSAProbs is currently the most accurate alignment algorithm, but its effectiveness is obtained at the expense of computational time. In this paper we present QuickProbs, a variant of MSAProbs customised for graphics processors. We selected the two most time-consuming stages of MSAProbs to be redesigned for GPU execution: the posterior matrices calculation and the consistency transformation. Experiments on three popular benchmarks (BAliBASE, PREFAB, OXBench-X) on a quad-core PC equipped with a high-end graphics card show QuickProbs to be 5.7 to 9.7 times faster than the original CPU-parallel MSAProbs. Additional tests performed on several protein families from the Pfam database give an overall speed-up of 6.7. Compared to other algorithms like MAFFT, MUSCLE, or ClustalW, QuickProbs proved to be much more accurate at similar speed. Additionally we introduce a tuned variant of QuickProbs which is significantly more accurate on sets of distantly related sequences than MSAProbs without exceeding its computation time. The GPU part of QuickProbs was implemented in OpenCL, thus the package is suitable for graphics processors produced by all major vendors. PMID:24586435
Accelerated Profile HMM Searches
Eddy, Sean R.
2011-01-01
Profile hidden Markov models (profile HMMs) and probabilistic inference methods have made important contributions to the theory of sequence database homology search. However, practical use of profile HMM methods has been hindered by the computational expense of existing software implementations. Here I describe an acceleration heuristic for profile HMMs, the “multiple segment Viterbi” (MSV) algorithm. The MSV algorithm computes an optimal sum of multiple ungapped local alignment segments using a striped vector-parallel approach previously described for fast Smith/Waterman alignment. MSV scores follow the same statistical distribution as gapped optimal local alignment scores, allowing rapid evaluation of significance of an MSV score and thus facilitating its use as a heuristic filter. I also describe a 20-fold acceleration of the standard profile HMM Forward/Backward algorithms using a method I call “sparse rescaling”. These methods are assembled in a pipeline in which high-scoring MSV hits are passed on for reanalysis with the full HMM Forward/Backward algorithm. This accelerated pipeline is implemented in the freely available HMMER3 software package. Performance benchmarks show that the use of the heuristic MSV filter sacrifices negligible sensitivity compared to unaccelerated profile HMM searches. HMMER3 is substantially more sensitive and 100- to 1000-fold faster than HMMER2. HMMER3 is now about as fast as BLAST for protein searches. PMID:22039361
Biclustering as a method for RNA local multiple sequence alignment.
Wang, Shu; Gutell, Robin R; Miranker, Daniel P
2007-12-15
Biclustering is a clustering method that simultaneously clusters both the domain and range of a relation. A challenge in multiple sequence alignment (MSA) is that the alignment of sequences is often intended to reveal groups of conserved functional subsequences. Simultaneously, the grouping of the sequences can impact the alignment; precisely the kind of dual situation biclustering is intended to address. We define a representation of the MSA problem enabling the application of biclustering algorithms. We develop a computer program for local MSA, BlockMSA, that combines biclustering with divide-and-conquer. BlockMSA simultaneously finds groups of similar sequences and locally aligns subsequences within them. Further alignment is accomplished by dividing both the set of sequences and their contents. The net result is both a multiple sequence alignment and a hierarchical clustering of the sequences. BlockMSA was tested on the subsets of the BRAliBase 2.1 benchmark suite that display high variability and on an extension to that suite to larger problem sizes. Also, alignments were evaluated of two large datasets of current biological interest, T box sequences and Group IC1 Introns. The results were compared with alignments computed by ClustalW, MAFFT, MUSCLE and PROBCONS alignment programs using Sum of Pairs (SPS) and Consensus Count. Results for the benchmark suite are sensitive to problem size. On problems of 15 or greater sequences, BlockMSA is consistently the best. On none of the problems in the test suite are there appreciable differences in scores among BlockMSA, MAFFT and PROBCONS. On the T box sequences, BlockMSA does the most faithful job of reproducing known annotations. MAFFT and PROBCONS do not. On the Intron sequences, BlockMSA, MAFFT and MUSCLE are comparable at identifying conserved regions. BlockMSA is implemented in Java. Source code and supplementary datasets are available at http://aug.csres.utexas.edu/msa/
Design and realization of the optical and electron beam alignment system for the HUST-FEL oscillator
NASA Astrophysics Data System (ADS)
Fu, Q.; Tan, P.; Liu, K. F.; Qin, B.; Liu, X.
2018-06-01
A Free Electron Laser (FEL) oscillator with a radiation wavelength of 30-100 μm is under commissioning at Huazhong University of Science and Technology (HUST). This work presents the schematic design and realization procedures for the optical and beam alignment system in the HUST FEL facility. The optical cavity misalignment effects are analyzed with the code OPC + Genesis 1.3, and a misalignment tolerance is proposed based on the simulation results. Based on the undulator mechanical benchmarks, a laser indicating system has been built as the reference datum. The alignment of both the optical axis and the beam trajectory was achieved with this alignment system.
Mapping monomeric threading to protein-protein structure prediction.
Guerler, Aysam; Govindarajoo, Brandon; Zhang, Yang
2013-03-25
The key step of template-based protein-protein structure prediction is the recognition of complexes from experimental structure libraries that have similar quaternary fold. Maintaining two monomer and dimer structure libraries is however laborious, and inappropriate library construction can degrade template recognition coverage. We propose a novel strategy SPRING to identify complexes by mapping monomeric threading alignments to protein-protein interactions based on the original oligomer entries in the PDB, which does not rely on library construction and increases the efficiency and quality of complex template recognitions. SPRING is tested on 1838 nonhomologous protein complexes which can recognize correct quaternary template structures with a TM score >0.5 in 1115 cases after excluding homologous proteins. The average TM score of the first model is 60% and 17% higher than that by HHsearch and COTH, respectively, while the number of targets with an interface RMSD <2.5 Å by SPRING is 134% and 167% higher than these competing methods. SPRING is controlled with ZDOCK on 77 docking benchmark proteins. Although the relative performance of SPRING and ZDOCK depends on the level of homology filters, a combination of the two methods can result in a significantly higher model quality than ZDOCK at all homology thresholds. These data demonstrate a new efficient approach to quaternary structure recognition that is ready to use for genome-scale modeling of protein-protein interactions due to the high speed and accuracy.
Implementation of a parallel protein structure alignment service on cloud.
Hung, Che-Lun; Lin, Yaw-Ling
2013-01-01
Protein structure alignment has become an important strategy by which to identify evolutionary relationships between protein sequences. Several alignment tools are currently available for online comparison of protein structures. In this paper, we propose a parallel protein structure alignment service based on the Hadoop distribution framework. This service includes a protein structure alignment algorithm, a refinement algorithm, and a MapReduce programming model. The refinement algorithm refines the result of alignment. To process vast numbers of protein structures in parallel, the alignment and refinement algorithms are implemented using MapReduce. We analyzed and compared the structure alignments produced by different methods using a dataset randomly selected from the PDB database. The experimental results verify that the proposed algorithm refines the resulting alignments more accurately than existing algorithms. Meanwhile, the computational performance of the proposed service is proportional to the number of processors used in our cloud platform.
Implementation of a Parallel Protein Structure Alignment Service on Cloud
Hung, Che-Lun; Lin, Yaw-Ling
2013-01-01
Protein structure alignment has become an important strategy by which to identify evolutionary relationships between protein sequences. Several alignment tools are currently available for online comparison of protein structures. In this paper, we propose a parallel protein structure alignment service based on the Hadoop distribution framework. This service includes a protein structure alignment algorithm, a refinement algorithm, and a MapReduce programming model. The refinement algorithm refines the result of alignment. To process vast numbers of protein structures in parallel, the alignment and refinement algorithms are implemented using MapReduce. We analyzed and compared the structure alignments produced by different methods using a dataset randomly selected from the PDB database. The experimental results verify that the proposed algorithm refines the resulting alignments more accurately than existing algorithms. Meanwhile, the computational performance of the proposed service is proportional to the number of processors used in our cloud platform. PMID:23671842
StructAlign, a Program for Alignment of Structures of DNA-Protein Complexes.
Popov, Ya V; Galitsyna, A A; Alexeevski, A V; Karyagina, A S; Spirin, S A
2015-11-01
Comparative analysis of structures of complexes of homologous proteins with DNA is important in the analysis of DNA-protein recognition. Alignment is a necessary stage of the analysis. An alignment is a matching of amino acid residues and nucleotides of one complex to residues and nucleotides of the other. Currently, there are no programs available for aligning structures of DNA-protein complexes. We present the program StructAlign, which should fill this gap. The program inputs a pair of complexes of DNA double helix with proteins and outputs an alignment of DNA chains corresponding to the best spatial fit of the protein chains.
Hashemifar, Somaye; Xu, Jinbo
2014-09-01
High-throughput experimental techniques have produced a large amount of protein-protein interaction (PPI) data. The study of PPI networks, such as comparative analysis, shall benefit the understanding of life process and diseases at the molecular level. One way of comparative analysis is to align PPI networks to identify conserved or species-specific subnetwork motifs. A few methods have been developed for global PPI network alignment, but it still remains challenging in terms of both accuracy and efficiency. This paper presents a novel global network alignment algorithm, denoted as HubAlign, that makes use of both network topology and sequence homology information, based upon the observation that topologically important proteins in a PPI network usually are much more conserved and thus, more likely to be aligned. HubAlign uses a minimum-degree heuristic algorithm to estimate the topological and functional importance of a protein from the global network topology information. Then HubAlign aligns topologically important proteins first and gradually extends the alignment to the whole network. Extensive tests indicate that HubAlign greatly outperforms several popular methods in terms of both accuracy and efficiency, especially in detecting functionally similar proteins. HubAlign is available freely for non-commercial purposes at http://ttic.uchicago.edu/~hashemifar/software/HubAlign.zip. Supplementary data are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press.
Accurate multiple sequence-structure alignment of RNA sequences using combinatorial optimization.
Bauer, Markus; Klau, Gunnar W; Reinert, Knut
2007-07-27
The discovery of functional non-coding RNA sequences has led to an increasing interest in algorithms related to RNA analysis. Traditional sequence alignment algorithms, however, fail at computing reliable alignments of low-homology RNA sequences. The spatial conformation of RNA sequences largely determines their function, and therefore RNA alignment algorithms have to take structural information into account. We present a graph-based representation for sequence-structure alignments, which we model as an integer linear program (ILP). We sketch how we compute an optimal or near-optimal solution to the ILP using methods from combinatorial optimization, and present results on a recently published benchmark set for RNA alignments. The implementation of our algorithm yields better alignments in terms of two published scores than the other programs that we tested: This is especially the case with an increasing number of input sequences. Our program LARA is freely available for academic purposes from http://www.planet-lisa.net.
A template-finding algorithm and a comprehensive benchmark for homology modeling of proteins
Vallat, Brinda Kizhakke; Pillardy, Jaroslaw; Elber, Ron
2010-01-01
The first step in homology modeling is to identify a template protein for the target sequence. The template structure is used in later phases of the calculation to construct an atomically detailed model for the target. We have built from the Protein Data Bank a large-scale learning set that includes tens of millions of pair matches that can be either a true template or a false one. Discriminatory learning (learning from positive and negative examples) is employed to train a decision tree. Each branch of the tree is a mathematical programming model. The decision tree is tested on an independent set from PDB entries and on the sequences of CASP7. It provides significant enrichment of true templates (between 50-100 percent) when compared to PSI-BLAST. The model is further verified by building atomically detailed structures for each of the tentative true templates with modeller. The probability that a true match does not yield an acceptable structural model (within 6Å RMSD from the native structure), decays linearly as a function of the TM structural-alignment score. PMID:18300226
Protein Models Docking Benchmark 2
Anishchenko, Ivan; Kundrotas, Petras J.; Tuzikov, Alexander V.; Vakser, Ilya A.
2015-01-01
Structural characterization of protein-protein interactions is essential for our ability to understand life processes. However, only a fraction of known proteins have experimentally determined structures. Such structures provide templates for modeling of a large part of the proteome, where individual proteins can be docked by template-free or template-based techniques. Still, the sensitivity of the docking methods to the inherent inaccuracies of protein models, as opposed to the experimentally determined high-resolution structures, remains largely untested, primarily due to the absence of appropriate benchmark set(s). Structures in such a set should have pre-defined inaccuracy levels and, at the same time, resemble actual protein models in terms of structural motifs/packing. The set should also be large enough to ensure statistical reliability of the benchmarking results. We present a major update of the previously developed benchmark set of protein models. For each interactor, six models were generated with the model-to-native Cα RMSD in the 1 to 6 Å range. The models in the set were generated by a new approach, which corresponds to the actual modeling of new protein structures in the “real case scenario,” as opposed to the previous set, where a significant number of structures were model-like only. In addition, the larger number of complexes (165 vs. 63 in the previous set) increases the statistical reliability of the benchmarking. We estimated the highest accuracy of the predicted complexes (according to CAPRI criteria), which can be attained using the benchmark structures. The set is available at http://dockground.bioinformatics.ku.edu. PMID:25712716
Zhou, Carol L Ecale
2015-01-01
In order to better define regions of similarity among related protein structures, it is useful to identify the residue-residue correspondences among proteins. Few codes exist for constructing a one-to-many multiple sequence alignment derived from a set of structure or sequence alignments, and a need was evident for creating such a tool for combining pairwise structure alignments that would allow for insertion of gaps in the reference structure. This report describes a new Python code, CombAlign, which takes as input a set of pairwise sequence alignments (which may be structure based) and generates a one-to-many, gapped, multiple structure- or sequence-based sequence alignment (MSSA). The use and utility of CombAlign was demonstrated by generating gapped MSSAs using sets of pairwise structure-based sequence alignments between structure models of the matrix protein (VP40) and pre-small/secreted glycoprotein (sGP) of Reston Ebolavirus and the corresponding proteins of several other filoviruses. The gapped MSSAs revealed structure-based residue-residue correspondences, which enabled identification of structurally similar versus differing regions in the Reston proteins compared to each of the other corresponding proteins. CombAlign is a new Python code that generates a one-to-many, gapped, multiple structure- or sequence-based sequence alignment (MSSA) given a set of pairwise sequence alignments (which may be structure based). CombAlign has utility in assisting the user in distinguishing structurally conserved versus divergent regions on a reference protein structure relative to other closely related proteins. CombAlign was developed in Python 2.6, and the source code is available for download from the GitHub code repository.
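The one-to-many merge described in the abstract above can be illustrated with a minimal sketch. This is not CombAlign's actual code: the function name `combine_pairwise`, the input layout (a list of `(reference_aligned, other_aligned)` string pairs with `-` as the gap character), and the padding strategy are all assumptions made for illustration. The idea is to record, for each pairwise alignment, which residues are inserted before each reference position, then widen every insertion slot to the longest insertion seen so that gaps appear in the reference row.

```python
from typing import List, Tuple


def combine_pairwise(pairs: List[Tuple[str, str]]) -> List[str]:
    """Merge pairwise alignments that share one reference into a gapped MSA.

    Hypothetical helper: row 0 of the result is the gapped reference,
    rows 1..n are the other sequences in input order.
    """
    if not pairs:
        raise ValueError("at least one pairwise alignment is required")
    ref = pairs[0][0].replace("-", "")
    n = len(ref)

    parsed = []
    for ref_aln, oth_aln in pairs:
        if len(ref_aln) != len(oth_aln) or ref_aln.replace("-", "") != ref:
            raise ValueError("every pair must align the same reference")
        inserts = [""] * (n + 1)  # residues inserted before reference position p
        matched = [""] * n        # character aligned to reference position p
        p = 0
        for r, o in zip(ref_aln, oth_aln):
            if r == "-":
                inserts[p] += o
            else:
                matched[p] = o
                p += 1
        parsed.append((inserts, matched))

    # Each insertion slot is as wide as the longest insertion at that position.
    width = [max(len(ins[p]) for ins, _ in parsed) for p in range(n + 1)]

    rows = [""] * (len(pairs) + 1)
    for p in range(n + 1):
        rows[0] += "-" * width[p]
        for i, (ins, _) in enumerate(parsed, start=1):
            rows[i] += ins[p].ljust(width[p], "-")
        if p < n:
            rows[0] += ref[p]
            for i, (_, mat) in enumerate(parsed, start=1):
                rows[i] += mat[p]
    return rows


if __name__ == "__main__":
    for row in combine_pairwise([("AC-D", "ACGD"), ("A-CD", "ATC-")]):
        print(row)
```

Insertions from different sequences at the same slot are left-justified rather than aligned to each other, which is the simplest choice when only correspondence to the reference matters.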
Benchmarking protein-protein interface predictions: why you should care about protein size.
Martin, Juliette
2014-07-01
A number of predictive methods have been developed to predict protein-protein binding sites. Each new method is traditionally benchmarked using sets of protein structures of various sizes, and global statistics are used to assess the quality of the prediction. Little attention has been paid to the potential bias due to protein size on these statistics. Indeed, small proteins involve proportionally more residues at interfaces than large ones. If a predictive method is biased toward small proteins, this can lead to an over-estimation of its performance. Here, we investigate the bias due to the size effect when benchmarking protein-protein interface prediction on the widely used docking benchmark 4.0. First, we simulate random scores that favor small proteins over large ones. Instead of the 0.5 AUC (Area Under the Curve) value expected by chance, these biased scores result in an AUC equal to 0.6 using hypergeometric distributions, and up to 0.65 using constant scores. We then use real prediction results to illustrate how to detect the size bias by shuffling, and subsequently correct it using a simple conversion of the scores into normalized ranks. In addition, we investigate the scores produced by eight published methods and show that they are all affected by the size effect, which can change their relative ranking. The size effect also has an impact on linear combination scores by modifying the relative contributions of each method. In the future, systematic corrections should be applied when benchmarking predictive methods using data sets with mixed protein sizes. © 2014 Wiley Periodicals, Inc.
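The rank-based correction mentioned in the abstract above can be sketched as follows. The abstract only states that scores are converted into normalized ranks; the details here (ranking within each protein, dividing by the number of residues, averaging ranks for ties) and the name `normalized_ranks` are assumptions for illustration, not the author's implementation.

```python
from typing import List


def normalized_ranks(scores: List[float]) -> List[float]:
    """Replace each residue score by its rank within the protein, scaled to (0, 1].

    Assumption: ties receive the average of the ranks they span.
    """
    n = len(scores)
    if n == 0:
        return []
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Extend j over the run of residues that share the same score.
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank / n
        i = j + 1
    return ranks


if __name__ == "__main__":
    print(normalized_ranks([0.2, 0.9, 0.5]))
```

Because every protein is mapped onto the same (0, 1] scale regardless of its length, pooled statistics such as the AUC no longer reward a method simply for scoring small proteins higher.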
Accelerating Information Retrieval from Profile Hidden Markov Model Databases.
Tamimi, Ahmad; Ashhab, Yaqoub; Tamimi, Hashem
2016-01-01
Profile Hidden Markov Model (Profile-HMM) is an efficient statistical approach to represent protein families. Currently, several databases maintain valuable protein sequence information as profile-HMMs. There is increasing interest in improving the efficiency of searching Profile-HMM databases to detect sequence-profile or profile-profile homology. However, most efforts to enhance searching efficiency have focused on improving the alignment algorithms. Although the performance of these algorithms is fairly acceptable, the growing size of these databases, as well as the increasing demand for batch query searching, are strong motivations for further enhancing information retrieval from profile-HMM databases. This work presents a heuristic method to accelerate the current profile-HMM homology searching approaches. The method works by cluster-based remodeling of the database to reduce the search space, rather than focusing on the alignment algorithms. Using different clustering techniques, 4284 TIGRFAMs profiles were clustered based on their similarities. A representative for each cluster was assigned. To enhance sensitivity, we proposed an extended step that allows overlapping among clusters. A validation benchmark of 6000 randomly selected protein sequences was used to query the clustered profiles. To evaluate the efficiency of our approach, speed and recall values were measured and compared with the sequential search approach. Using hierarchical, k-means, and connected component clustering techniques followed by the extended overlapping step, we obtained an average reduction in time of 41%, and an average recall of 96%. Our results demonstrate that representation of profile-HMMs using a clustering-based approach can significantly accelerate data retrieval from profile-HMM databases.
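The representative-first search described in the abstract above can be sketched in a few lines. Everything concrete here is an assumption for illustration: profiles are stood in for by plain strings, `identity_score` is a toy similarity rather than a profile-HMM comparison, and `clustered_search` and its `top_clusters` parameter are hypothetical names. The sketch only shows how scoring one representative per cluster first shrinks the number of full comparisons.

```python
from typing import Callable, Dict, List, Tuple


def identity_score(a: str, b: str) -> int:
    """Toy similarity: number of positions at which two strings agree."""
    return sum(x == y for x, y in zip(a, b))


def clustered_search(
    query: str,
    clusters: Dict[str, List[str]],
    score: Callable[[str, str], float] = identity_score,
    top_clusters: int = 1,
) -> Tuple[str, float]:
    """Score cluster representatives first, then search only the best clusters.

    `clusters` maps a representative to the members of its cluster.
    """
    if not clusters:
        raise ValueError("no clusters to search")
    # Step 1: rank clusters by how well their representative matches the query.
    ranked = sorted(clusters, key=lambda rep: score(query, rep), reverse=True)
    # Step 2: compare the query against members of the top-ranked clusters only.
    best, best_score = "", float("-inf")
    for rep in ranked[:top_clusters]:
        for member in clusters[rep]:
            s = score(query, member)
            if s > best_score:
                best, best_score = member, s
    return best, best_score


if __name__ == "__main__":
    db = {"AAAA": ["AAAA", "AAAT"], "CCCC": ["CCCC", "CCCG"]}
    print(clustered_search("CCCG", db))
```

Raising `top_clusters` trades speed for sensitivity, which is loosely analogous to the overlapping step the abstract describes for recovering hits near cluster boundaries.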
DNA sequence+shape kernel enables alignment-free modeling of transcription factor binding.
Ma, Wenxiu; Yang, Lin; Rohs, Remo; Noble, William Stafford
2017-10-01
Transcription factors (TFs) bind to specific DNA sequence motifs. Several lines of evidence suggest that TF-DNA binding is mediated in part by properties of the local DNA shape: the width of the minor groove, the relative orientations of adjacent base pairs, etc. Several methods have been developed to jointly account for DNA sequence and shape properties in predicting TF binding affinity. However, a limitation of these methods is that they typically require a training set of aligned TF binding sites. We describe a sequence + shape kernel that leverages DNA sequence and shape information to better understand protein-DNA binding preference and affinity. This kernel extends an existing class of k-mer based sequence kernels, based on the recently described di-mismatch kernel. Using three in vitro benchmark datasets, derived from universal protein binding microarrays (uPBMs), genomic context PBMs (gcPBMs) and SELEX-seq data, we demonstrate that incorporating DNA shape information improves our ability to predict protein-DNA binding affinity. In particular, we observe that (i) the k-spectrum + shape model performs better than the classical k-spectrum kernel, particularly for small k values; (ii) the di-mismatch kernel performs better than the k-mer kernel, for larger k; and (iii) the di-mismatch + shape kernel performs better than the di-mismatch kernel for intermediate k values. The software is available at https://bitbucket.org/wenxiu/sequence-shape.git. Contact: rohs@usc.edu or william-noble@uw.edu. Supplementary data are available at Bioinformatics online. © The Author(s) 2017. Published by Oxford University Press.
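For orientation, the classical k-spectrum kernel that the abstract above uses as its baseline can be written compactly: each sequence is mapped to its vector of k-mer counts, and the kernel value is the inner product of the two count vectors. The sketch below covers only that baseline; the shape features and the di-mismatch kernel are not implemented, and the function names are illustrative rather than taken from the authors' software.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq (empty if seq is shorter than k)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def k_spectrum_kernel(a: str, b: str, k: int) -> int:
    """Inner product of the k-mer count vectors of two sequences."""
    counts_a = kmer_counts(a, k)
    counts_b = kmer_counts(b, k)
    return sum(c * counts_b[kmer] for kmer, c in counts_a.items())


if __name__ == "__main__":
    print(k_spectrum_kernel("ACGT", "ACGT", 2))
```

No alignment of the two sequences is needed, which is the property the abstract refers to as alignment-free modeling.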
Zhang, Chengxin; Zheng, Wei; Freddolino, Peter L; Zhang, Yang
2018-03-10
Homology-based transferal remains the major approach to computational protein function annotation, but it becomes increasingly unreliable when the sequence identity between query and template decreases below 30%. We propose a novel pipeline, MetaGO, to deduce Gene Ontology attributes of proteins by combining sequence homology-based annotation with low-resolution structure prediction and comparison, and partner's homology-based protein-protein network mapping. The pipeline was tested on a large-scale set of 1000 non-redundant proteins from the CAFA3 experiment. Under the stringent benchmark conditions where templates with >30% sequence identity to the query are excluded, MetaGO achieves average F-measures of 0.487, 0.408, and 0.598, for Molecular Function, Biological Process, and Cellular Component, respectively, which are significantly higher than those achieved by other state-of-the-art function annotation methods. Detailed data analysis shows that the major advantage of MetaGO lies in the new functional homolog detections from partner's homology-based network mapping and structure-based local and global structure alignments, the confidence scores of which can be optimally combined through logistic regression. These data demonstrate the power of using a hybrid model incorporating protein structure and interaction networks to deduce new functional insights beyond traditional sequence homology-based referrals, especially for proteins that lack homologous function templates. The MetaGO pipeline is available at http://zhanglab.ccmb.med.umich.edu/MetaGO/. Copyright © 2018. Published by Elsevier Ltd.
Douzery, Emmanuel J P; Scornavacca, Celine; Romiguier, Jonathan; Belkhir, Khalid; Galtier, Nicolas; Delsuc, Frédéric; Ranwez, Vincent
2014-07-01
Comparative genomic studies rely extensively on alignments of orthologous sequences. Yet, selecting, gathering, and aligning orthologous exons and protein-coding sequences (CDS) that are relevant for a given evolutionary analysis can be a difficult and time-consuming task. In this context, we developed OrthoMaM, a database of ORTHOlogous MAmmalian Markers describing the evolutionary dynamics of orthologous genes in mammalian genomes using a phylogenetic framework. Since its first release in 2007, OrthoMaM has regularly evolved, not only to include newly available genomes but also to incorporate up-to-date software in its analytic pipeline. This eighth release integrates the 40 complete mammalian genomes available in Ensembl v73 and provides alignments, phylogenies, evolutionary descriptor information, and functional annotations for 13,404 single-copy orthologous CDS and 6,953 long exons. The graphical interface allows users to easily explore OrthoMaM to identify markers with specific characteristics (e.g., taxa availability, alignment size, %G+C, evolutionary rate, chromosome location). It hence provides an efficient solution to sample preprocessed markers adapted to user-specific needs. OrthoMaM has proven to be a valuable resource for researchers interested in mammalian phylogenomics and evolutionary genomics, and has served as a source of benchmark empirical data sets in several methodological studies. OrthoMaM is available for browsing, query and complete or filtered downloads at http://www.orthomam.univ-montp2.fr/. © The Author 2014. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
Sael, Lee; Kihara, Daisuke
2012-01-01
Functional elucidation of proteins is one of the essential tasks in biology. The function of a protein, specifically the small ligand molecules that bind to it, can be predicted by finding similar local surface regions in binding sites of known proteins. Here, we developed an alignment-free local surface comparison method for predicting a ligand molecule which binds to a query protein. The algorithm, named Patch-Surfer, represents a binding pocket as a combination of segmented surface patches, each of which is characterized by its geometrical shape, the electrostatic potential, the hydrophobicity, and the concaveness. Representing a pocket by a set of patches effectively absorbs differences in global pocket shape while capturing local similarity of pockets. The shape and the physicochemical properties of surface patches are represented using the 3D Zernike descriptor, which is a series expansion of a mathematical 3D function. Two pockets are compared using a modified weighted bipartite matching algorithm, which matches similar patches from the two pockets. Patch-Surfer was benchmarked on three datasets, which consist in total of 390 proteins that bind to one of 21 ligands. Patch-Surfer showed superior performance to existing methods including a global pocket comparison method, Pocket-Surfer, which we have previously introduced. In particular, as intended, the accuracy showed a large improvement for flexible ligand molecules, which bind to pockets in different conformations. PMID:22275074
Unified Alignment of Protein-Protein Interaction Networks.
Malod-Dognin, Noël; Ban, Kristina; Pržulj, Nataša
2017-04-19
Paralleling the increasing availability of protein-protein interaction (PPI) network data, several network alignment methods have been proposed. Network alignments have been used to uncover functionally conserved network parts and to transfer annotations. However, due to the computational intractability of the network alignment problem, aligners are heuristics providing divergent solutions and no consensus exists on a gold standard, or which scoring scheme should be used to evaluate them. We comprehensively evaluate the alignment scoring schemes and global network aligners on large scale PPI data and observe that three methods, HUBALIGN, L-GRAAL and NATALIE, regularly produce the most topologically and biologically coherent alignments. We study the collective behaviour of network aligners and observe that PPI networks are almost entirely aligned with a handful of aligners that we unify into a new tool, Ulign. Ulign enables complete alignment of two networks, which traditional global and local aligners fail to do. Also, multiple mappings of Ulign define biologically relevant soft clusterings of proteins in PPI networks, which may be used for refining the transfer of annotations across networks. Hence, PPI networks are already well investigated by current aligners, so to gain additional biological insights, a paradigm shift is needed. We propose such a shift come from aligning all available data types collectively rather than any particular data type in isolation from others.
Accelerating large-scale protein structure alignments with graphics processing units
2012-01-01
Background: Large-scale protein structure alignment, an indispensable tool in structural bioinformatics, poses a tremendous challenge to computational resources. To ensure structure alignment accuracy and efficiency, efforts have been made to parallelize traditional alignment algorithms in grid environments. However, these solutions are costly and of limited accessibility. Others trade alignment quality for speedup by using high-level characteristics of structure fragments for structure comparisons. Findings: We present ppsAlign, a parallel protein structure Alignment framework designed and optimized to exploit the parallelism of Graphics Processing Units (GPUs). As a general-purpose GPU platform, ppsAlign could take many concurrent methods, such as TM-align and Fr-TM-align, into the parallelized algorithm design. We evaluated ppsAlign on an NVIDIA Tesla C2050 GPU card, and compared it with existing software solutions running on an AMD dual-core CPU. We observed a 36-fold speedup over TM-align, a 65-fold speedup over Fr-TM-align, and a 40-fold speedup over MAMMOTH. Conclusions: ppsAlign is a high-performance protein structure alignment tool designed to tackle the computational complexity issues from protein structural data. The solution presented in this paper allows large-scale structure comparisons to be performed using the massively parallel computing power of GPUs. PMID:22357132
AlignMe—a membrane protein sequence alignment web server
Stamm, Marcus; Staritzbichler, René; Khafizov, Kamil; Forrest, Lucy R.
2014-01-01
We present a web server for pair-wise alignment of membrane protein sequences, using the program AlignMe. The server makes available two operational modes of AlignMe: (i) sequence to sequence alignment, taking two sequences in fasta format as input, combining information about each sequence from multiple sources and producing a pair-wise alignment (PW mode); and (ii) alignment of two multiple sequence alignments to create family-averaged hydropathy profile alignments (HP mode). For the PW sequence alignment mode, four different optimized parameter sets are provided, each suited to pairs of sequences with a specific similarity level. These settings utilize different types of inputs: (position-specific) substitution matrices, secondary structure predictions and transmembrane propensities from transmembrane predictions or hydrophobicity scales. In the second (HP) mode, each input multiple sequence alignment is converted into a hydrophobicity profile averaged over the provided set of sequence homologs; the two profiles are then aligned. The HP mode enables qualitative comparison of transmembrane topologies (and therefore potentially of 3D folds) of two membrane proteins, which can be useful if the proteins have low sequence similarity. In summary, the AlignMe web server provides user-friendly access to a set of tools for analysis and comparison of membrane protein sequences. Access is available at http://www.bioinfo.mpg.de/AlignMe PMID:24753425
Alignment of angular velocity sensors for a vestibular prosthesis.
Digiovanna, Jack; Carpaneto, Jacopo; Micera, Silvestro; Merfeld, Daniel M
2012-02-13
Vestibular prosthetics transmit angular velocities to the nervous system via electrical stimulation. Head-fixed gyroscopes measure angular motion, but the gyroscope coordinate system will not be coincident with the sensory organs the prosthetic replaces. Here we show a simple calibration method to align gyroscope measurements with the anatomical coordinate system. We benchmarked the method with simulated movements and obtained proof-of-concept with one healthy subject. The method was robust to misalignment and required little data and minimal processing.
BatMis: a fast algorithm for k-mismatch mapping.
Tennakoon, Chandana; Purbojati, Rikky W; Sung, Wing-Kin
2012-08-15
Second-generation sequencing (SGS) generates millions of reads that need to be aligned to a reference genome allowing errors. Although current aligners can efficiently map reads allowing a small number of mismatches, they are not well suited for handling a large number of mismatches. The efficiency of aligners can be improved using various heuristics, but the sensitivity and accuracy of the alignments are sacrificed. In this article, we introduce Basic Alignment tool for Mismatches (BatMis), an efficient method to align short reads to a reference allowing k mismatches. BatMis is a Burrows-Wheeler transform-based aligner that uses a seed-and-extend approach, and it is an exact method. Benchmark tests show that BatMis performs better than competing aligners in solving the k-mismatch problem. Furthermore, it can compete favorably even when compared with the heuristic modes of the other aligners. BatMis is a useful alternative for applications where fast k-mismatch mappings, unique mappings or multiple mappings of SGS data are required. BatMis is written in C/C++ and is freely available from http://code.google.com/p/batmis/
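To make the k-mismatch problem in the abstract above concrete, the sketch below solves it by brute force: it slides the read along the reference and reports every start position whose Hamming distance to the read is at most k. This is deliberately not BatMis's method (the abstract describes a Burrows-Wheeler-based seed-and-extend aligner); the function name and return format are assumptions, and the scan takes time proportional to reference length times read length, which is exactly the cost that indexed aligners avoid.

```python
from typing import List, Tuple


def k_mismatch_positions(reference: str, read: str, k: int) -> List[Tuple[int, int]]:
    """Return (start, mismatches) for every placement with at most k mismatches.

    Exact but naive: only substitutions are allowed, no insertions or deletions.
    """
    hits = []
    m = len(read)
    for start in range(len(reference) - m + 1):
        mismatches = 0
        for ref_base, read_base in zip(reference[start:start + m], read):
            if ref_base != read_base:
                mismatches += 1
                if mismatches > k:
                    break  # too many mismatches, abandon this placement
        else:
            # The inner loop finished without exceeding k mismatches.
            hits.append((start, mismatches))
    return hits


if __name__ == "__main__":
    print(k_mismatch_positions("ACGTACGT", "ACGA", 1))
```

Like the method in the abstract, this scan is exact in the sense that it reports every placement within k mismatches; it differs only in how much work it does to find them.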
Benchmarking short sequence mapping tools
2013-01-01
Background: The development of next-generation sequencing instruments has led to the generation of millions of short sequences in a single run. The process of aligning these reads to a reference genome is time consuming and demands the development of fast and accurate alignment tools. However, the currently proposed tools make different compromises between the accuracy and the speed of mapping. Moreover, many important aspects are overlooked while comparing the performance of a newly developed tool to the state of the art. Therefore, there is a need for an objective evaluation method that covers all the aspects. In this work, we introduce a benchmarking suite to extensively analyze sequencing tools with respect to various aspects and provide an objective comparison. Results: We applied our benchmarking tests on 9 well known mapping tools, namely, Bowtie, Bowtie2, BWA, SOAP2, MAQ, RMAP, GSNAP, Novoalign, and mrsFAST (mrFAST) using synthetic data and real RNA-Seq data. MAQ and RMAP are based on building hash tables for the reads, whereas the remaining tools are based on indexing the reference genome. The benchmarking tests reveal the strengths and weaknesses of each tool. The results show that no single tool outperforms all others in all metrics. However, Bowtie maintained the best throughput for most of the tests while BWA performed better for longer read lengths. The benchmarking tests are not restricted to the mentioned tools and can be further applied to others. Conclusion: The mapping process is still a hard problem that is affected by many factors. In this work, we provided a benchmarking suite that reveals and evaluates the different factors affecting the mapping process. Still, there is no tool that outperforms all of the others in all the tests. Therefore, end users should clearly specify their needs in order to choose the tool that provides the best results. PMID:23758764
Roca, Alberto I
2014-01-01
The 2013 BioVis Contest provided an opportunity to evaluate different paradigms for visualizing protein multiple sequence alignments. Such data sets are becoming extremely large and thus taxing current visualization paradigms. Sequence Logos represent consensus sequences but have limitations for protein alignments. As an alternative, ProfileGrids are a new protein sequence alignment visualization paradigm that represents an alignment as a color-coded matrix of the residue frequency occurring at every homologous position in the aligned protein family. The JProfileGrid software program was used to analyze the BioVis contest data sets to generate figures for comparison with the Sequence Logo reference images. The ProfileGrid representation allows for the clear and effective analysis of protein multiple sequence alignments. This includes both a general overview of the conservation and diversity sequence patterns as well as the interactive ability to query the details of the protein residue distributions in the alignment. The JProfileGrid software is free and available from http://www.ProfileGrid.org.
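The matrix underlying a ProfileGrid, as described in the abstract above, is the frequency of each residue at each alignment column. The sketch below computes that table from a list of equal-length aligned sequences. It is not the JProfileGrid implementation: the function name, the choice to count the gap character `-` as its own symbol, and the list-of-dicts output are assumptions for illustration, and the color coding and interactive display are omitted.

```python
from collections import Counter
from typing import Dict, List


def residue_frequencies(alignment: List[str]) -> List[Dict[str, float]]:
    """Return, for each column, the fraction of sequences carrying each symbol."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    n = len(alignment)
    grid = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        grid.append({residue: count / n for residue, count in counts.items()})
    return grid


if __name__ == "__main__":
    for column in residue_frequencies(["AC-", "AG-", "TC-"]):
        print(column)
```

Unlike a consensus sequence, this table keeps every residue observed at a position, so minority variants remain visible however many sequences the alignment contains.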
Ó Conchúir, Shane; Barlow, Kyle A; Pache, Roland A; Ollikainen, Noah; Kundert, Kale; O'Meara, Matthew J; Smith, Colin A; Kortemme, Tanja
2015-01-01
The development and validation of computational macromolecular modeling and design methods depend on suitable benchmark datasets and informative metrics for comparing protocols. In addition, if a method is intended to be adopted broadly in diverse biological applications, there needs to be information on appropriate parameters for each protocol, as well as metrics describing the expected accuracy compared to experimental data. In certain disciplines, there exist established benchmarks and public resources where experts in a particular methodology are encouraged to supply their most efficient implementation of each particular benchmark. We aim to provide such a resource for protocols in macromolecular modeling and design. We present a freely accessible web resource (https://kortemmelab.ucsf.edu/benchmarks) to guide the development of protocols for protein modeling and design. The site provides benchmark datasets and metrics to compare the performance of a variety of modeling protocols using different computational sampling methods and energy functions, providing a "best practice" set of parameters for each method. Each benchmark has an associated downloadable benchmark capture archive containing the input files, analysis scripts, and tutorials for running the benchmark. The captures may be run with any suitable modeling method; we supply command lines for running the benchmarks using the Rosetta software suite. We have compiled initial benchmarks for the resource spanning three key areas: prediction of energetic effects of mutations, protein design, and protein structure prediction, each with associated state-of-the-art modeling protocols. With the help of the wider macromolecular modeling community, we hope to expand the variety of benchmarks included on the website and continue to evaluate new iterations of current methods as they become available.
Jessen, Leon Eyrich; Hoof, Ilka; Lund, Ole; Nielsen, Morten
2013-07-01
Identifying which mutation(s) within a given genotype is responsible for an observable phenotype is important in many aspects of molecular biology. Here, we present SigniSite, an online application for subgroup-free residue-level genotype-phenotype correlation. In contrast to similar methods, SigniSite does not require any pre-definition of subgroups or binary classification. Input is a set of protein sequences where each sequence has an associated real number, quantifying a given phenotype. SigniSite will then identify which amino acid residues are significantly associated with the data set phenotype. As output, SigniSite displays a sequence logo, depicting the strength of the phenotype association of each residue and a heat-map identifying 'hot' or 'cold' regions. SigniSite was benchmarked against SPEER, a state-of-the-art method for the prediction of specificity determining positions (SDP) using a set of human immunodeficiency virus protease-inhibitor genotype-phenotype data and corresponding resistance mutation scores from the Stanford University HIV Drug Resistance Database, and a data set of protein families with experimentally annotated SDPs. For both data sets, SigniSite was found to outperform SPEER. SigniSite is available at: http://www.cbs.dtu.dk/services/SigniSite/.
BAYESIAN PROTEIN STRUCTURE ALIGNMENT.
Rodriguez, Abel; Schmidler, Scott C
The analysis of the three-dimensional structure of proteins is an important topic in molecular biochemistry. Structure plays a critical role in defining the function of proteins and is more strongly conserved than amino acid sequence over evolutionary timescales. A key challenge is the identification and evaluation of structural similarity between proteins; such analysis can aid in understanding the role of newly discovered proteins and help elucidate evolutionary relationships between organisms. Computational biologists have developed many clever algorithmic techniques for comparing protein structures; however, all are based on heuristic optimization criteria, making statistical interpretation somewhat difficult. Here we present a fully probabilistic framework for pairwise structural alignment of proteins. Our approach has several advantages, including the ability to capture alignment uncertainty and to estimate key "gap" parameters which critically affect the quality of the alignment. We show that several existing alignment methods arise as maximum a posteriori estimates under specific choices of prior distributions and error models. Our probabilistic framework is also easily extended to incorporate additional information, which we demonstrate by including primary sequence information to generate simultaneous sequence-structure alignments that can resolve ambiguities obtained using structure alone. This combined model also provides a natural approach for the difficult task of estimating evolutionary distance based on structural alignments. The model is illustrated by comparison with well-established methods on several challenging protein alignment examples.
LiveBench-1: continuous benchmarking of protein structure prediction servers.
Bujnicki, J M; Elofsson, A; Fischer, D; Rychlewski, L
2001-02-01
We present a novel, continuous approach aimed at the large-scale assessment of the performance of available fold-recognition servers. Six popular servers were investigated: PDB-Blast, FFAS, T98-lib, GenTHREADER, 3D-PSSM, and INBGU. The assessment was conducted using as prediction targets a large number of selected protein structures released from October 1999 to April 2000. A target was selected if its sequence showed no significant similarity to any of the proteins previously available in the structural database. Overall, the servers were able to produce structurally similar models for one-half of the targets, but significantly accurate sequence-structure alignments were produced for only one-third of the targets. We further classified the targets into two sets: easy and hard. We found that all servers were able to find the correct answer for the vast majority of the easy targets if a structurally similar fold was present in the server's fold libraries. However, among the hard targets--where standard methods such as PSI-BLAST fail--the most sensitive fold-recognition servers were able to produce similar models for only 40% of the cases, half of which had a significantly accurate sequence-structure alignment. Among the hard targets, the presence of updated libraries appeared to be less critical for the ranking. An "ideally combined consensus" prediction, where the results of all servers are considered, would increase the percentage of correct assignments by 50%. Each server had a number of cases with a correct assignment, where the assignments of all the other servers were wrong. This emphasizes the benefits of considering more than one server in difficult prediction tasks. The LiveBench program (http://BioInfo.PL/LiveBench) is being continued, and all interested developers are cordially invited to join.
Optimal parameters for arterial repair using light-activated surgical adhesives.
Soller, Eric C; Hoffman, Grant T; McNally-Heintzelman, Karen M
2003-01-01
The clinical acceptance of laser-tissue repair techniques is dependent on the reproducibility of viable repairs. Reproducibility is dependent on two factors: (i) the choice of materials to be used as the adhesive; and (ii) obtaining temperatures high enough to cause protein denaturation at the vital tissue interface without causing excessive thermal damage to the surrounding tissue. The use of a polymer scaffold as a carrier for the protein solder provides for uniform application of the solder to the tissue, thus allowing for pre-selection of optimal laser parameters. The scaffold also facilitates precise tissue alignment and ease of clinical application. In addition, the scaffold can be doped with various pharmaceuticals such as hemostatic and thrombogenic agents to aid wound healing. An ex vivo study was performed to correlate solder and tissue temperature with the tensile strength of arterial repairs formed using scaffold-enhanced light-activated surgical adhesives. Previous studies by our group using solid protein solder without the scaffold indicate that a solder/tissue interface temperature of 65 degrees C is optimal. Using this parameter as a benchmark, laser irradiance was varied and temperatures were recorded at the surface and at the tissue interface of scaffold-enhanced protein solder using an infrared temperature monitoring system, designed by the researchers, and a type-K thermocouple, respectively.
A method of alignment masking for refining the phylogenetic signal of multiple sequence alignments.
Rajan, Vaibhav
2013-03-01
Inaccurate inference of positional homologies in multiple sequence alignments and systematic errors introduced by alignment heuristics obfuscate phylogenetic inference. Alignment masking, the elimination of phylogenetically uninformative or misleading sites from an alignment before phylogenetic analysis, is a common practice in phylogenetic analysis. Although masking is often done manually, automated methods are necessary to handle the much larger data sets being prepared today. In this study, we introduce the concept of subsplits and demonstrate their use in extracting phylogenetic signal from alignments. We design a clustering approach for alignment masking where each cluster contains similar columns, with similarity defined on the basis of compatible subsplits; our approach then identifies noisy clusters and eliminates them. Trees inferred from the columns in the retained clusters are found to be topologically closer to the reference trees. We test our method on numerous standard benchmarks (both synthetic and biological data sets) and compare its performance with other methods of alignment masking. We find that our method can eliminate sites more accurately than other methods, particularly on divergent data, and can improve the topologies of the inferred trees in likelihood-based analyses. Software available upon request from the author.
34 CFR 300.320 - Definition of individualized education program.
Code of Federal Regulations, 2011 CFR
2011-07-01
... of the child's present levels of academic achievement and functional performance, including— (i) How... statement of measurable annual goals, including academic and functional goals designed to— (A) Meet the... aligned to alternate academic achievement standards, a description of benchmarks or short-term objectives...
Hu, Jialu; Kehr, Birte; Reinert, Knut
2014-02-15
Owing to recent advancements in high-throughput technologies, protein-protein interaction networks of more and more species are becoming available in public databases. The question of how to identify functionally conserved proteins across species attracts a lot of attention in computational biology. Network alignments provide a systematic way to solve this problem. However, most existing alignment tools encounter limitations in tackling this problem. Therefore, the demand for faster and more efficient alignment tools is growing. We present a fast and accurate algorithm, NetCoffee, which finds a global alignment of multiple protein-protein interaction networks. NetCoffee searches for a global alignment by maximizing a target function using simulated annealing on a set of weighted bipartite graphs that are constructed using a triplet approach similar to T-Coffee. To assess its performance, NetCoffee was applied to four real datasets. Our results suggest that NetCoffee remedies several limitations of previous algorithms, outperforms all existing alignment tools in terms of speed and nevertheless identifies biologically meaningful alignments. The source code and data are freely available for download under the GNU GPL v3 license at https://code.google.com/p/netcoffee/.
PANDA: Protein function prediction using domain architecture and affinity propagation.
Wang, Zheng; Zhao, Chenguang; Wang, Yiheng; Sun, Zheng; Wang, Nan
2018-02-22
We developed PANDA (Propagation of Affinity and Domain Architecture) to predict protein functions in the format of Gene Ontology (GO) terms. PANDA first runs a profile-profile alignment algorithm to search against the PfamA, KOG, COG, and SwissProt databases, and then launches PSI-BLAST against UniProt for homologue search. PANDA integrates a domain architecture inference algorithm based on Bayesian statistics that calculates the probability of having a GO term. All the candidate GO terms are pooled and filtered based on Z-score. After that, the remaining GO terms are clustered using an affinity propagation algorithm based on the GO directed acyclic graph, followed by a second round of filtering on the clusters of GO terms. We benchmarked the performance of all the baseline predictors that PANDA integrates, as well as every pooling and filtering step of PANDA. PANDA achieves better performance in terms of area under the precision-recall curve than the baseline predictors. PANDA can be accessed from http://dna.cs.miami.edu/PANDA/ .
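The Z-score filtering step mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not say how the Z-scores are computed or what cutoff is used, so the function name `filter_by_zscore`, the use of the population standard deviation, the cutoff of 1.0, and the example scores are all assumptions made for illustration only.

```python
from statistics import mean, pstdev

def filter_by_zscore(scores, threshold=1.0):
    """Keep the terms whose score lies at least `threshold` standard
    deviations above the mean of all pooled candidate scores.

    scores: dict mapping a GO term ID to a raw confidence score.
    threshold: hypothetical cutoff, not taken from the abstract.
    """
    values = list(scores.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All candidates score the same, so no term stands out.
        return {}
    return {
        term: (s - mu) / sigma
        for term, s in scores.items()
        if (s - mu) / sigma >= threshold
    }

# Illustrative pooled candidates; the scores are made up.
pooled = {
    "GO:0003677": 0.9,
    "GO:0005515": 0.2,
    "GO:0008150": 0.1,
    "GO:0016020": 0.2,
}
kept = filter_by_zscore(pooled)
```

With these made-up scores only the single high-scoring term survives the filter; a lower threshold would retain more candidates for the later clustering step.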
Historian: accurate reconstruction of ancestral sequences and evolutionary rates.
Holmes, Ian H
2017-04-15
Reconstruction of ancestral sequence histories, and estimation of parameters like indel rates, are improved by using explicit evolutionary models and summing over uncertain alignments. The previous best tool for this purpose (according to simulation benchmarks) was ProtPal, but this tool was too slow for practical use. Historian combines an efficient reimplementation of the ProtPal algorithm with performance-improving heuristics from other alignment tools. Simulation results on fidelity of rate estimation via ancestral reconstruction, along with evaluations on the structurally informed alignment dataset BAliBase 3.0, recommend Historian over other alignment tools for evolutionary applications. Historian is available at https://github.com/evoldoers/historian under the Creative Commons Attribution 3.0 US license. Contact: ihholmes+historian@gmail.com.
SFESA: a web server for pairwise alignment refinement by secondary structure shifts.
Tong, Jing; Pei, Jimin; Grishin, Nick V
2015-09-03
Protein sequence alignment is essential for a variety of tasks such as homology modeling and active site prediction. Alignment errors remain the main cause of low-quality structure models. A bioinformatics tool to refine alignments is needed to make protein alignments more accurate. We developed the SFESA web server to refine pairwise protein sequence alignments. Compared to the previous version of SFESA, which required a set of 3D coordinates for a protein, the new server will search a sequence database for the closest homolog with an available 3D structure to be used as a template. For each alignment block defined by secondary structure elements in the template, SFESA evaluates alignment variants generated by local shifts and selects the best-scoring alignment variant. A scoring function that combines the sequence score of profile-profile comparison and the structure score of template-derived contact energy is used for evaluation of alignments. PROMALS pairwise alignments refined by SFESA are more accurate than those produced by current advanced alignment methods such as HHpred and CNFpred. In addition, SFESA also improves alignments generated by other software. SFESA is a web-based tool for alignment refinement, designed for researchers to compute, refine, and evaluate pairwise alignments with a combined sequence and structure scoring of alignment blocks. To our knowledge, the SFESA web server is the only tool that refines alignments by evaluating local shifts of secondary structure elements. The SFESA web server is available at http://prodata.swmed.edu/sfesa.
Jones, David T; Singh, Tanya; Kosciolek, Tomasz; Tetchner, Stuart
2015-04-01
Recent developments of statistical techniques to infer direct evolutionary couplings between residue pairs have rendered covariation-based contact prediction a viable means for accurate 3D modelling of proteins, with no information other than the sequence required. To extend the usefulness of contact prediction, we have designed a new meta-predictor (MetaPSICOV) which combines three distinct approaches for inferring covariation signals from multiple sequence alignments, considers a broad range of other sequence-derived features and, uniquely, a range of metrics which describe both the local and global quality of the input multiple sequence alignment. Finally, we use a two-stage predictor, where the second stage filters the output of the first stage. This two-stage predictor is additionally evaluated on its ability to accurately predict the long range network of hydrogen bonds, including correctly assigning the donor and acceptor residues. Using the original PSICOV benchmark set of 150 protein families, MetaPSICOV achieves a mean precision of 0.54 for top-L predicted long range contacts, around 60% higher than PSICOV and around 40% better than CCMpred. In de novo protein structure prediction using FRAGFOLD, MetaPSICOV is able to improve the TM-scores of models by a median of 0.05 compared with PSICOV. Lastly, for predicting long range hydrogen bonding, MetaPSICOV-HB achieves a precision of 0.69 for the top-L/10 hydrogen bonds compared with just 0.26 for the baseline MetaPSICOV. MetaPSICOV is freely available as a web server at http://bioinf.cs.ucl.ac.uk/MetaPSICOV. Raw data (predicted contact lists and 3D models) and source code can be downloaded from http://bioinf.cs.ucl.ac.uk/downloads/MetaPSICOV. Supplementary data are available at Bioinformatics online.
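The "top-L" and "top-L/10" precision figures quoted above can be made concrete with a short sketch of how such a metric is commonly computed, where L is the sequence length. The function name `top_l_precision`, the input layout, and the minimum sequence separation of 24 residues for "long range" are assumptions for illustration; the abstract does not state the exact definition used in the benchmark.

```python
def top_l_precision(predicted, true_contacts, seq_len, fraction=1.0, min_sep=24):
    """Precision of the top (seq_len * fraction) long-range predictions.

    predicted: iterable of (i, j, score) residue pairs.
    true_contacts: set of (i, j) pairs with i < j observed in the structure.
    min_sep: assumed minimum sequence separation for a "long range" pair.
    """
    # Keep only long-range pairs, normalised so that i < j.
    long_range = [
        (min(i, j), max(i, j), score)
        for i, j, score in predicted
        if abs(i - j) >= min_sep
    ]
    # Rank by predicted score, highest first.
    long_range.sort(key=lambda pair: pair[2], reverse=True)
    top = long_range[: max(1, int(seq_len * fraction))]
    if not top:
        return 0.0
    hits = sum(1 for i, j, _ in top if (i, j) in true_contacts)
    return hits / len(top)

# Toy example with made-up predictions for a hypothetical protein.
predictions = [(1, 30, 0.9), (2, 40, 0.8), (5, 10, 0.95), (3, 50, 0.1)]
observed = {(1, 30), (3, 50)}
precision_top_l = top_l_precision(predictions, observed, seq_len=2)
precision_top_l10 = top_l_precision(predictions, observed, seq_len=2, fraction=0.1)
```

In the toy example the short-range pair (5, 10) is ignored despite its high score, because only pairs at least `min_sep` residues apart count as long range.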
Student Interactives--A new Tool for Exploring Science.
NASA Astrophysics Data System (ADS)
Turner, C.
2005-05-01
Science NetLinks (SNL), a national program that provides online teacher resources created by the American Association for the Advancement of Science (AAAS), has proven to be a leader among educational resource providers in bringing free, high-quality, grade-appropriate materials to the national teaching community in a format that facilitates classroom integration. Now in its ninth year on the Web, Science NetLinks is part of the MarcoPolo Consortium of Web sites and associated state-based training initiatives that help teachers integrate Internet content into the classroom. SNL is a national presence in the K-12 science education community serving over 700,000 teachers each year, who visit the site at least three times a month. SNL features: high-quality, innovative, original lesson plans aligned to Project 2061 Benchmarks for Science Literacy; original Internet-based interactives and learning challenges; reviewed Web resources and demonstrations; and award-winning, 60-second audio news features (Science Updates). Science NetLinks has an expansive and growing library of this educational material, aligned and sortable by grade band or benchmark. The program currently offers over 500 lessons, covering 72% of the Benchmarks for Science Literacy content areas in grades K-12. Over the past several years, there has been a strong movement to create online resources that support earth and space science education. Funding for various online educational materials has been available from many sources and has produced a variety of useful products for the education community. Teachers, through the Internet, potentially have access to thousands of activities, lessons and multimedia interactive applications for use in the classroom. But, with so many resources available, it is increasingly difficult for educators to locate quality resources that are aligned to standards and learning goals.
To ensure that the education community utilizes the resources, the material must conform to a format that allows easy understanding, evaluation and integration. Science NetLinks' material has been proven to satisfy these criteria and serve thousands of teachers every year. All online interactive materials that are created by AAAS are aligned to AAAS Project 2061 Benchmarks, which mirror National Science Standards, and are developed based on a rigorous set of criteria. For the purpose of this forum we will provide an overview that explains the need for more of these materials in the earth and space education, a review of the criteria for creating these materials and show examples of online materials created by AAAS that support earth and space science.
Iterative non-sequential protein structural alignment.
Salem, Saeed; Zaki, Mohammed J; Bystroff, Christopher
2009-06-01
Structural similarity between proteins gives us insights into their evolutionary relationships when there is low sequence similarity. In this paper, we present a novel approach called SNAP for non-sequential pair-wise structural alignment. Starting from an initial alignment, our approach iterates over a two-step process consisting of a superposition step and an alignment step, until convergence. We propose a novel greedy algorithm to construct both sequential and non-sequential alignments. The quality of SNAP alignments was assessed by comparing against the manually curated reference alignments in the challenging SISY and RIPC datasets. Moreover, when applied to a dataset of 4410 protein pairs selected from the CATH database, SNAP produced longer alignments with lower rmsd than several state-of-the-art alignment methods. Classification of folds using SNAP alignments was both highly sensitive and highly selective. The SNAP software along with the datasets are available online at http://www.cs.rpi.edu/~zaki/software/SNAP.
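To illustrate what a greedy, non-sequential alignment step can look like, the sketch below pairs residues of two already superposed structures by increasing distance and then reports the rmsd over the matched pairs. This is not the SNAP algorithm itself, whose details the abstract does not give; the function names, the 4.0 Å cutoff, and the toy coordinates are assumptions for illustration, and the superposition step is taken as already done.

```python
import math

def greedy_match(coords_a, coords_b, cutoff=4.0):
    """Greedily pair residues of two superposed structures.

    Candidate pairs closer than `cutoff` are visited from nearest to
    farthest; a pair is accepted if neither residue is already matched.
    Chain order is ignored, so the result may be non-sequential.
    """
    candidates = sorted(
        (math.dist(a, b), i, j)
        for i, a in enumerate(coords_a)
        for j, b in enumerate(coords_b)
        if math.dist(a, b) <= cutoff
    )
    used_a, used_b, matched = set(), set(), []
    for _, i, j in candidates:
        if i not in used_a and j not in used_b:
            matched.append((i, j))
            used_a.add(i)
            used_b.add(j)
    return sorted(matched)

def rmsd(coords_a, coords_b, matched):
    """Root-mean-square deviation over the matched residue pairs."""
    if not matched:
        raise ValueError("no matched pairs")
    total = sum(math.dist(coords_a[i], coords_b[j]) ** 2 for i, j in matched)
    return math.sqrt(total / len(matched))

# Toy C-alpha coordinates (made up); structure B lists its residues
# in a different order from structure A.
structure_a = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (6.0, 0.0, 0.0)]
structure_b = [(6.0, 0.0, 1.0), (0.0, 0.0, 1.0), (20.0, 0.0, 0.0)]
pairs = greedy_match(structure_a, structure_b)
deviation = rmsd(structure_a, structure_b, pairs)
```

In an iterative scheme of the kind the abstract describes, the matched pairs would feed the next superposition, and the two steps would repeat until the set of pairs stops changing.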
A Review of Flood Loss Models as Basis for Harmonization and Benchmarking
Gerl, Tina; Kreibich, Heidi; Franco, Guillermo; Marechal, David; Schröter, Kai
2016-01-01
Risk-based approaches have been increasingly accepted and operationalized in flood risk management during recent decades. For instance, commercial flood risk models are used by the insurance industry to assess potential losses, establish the pricing of policies and determine reinsurance needs. Despite considerable progress in the development of loss estimation tools since the 1980s, loss estimates still reflect high uncertainties and disparities that often lead to questioning their quality. This requires an assessment of the validity and robustness of loss models as it affects prioritization and investment decision in flood risk management as well as regulatory requirements and business decisions in the insurance industry. Hence, more effort is needed to quantify uncertainties and undertake validations. Due to a lack of detailed and reliable flood loss data, first order validations are difficult to accomplish, so that model comparisons in terms of benchmarking are essential. It is checked if the models are informed by existing data and knowledge and if the assumptions made in the models are aligned with the existing knowledge. When this alignment is confirmed through validation or benchmarking exercises, the user gains confidence in the models. Before these benchmarking exercises are feasible, however, a cohesive survey of existing knowledge needs to be undertaken. With that aim, this work presents a review of flood loss–or flood vulnerability–relationships collected from the public domain and some professional sources. Our survey analyses 61 sources consisting of publications or software packages, of which 47 are reviewed in detail. This exercise results in probably the most complete review of flood loss models to date containing nearly a thousand vulnerability functions. These functions are highly heterogeneous and only about half of the loss models are found to be accompanied by explicit validation at the time of their proposal. 
As an example, this paper presents an approach for a quantitative comparison of disparate models via the reduction to the joint input variables of all models. Harmonization of models for benchmarking and comparison requires profound insight into the model structures, mechanisms and underlying assumptions. We discuss the possibilities and challenges of model harmonization and of applying the inventory in a benchmarking framework. PMID:27454604
Functional classification of protein structures by local structure matching in graph representation.
Mills, Caitlyn L; Garg, Rohan; Lee, Joslynn S; Tian, Liang; Suciu, Alexandru; Cooperman, Gene; Beuning, Penny J; Ondrechen, Mary Jo
2018-03-31
As a result of high-throughput protein structure initiatives, over 14,400 protein structures have been solved by structural genomics (SG) centers and participating research groups. While the totality of SG data represents a tremendous contribution to genomics and structural biology, reliable functional information for these proteins is generally lacking. Better functional predictions for SG proteins will add substantial value to the structural information already obtained. Our method described herein, Graph Representation of Active Sites for Prediction of Function (GRASP-Func), predicts quickly and accurately the biochemical function of proteins by representing residues at the predicted local active site as graphs rather than in Cartesian coordinates. We compare the GRASP-Func method to our previously reported method, structurally aligned local sites of activity (SALSA), using the ribulose phosphate binding barrel (RPBB), 6-hairpin glycosidase (6-HG), and Concanavalin A-like Lectins/Glucanase (CAL/G) superfamilies as test cases. In each of the superfamilies, SALSA and the much faster method GRASP-Func yield similar correct classification of previously characterized proteins, providing a validated benchmark for the new method. In addition, we analyzed SG proteins using our SALSA and GRASP-Func methods to predict function. Forty-one SG proteins in the RPBB superfamily, nine SG proteins in the 6-HG superfamily, and one SG protein in the CAL/G superfamily were successfully classified into one of the functional families in their respective superfamily by both methods. This improved, faster, validated computational method can yield more reliable predictions of function that can be used for a wide variety of applications by the community.
Kann, Maricel G.; Sheetlin, Sergey L.; Park, Yonil; Bryant, Stephen H.; Spouge, John L.
2007-01-01
The sequencing of complete genomes has created a pressing need for automated annotation of gene function. Because domains are the basic units of protein function and evolution, a gene can be annotated from a domain database by aligning domains to the corresponding protein sequence. Ideally, complete domains are aligned to protein subsequences, in a ‘semi-global alignment’. Local alignment, which aligns pieces of domains to subsequences, is common in high-throughput annotation applications, however. It is a mature technique, with the heuristics and accurate E-values required for screening large databases and evaluating the screening results. Hidden Markov models (HMMs) provide an alternative theoretical framework for semi-global alignment, but their use is limited because they lack heuristic acceleration and accurate E-values. Our new tool, GLOBAL, overcomes some limitations of previous semi-global HMMs: it has accurate E-values and the possibility of the heuristic acceleration required for high-throughput applications. Moreover, according to a standard of truth based on protein structure, two semi-global HMM alignment tools (GLOBAL and HMMer) had comparable performance in identifying complete domains, but distinctly outperformed two tools based on local alignment. When searching for complete protein domains, therefore, GLOBAL avoids disadvantages commonly associated with HMMs, yet maintains their superior retrieval performance. PMID:17596268
PASS2: an automated database of protein alignments organised as structural superfamilies.
Bhaduri, Anirban; Pugalenthi, Ganesan; Sowdhamini, Ramanathan
2004-04-02
The functional selection and three-dimensional structural constraints of proteins in nature often relate to the retention of significant sequence similarity between proteins of similar fold and function despite poor sequence identity. Organization of structure-based sequence alignments for distantly related proteins provides a map of the conserved and critical regions of the protein universe that is useful for the analysis of folding principles, for the evolutionary unification of protein families and for maximizing the information return from experimental structure determination. The Protein Alignment organised as Structural Superfamily (PASS2) database represents continuously updated, structural alignments for evolutionary related, sequentially distant proteins. An automated and updated version of PASS2 is in direct correspondence with SCOP 1.63 and consists of sequences having identity below 40% among themselves. Protein domains have been grouped into 628 multi-member superfamilies and 566 single member superfamilies. Structure-based sequence alignments for the superfamilies have been obtained using COMPARER, while initial equivalencies have been derived from a preliminary superposition using LSQMAN or STAMP 4.0. The final sequence alignments have been annotated for structural features using JOY4.0. The database is supplemented with sequence relatives belonging to different genomes, conserved spatially interacting and structural motifs, probabilistic hidden Markov models of superfamilies based on the alignments and useful links to other databases. Probabilistic models and sensitive position specific profiles obtained from reliable superfamily alignments aid annotation of remote homologues and are useful tools in structural and functional genomics. PASS2 presents the phylogeny of its members both based on sequence and structural dissimilarities. Clustering of members allows us to understand diversification of the family members.
The search engine has been improved for simpler browsing of the database. The database resolves alignments among the structural domains consisting of evolutionarily diverged set of sequences. Availability of reliable sequence alignments of distantly related proteins despite poor sequence identity and single-member superfamilies permit better sampling of structures in libraries for fold recognition of new sequences and for the understanding of protein structure-function relationships of individual superfamilies. PASS2 is accessible at http://www.ncbs.res.in/~faculty/mini/campass/pass2.html
Accurate protein structure modeling using sparse NMR data and homologous structure information.
Thompson, James M; Sgourakis, Nikolaos G; Liu, Gaohua; Rossi, Paolo; Tang, Yuefeng; Mills, Jeffrey L; Szyperski, Thomas; Montelione, Gaetano T; Baker, David
2012-06-19
While information from homologous structures plays a central role in X-ray structure determination by molecular replacement, such information is rarely used in NMR structure determination because it can be incorrect, both locally and globally, when evolutionary relationships are inferred incorrectly or there has been considerable evolutionary structural divergence. Here we describe a method that allows robust modeling of protein structures of up to 225 residues by combining (1)H(N), (13)C, and (15)N backbone and (13)Cβ chemical shift data, distance restraints derived from homologous structures, and a physically realistic all-atom energy function. Accurate models are distinguished from inaccurate models generated using incorrect sequence alignments by requiring that (i) the all-atom energies of models generated using the restraints are lower than models generated in unrestrained calculations and (ii) the low-energy structures converge to within 2.0 Å backbone rmsd over 75% of the protein. Benchmark calculations on known structures and blind targets show that the method can accurately model protein structures, even with very remote homology information, to a backbone rmsd of 1.2-1.9 Å relative to the conventionally determined NMR ensembles and of 0.9-1.6 Å relative to X-ray structures for well-defined regions of the protein structures. This approach facilitates the accurate modeling of protein structures using backbone chemical shift data without need for side-chain resonance assignments and extensive analysis of NOESY cross-peak assignments.
Genetically improved BarraCUDA.
Langdon, W B; Lam, Brian Yee Hong
2017-01-01
BarraCUDA is an open source C program which uses the BWA algorithm in parallel with nVidia CUDA to align short next generation DNA sequences against a reference genome. Recently its source code was optimised using "Genetic Improvement". The genetically improved (GI) code is up to three times faster on short paired end reads from The 1000 Genomes Project and 60% more accurate on a short BioPlanet.com GCAT alignment benchmark. GPGPU BarraCUDA running on a single K80 Tesla GPU can align short paired end nextGen sequences up to ten times faster than bwa on a 12 core server. The speed up was such that the GI version was adopted and has been regularly downloaded from SourceForge for more than 12 months.
Aligning Biomolecular Networks Using Modular Graph Kernels
NASA Astrophysics Data System (ADS)
Towfic, Fadi; Greenlee, M. Heather West; Honavar, Vasant
Comparative analysis of biomolecular networks constructed using measurements from different conditions, tissues, and organisms offers a powerful approach to understanding the structure, function, dynamics, and evolution of complex biological systems. We explore a class of algorithms for aligning large biomolecular networks by breaking down such networks into subgraphs and computing the alignment of the networks based on the alignment of their subgraphs. The resulting subnetworks are compared using graph kernels as scoring functions. We provide implementations of the resulting algorithms as part of BiNA, an open source biomolecular network alignment toolkit. Our experiments using Drosophila melanogaster, Saccharomyces cerevisiae, Mus musculus and Homo sapiens protein-protein interaction networks extracted from the DIP repository of protein-protein interaction data demonstrate that the performance of the proposed algorithms (as measured by % GO term enrichment of subnetworks identified by the alignment) is competitive with some of the state-of-the-art algorithms for pair-wise alignment of large protein-protein interaction networks. Our results also show that the inter-species similarity scores computed based on graph kernels can be used to cluster the species into a species tree that is consistent with the known phylogenetic relationships among the species.
Optimal network alignment with graphlet degree vectors.
Milenković, Tijana; Ng, Weng Leong; Hayes, Wayne; Przulj, Natasa
2010-06-30
Important biological information is encoded in the topology of biological networks. Comparative analyses of biological networks are proving to be valuable, as they can lead to transfer of knowledge between species and give deeper insights into biological function, disease, and evolution. We introduce a new method that uses the Hungarian algorithm to produce optimal global alignment between two networks using any cost function. We design a cost function based solely on network topology and use it in our network alignment. Our method can be applied to any two networks, not just biological ones, since it is based only on network topology. We use our new method to align protein-protein interaction networks of two eukaryotic species and demonstrate that our alignment exposes large and topologically complex regions of network similarity. At the same time, our alignment is biologically valid, since many of the aligned protein pairs perform the same biological function. From the alignment, we predict function of yet unannotated proteins, many of which we validate in the literature. Also, we apply our method to find topological similarities between metabolic networks of different species and build phylogenetic trees based on our network alignment score. The phylogenetic trees obtained in this way bear a striking resemblance to the ones obtained by sequence alignments. Our method detects topologically similar regions in large networks that are statistically significant. It does this independent of protein sequence or any other information external to network topology.
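As an illustration of the optimal-assignment step described in this abstract, the following minimal sketch runs a textbook Hungarian algorithm on a small cost matrix. The matrix values and the function name `hungarian` are hypothetical; in the paper the costs come from a topology-based cost function, which is not reproduced here.

```python
def hungarian(cost):
    """Minimum-cost assignment of rows to columns (requires rows <= columns).

    Returns (total_cost, assignment) where assignment[i] is the column
    chosen for row i. Classic O(n^2 * m) potentials-based formulation.
    """
    n, m = len(cost), len(cost[0])
    assert n <= m, "expects at least as many columns as rows"
    INF = float("inf")
    u = [0.0] * (n + 1)      # row potentials (1-indexed)
    v = [0.0] * (m + 1)      # column potentials (1-indexed)
    p = [0] * (m + 1)        # p[j] = row currently matched to column j, 0 if none
    way = [0] * (m + 1)      # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0 = p[j0]
            delta = INF
            j1 = 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j] = cur
                        way[j] = j0
                    if minv[j] < delta:
                        delta = minv[j]
                        j1 = j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matching along the augmenting path that was found.
        while j0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [-1] * n
    for j in range(1, m + 1):
        if p[j]:
            assignment[p[j] - 1] = j - 1
    total = sum(cost[i][assignment[i]] for i in range(n))
    return total, assignment


# Hypothetical costs: entry [i][j] is the cost of aligning node i of one
# network to node j of the other (lower means more similar).
toy_cost = [[4, 1, 3],
            [2, 0, 5],
            [3, 2, 2]]
total, assignment = hungarian(toy_cost)
print(total, assignment)
```

Because the routine only sees a cost matrix, any node-to-node cost function can be plugged in, which is the property the abstract highlights.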
Two-Stream Transformer Networks for Video-based Face Alignment.
Liu, Hao; Lu, Jiwen; Feng, Jianjiang; Zhou, Jie
2017-08-01
In this paper, we propose a two-stream transformer networks (TSTN) approach for video-based face alignment. Unlike conventional image-based face alignment approaches which cannot explicitly model the temporal dependency in videos and motivated by the fact that consistent movements of facial landmarks usually occur across consecutive frames, our TSTN aims to capture the complementary information of both the spatial appearance on still frames and the temporal consistency information across frames. To achieve this, we develop a two-stream architecture, which decomposes the video-based face alignment into spatial and temporal streams accordingly. Specifically, the spatial stream aims to transform the facial image to the landmark positions by preserving the holistic facial shape structure. Accordingly, the temporal stream encodes the video input as active appearance codes, where the temporal consistency information across frames is captured to help shape refinements. Experimental results on the benchmarking video-based face alignment datasets show very competitive performance of our method in comparison with the state of the art.
Improved image alignment method in application to X-ray images and biological images.
Wang, Ching-Wei; Chen, Hsiang-Chou
2013-08-01
Alignment of medical images is a vital component of a large number of applications throughout the clinical track of events; not only within clinical diagnostic settings, but prominently so in the area of planning, consummation and evaluation of surgical and radiotherapeutical procedures. However, image registration of medical images is challenging because of variations in data appearance, imaging artifacts and complex data deformation problems. Hence, the aim of this study is to develop a robust image alignment method for medical images. An improved image registration method is proposed, and the method is evaluated with two types of medical data, including biological microscopic tissue images and dental X-ray images and compared with five state-of-the-art image registration techniques. The experimental results show that the presented method consistently performs well on both types of medical images, achieving 88.44 and 88.93% averaged registration accuracies for biological tissue images and X-ray images, respectively, and outperforms the benchmark methods. Based on Tukey's honestly significant difference test and Fisher's least significant difference test, the presented method performs significantly better than all existing methods (P ≤ 0.001) for tissue image alignment, and for the X-ray image registration, the proposed method performs significantly better than the two benchmark b-spline approaches (P < 0.001). The software implementation of the presented method and the data used in this study are made publicly available for scientific communities to use (http://www-o.ntust.edu.tw/∼cweiwang/ImprovedImageRegistration/).
Benchmarking protein classification algorithms via supervised cross-validation.
Kertész-Farkas, Attila; Dhir, Somdutta; Sonego, Paolo; Pacurar, Mircea; Netoteia, Sergiu; Nijveen, Harm; Kuzniar, Arnold; Leunissen, Jack A M; Kocsor, András; Pongor, Sándor
2008-04-24
Development and testing of protein classification algorithms are hampered by the fact that the protein universe is characterized by groups vastly different in the number of members, in average protein size, similarity within group, etc. Datasets based on traditional cross-validation (k-fold, leave-one-out, etc.) may not give reliable estimates on how an algorithm will generalize to novel, distantly related subtypes of the known protein classes. Supervised cross-validation, i.e., selection of test and train sets according to the known subtypes within a database has been successfully used earlier in conjunction with the SCOP database. Our goal was to extend this principle to other databases and to design standardized benchmark datasets for protein classification. Hierarchical classification trees of protein categories provide a simple and general framework for designing supervised cross-validation strategies for protein classification. Benchmark datasets can be designed at various levels of the concept hierarchy using a simple graph-theoretic distance. A combination of supervised and random sampling was selected to construct reduced size model datasets, suitable for algorithm comparison. Over 3000 new classification tasks were added to our recently established protein classification benchmark collection that currently includes protein sequence (including protein domains and entire proteins), protein structure and reading frame DNA sequence data. We carried out an extensive evaluation based on various machine-learning algorithms such as nearest neighbor, support vector machines, artificial neural networks, random forests and logistic regression, used in conjunction with comparison algorithms, BLAST, Smith-Waterman, Needleman-Wunsch, as well as 3D comparison methods DALI and PRIDE. The resulting datasets provide lower, and in our opinion more realistic estimates of the classifier performance than do random cross-validation schemes. 
A combination of supervised and random sampling was used to construct model datasets, suitable for algorithm comparison.
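The idea of selecting test and train sets according to known subtypes can be sketched as a leave-one-subtype-out split. This is a simplified illustration, not the benchmark collection's actual procedure; the record layout, labels and the function name `leave_one_subtype_out` are assumptions.

```python
from collections import defaultdict


def leave_one_subtype_out(records):
    """Yield (held_out_group, train_ids, test_ids) for each (class, subtype) group.

    records: iterable of (protein_id, class_label, subtype_label) tuples.
    Every member of the held-out subtype goes to the test set, so the
    classifier is always evaluated on a subtype it has never seen.
    """
    groups = defaultdict(list)
    for protein_id, class_label, subtype in records:
        groups[(class_label, subtype)].append(protein_id)
    for held_out, test_ids in groups.items():
        train_ids = [pid
                     for key, ids in groups.items() if key != held_out
                     for pid in ids]
        yield held_out, train_ids, list(test_ids)


# Hypothetical toy records: two subtypes of one class and one of another.
toy_records = [
    ("p1", "classA", "subtype1"),
    ("p2", "classA", "subtype1"),
    ("p3", "classA", "subtype2"),
    ("p4", "classB", "subtype1"),
    ("p5", "classB", "subtype1"),
]
for held_out, train_ids, test_ids in leave_one_subtype_out(toy_records):
    print(held_out, train_ids, test_ids)
```

In contrast to random k-fold splitting, no subtype ever contributes members to both sides of a split, which is why such estimates tend to be lower.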
DNA Multiple Sequence Alignment Guided by Protein Domains: The MSA-PAD 2.0 Method.
Balech, Bachir; Monaco, Alfonso; Perniola, Michele; Santamaria, Monica; Donvito, Giacinto; Vicario, Saverio; Maggi, Giorgio; Pesole, Graziano
2018-01-01
Multiple sequence alignment (MSA) is a fundamental component in many DNA sequence analyses including metagenomics studies and phylogeny inference. When guided by protein profiles, DNA multiple alignments assume a higher precision and robustness. Here we present details of the use of the upgraded version of MSA-PAD (2.0), which is a DNA multiple sequence alignment framework able to align DNA sequences coding for single/multiple protein domains guided by PFAM or user-defined annotations. MSA-PAD has two alignment strategies, called "Gene" and "Genome," accounting for coding domains order and genomic rearrangements, respectively. Novel options were added to the present version, where the MSA can be guided by protein profiles provided by the user. This allows MSA-PAD 2.0 to run faster and to add custom protein profiles sometimes not present in PFAM database according to the user's interest. MSA-PAD 2.0 is currently freely available as a Web application at https://recasgateway.cloud.ba.infn.it/ .
Konc, Janez; Cesnik, Tomo; Konc, Joanna Trykowska; Penca, Matej; Janežič, Dušanka
2012-02-27
ProBiS-Database is a searchable repository of precalculated local structural alignments in proteins detected by the ProBiS algorithm in the Protein Data Bank. Identification of functionally important binding regions of the protein is facilitated by structural similarity scores mapped to the query protein structure. PDB structures that have been aligned with a query protein may be rapidly retrieved from the ProBiS-Database, which is thus able to generate hypotheses concerning the roles of uncharacterized proteins. Presented with uncharacterized protein structure, ProBiS-Database can discern relationships between such a query protein and other better known proteins in the PDB. Fast access and a user-friendly graphical interface promote easy exploration of this database of over 420 million local structural alignments. The ProBiS-Database is updated weekly and is freely available online at http://probis.cmm.ki.si/database.
Ontology for Semantic Data Integration in the Domain of IT Benchmarking.
Pfaff, Matthias; Neubig, Stefan; Krcmar, Helmut
2018-01-01
A domain-specific ontology for IT benchmarking has been developed to bridge the gap between a systematic characterization of IT services and their data-based valuation. Since information is generally collected during a benchmark exercise using questionnaires on a broad range of topics, such as employee costs, software licensing costs, and quantities of hardware, it is commonly stored as natural language text; thus, this information is stored in an intrinsically unstructured form. Although these data form the basis for identifying potentials for IT cost reductions, neither a uniform description of any measured parameters nor the relationship between such parameters exists. Hence, this work proposes an ontology for the domain of IT benchmarking, available at https://w3id.org/bmontology. The design of this ontology is based on requirements mainly elicited from a domain analysis, which considers analyzing documents and interviews with representatives from Small- and Medium-Sized Enterprises and Information and Communications Technology companies over the last eight years. The development of the ontology and its main concepts is described in detail (i.e., the conceptualization of benchmarking events, questionnaires, IT services, indicators and their values) together with its alignment with the DOLCE-UltraLite foundational ontology.
Kuharev, Jörg; Navarro, Pedro; Distler, Ute; Jahn, Olaf; Tenzer, Stefan
2015-09-01
Label-free quantification (LFQ) based on data-independent acquisition workflows currently experiences increasing popularity. Several software tools have been recently published or are commercially available. The present study focuses on the evaluation of three different software packages (Progenesis, synapter, and ISOQuant) supporting ion mobility enhanced data-independent acquisition data. In order to benchmark the LFQ performance of the different tools, we generated two hybrid proteome samples of defined quantitative composition containing tryptically digested proteomes of three different species (mouse, yeast, Escherichia coli). This model dataset simulates complex biological samples containing large numbers of both unregulated (background) proteins as well as up- and downregulated proteins with exactly known ratios between samples. We determined the number and dynamic range of quantifiable proteins and analyzed the influence of applied algorithms (retention time alignment, clustering, normalization, etc.) on quantification results. Analysis of technical reproducibility revealed median coefficients of variation of reported protein abundances below 5% for MS(E) data for Progenesis and ISOQuant. Regarding accuracy of LFQ, evaluation with synapter and ISOQuant yielded superior results compared to Progenesis. In addition, we discuss reporting formats and user friendliness of the software packages. The data generated in this study have been deposited to the ProteomeXchange Consortium with identifier PXD001240 (http://proteomecentral.proteomexchange.org/dataset/PXD001240).
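The reproducibility figure quoted above is a median coefficient of variation (CV) across proteins. A minimal sketch of that calculation follows; the abundance values are invented, and the use of the sample standard deviation is an assumption, since the abstract does not say which estimator was used.

```python
from statistics import mean, median, stdev


def coefficient_of_variation(values):
    """CV in percent: sample standard deviation divided by the mean."""
    return stdev(values) / mean(values) * 100.0


def median_cv(replicates_by_protein):
    """Median of the per-protein CVs over technical replicates."""
    return median(coefficient_of_variation(v)
                  for v in replicates_by_protein.values())


# Hypothetical abundances for three proteins, three technical replicates each.
toy_abundances = {
    "protein_1": [100.0, 102.0, 98.0],
    "protein_2": [50.0, 50.0, 50.0],
    "protein_3": [10.0, 12.0, 8.0],
}
print(median_cv(toy_abundances))
```

Using the median rather than the mean keeps a few poorly reproduced proteins from dominating the summary.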
Small-molecule ligand docking into comparative models with Rosetta
Combs, Steven A; DeLuca, Samuel L; DeLuca, Stephanie H; Lemmon, Gordon H; Nannemann, David P; Nguyen, Elizabeth D; Willis, Jordan R; Sheehan, Jonathan H; Meiler, Jens
2017-01-01
Structure-based drug design is frequently used to accelerate the development of small-molecule therapeutics. Although substantial progress has been made in X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy, the availability of high-resolution structures is limited owing to the frequent inability to crystallize or obtain sufficient NMR restraints for large or flexible proteins. Computational methods can be used to both predict unknown protein structures and model ligand interactions when experimental data are unavailable. This paper describes a comprehensive and detailed protocol using the Rosetta modeling suite to dock small-molecule ligands into comparative models. In the protocol presented here, we review the comparative modeling process, including sequence alignment, threading and loop building. Next, we cover docking a small-molecule ligand into the protein comparative model. In addition, we discuss criteria that can improve ligand docking into comparative models. Finally, and importantly, we present a strategy for assessing model quality. The entire protocol is presented on a single example selected solely for didactic purposes. The results are therefore not representative and do not replace benchmarks published elsewhere. We also provide an additional tutorial so that the user can gain hands-on experience in using Rosetta. The protocol should take 5–7 h, with additional time allocated for computer generation of models. PMID:23744289
Arora, Sanjeevani; Huwe, Peter J.; Sikder, Rahmat; Shah, Manali; Browne, Amanda J.; Lesh, Randy; Nicolas, Emmanuelle; Deshpande, Sanat; Hall, Michael J.; Dunbrack, Roland L.; Golemis, Erica A.
2017-01-01
The cancer-predisposing Lynch Syndrome (LS) arises from germline mutations in DNA mismatch repair (MMR) genes, predominantly MLH1, MSH2, MSH6, and PMS2. A major challenge for clinical diagnosis of LS is the frequent identification of variants of uncertain significance (VUS) in these genes, as it is often difficult to determine variant pathogenicity, particularly for missense variants. Generic programs such as SIFT and PolyPhen-2, and MMR gene-specific programs such as PON-MMR and MAPP-MMR, are often used to predict deleterious or neutral effects of VUS in MMR genes. We evaluated the performance of multiple predictive programs in the context of functional biologic data for 15 VUS in MLH1, MSH2, and PMS2. Using cell line models, we characterized VUS predicted to range from neutral to pathogenic on mRNA and protein expression, basal cellular viability, viability following treatment with a panel of DNA-damaging agents, and functionality in DNA damage response (DDR) signaling, benchmarking to wild-type MMR proteins. Our results suggest that the MMR gene-specific classifiers do not always align with the experimental phenotypes related to DDR. Our study highlights the importance of complementary experimental and computational assessment to develop future predictors for the assessment of VUS. PMID:28494185
PROPER: global protein interaction network alignment through percolation matching.
Kazemi, Ehsan; Hassani, Hamed; Grossglauser, Matthias; Pezeshgi Modarres, Hassan
2016-12-12
The alignment of protein-protein interaction (PPI) networks enables us to uncover the relationships between different species, which leads to a deeper understanding of biological systems. Network alignment can be used to transfer biological knowledge between species. Although different PPI-network alignment algorithms were introduced during the last decade, developing an accurate and scalable algorithm that can find alignments with high biological and structural similarities among PPI networks is still challenging. In this paper, we introduce a new global network alignment algorithm for PPI networks called PROPER. Compared to other global network alignment methods, our algorithm shows higher accuracy and speed over real PPI datasets and synthetic networks. We show that the PROPER algorithm can detect large portions of conserved biological pathways between species. Also, using a simple parsimonious evolutionary model, we explain why PROPER performs well based on several different comparison criteria. We highlight that PROPER has high potential in further applications such as detecting biological pathways, finding protein complexes and PPI prediction. The PROPER algorithm is available at http://proper.epfl.ch .
Nielsen, Morten; Justesen, Sune; Lund, Ole; Lundegaard, Claus; Buus, Søren
2010-11-13
Binding of peptides to Major Histocompatibility class II (MHC-II) molecules plays a central role in governing responses of the adaptive immune system. MHC-II molecules sample peptides from the extracellular space allowing the immune system to detect the presence of foreign microbes from this compartment. Predicting which peptides bind to an MHC-II molecule is therefore of pivotal importance for understanding the immune response and its effect on host-pathogen interactions. The experimental cost associated with characterizing the binding motif of an MHC-II molecule is significant and large efforts have therefore been placed in developing accurate computer methods capable of predicting this binding event. Prediction of peptide binding to MHC-II is complicated by the open binding cleft of the MHC-II molecule, allowing binding of peptides extending out of the binding groove. Moreover, the genes encoding the MHC molecules are immensely diverse leading to a large set of different MHC molecules each potentially binding a unique set of peptides. Characterizing each MHC-II molecule using peptide-screening binding assays is hence not a viable option. Here, we present an MHC-II binding prediction algorithm aiming at dealing with these challenges. The method is a pan-specific version of the earlier published allele-specific NN-align algorithm and does not require any pre-alignment of the input data. This allows the method to benefit also from information from alleles covered by limited binding data. The method is evaluated on a large and diverse set of benchmark data, and is shown to significantly out-perform state-of-the-art MHC-II prediction methods. In particular, the method is found to boost the performance for alleles characterized by limited binding data where conventional allele-specific methods tend to achieve poor prediction accuracy.
The method thus shows great potential for efficiently boosting the accuracy of MHC-II binding prediction, as accurate predictions can be obtained for novel alleles at highly reduced experimental costs. Pan-specific binding predictions can be obtained for all alleles with known protein sequence and the method can benefit by including data in the training from alleles even where only few binders are known. The method and benchmark data are available at http://www.cbs.dtu.dk/services/NetMHCIIpan-2.0.
Aligned Immobilization of Proteins Using AC Electric Fields.
Laux, Eva-Maria; Knigge, Xenia; Bier, Frank F; Wenger, Christian; Hölzel, Ralph
2016-03-01
Protein molecules are aligned and immobilized from solution by AC electric fields. In a single-step experiment, the enhanced green fluorescent proteins are immobilized on the surface as well as at the edges of planar nanoelectrodes. Alignment is found to follow the molecules' geometrical shape with their longitudinal axes parallel to the electric field. Simultaneous dielectrophoretic attraction and AC electroosmotic flow are identified as the dominant forces causing protein movement and alignment. Molecular orientation is determined by fluorescence microscopy based on polarized excitation of the proteins' chromophores. The chromophores' orientation with respect to the whole molecule supports X-ray crystal data.
Protein fold recognition using geometric kernel data fusion.
Zakeri, Pooya; Jeuris, Ben; Vandebril, Raf; Moreau, Yves
2014-07-01
Various approaches based on features extracted from protein sequences and often machine learning methods have been used in the prediction of protein folds. Finding an efficient technique for integrating these different protein features has received increasing attention. In particular, kernel methods are an interesting class of techniques for integrating heterogeneous data. Various methods have been proposed to fuse multiple kernels. Most techniques for multiple kernel learning focus on learning a convex linear combination of base kernels. In addition to the limitation of linear combinations, working with such approaches could cause a loss of potentially useful information. We design several techniques to combine kernel matrices by taking more involved, geometry inspired means of these matrices instead of convex linear combinations. We consider various sequence-based protein features including information extracted directly from position-specific scoring matrices and local sequence alignment. We evaluate our methods for classification on the SCOP PDB-40D benchmark dataset for protein fold recognition. The best overall accuracy on the protein fold recognition test set obtained by our methods is ∼ 86.7%. This is an improvement over the results of the best existing approach. Moreover, our computational model has been developed by incorporating the functional domain composition of proteins through a hybridization model. It is observed that by using our proposed hybridization model, the protein fold recognition accuracy is further improved to 89.30%. Furthermore, we investigate the performance of our approach on the protein remote homology detection problem by fusing multiple string kernels. The MATLAB code used for our proposed geometric kernel fusion frameworks is publicly available at http://people.cs.kuleuven.be/∼raf.vandebril/homepage/software/geomean.php?menu=5/.
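To make the contrast between a convex linear combination and a geometry-inspired mean concrete, the toy sketch below computes the matrix geometric mean A#B of two 2×2 symmetric positive definite matrices using the closed form that holds in the 2×2 case, and compares it with the arithmetic mean. This is not the authors' MATLAB code; the matrices and function names are hypothetical, and real kernel matrices are far larger and need an iterative or eigendecomposition-based routine.

```python
import math


def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def inv2(m):
    """Inverse of a 2x2 matrix."""
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]


def matmul2(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def arithmetic_mean_2x2(a, b):
    """Equal-weight convex combination (A + B) / 2."""
    return [[(a[i][j] + b[i][j]) / 2.0 for j in range(2)] for i in range(2)]


def geometric_mean_2x2(a, b):
    """Matrix geometric mean A # B for 2x2 symmetric positive definite inputs.

    Closed form valid only in the 2x2 case:
    A # B = sqrt(alpha*beta) * M / sqrt(det M),
    with alpha = sqrt(det A), beta = sqrt(det B), M = A/alpha + B/beta.
    """
    alpha = math.sqrt(det2(a))
    beta = math.sqrt(det2(b))
    m = [[a[i][j] / alpha + b[i][j] / beta for j in range(2)] for i in range(2)]
    scale = math.sqrt(alpha * beta) / math.sqrt(det2(m))
    return [[scale * m[i][j] for j in range(2)] for i in range(2)]


# Two hypothetical 2x2 "kernel" matrices.
k1 = [[2.0, 1.0], [1.0, 2.0]]
k2 = [[1.0, 0.0], [0.0, 4.0]]
print(arithmetic_mean_2x2(k1, k2))
print(geometric_mean_2x2(k1, k2))
```

One way to sanity-check the result G is the defining relation G A⁻¹ G = B, which the geometric mean satisfies and the arithmetic mean generally does not.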
Integrative network alignment reveals large regions of global network similarity in yeast and human.
Kuchaiev, Oleksii; Przulj, Natasa
2011-05-15
High-throughput methods for detecting molecular interactions have produced large sets of biological network data with much more yet to come. Analogous to sequence alignment, efficient and reliable network alignment methods are expected to improve our understanding of biological systems. Unlike sequence alignment, network alignment is computationally intractable. Hence, devising efficient network alignment heuristics is currently a foremost challenge in computational biology. We introduce a novel network alignment algorithm, called Matching-based Integrative GRAph ALigner (MI-GRAAL), which can integrate any number and type of similarity measures between network nodes (e.g. proteins), including, but not limited to, any topological network similarity measure, sequence similarity, functional similarity and structural similarity. Hence, we resolve the ties in similarity measures and find a combination of similarity measures yielding the largest contiguous (i.e. connected) and biologically sound alignments. MI-GRAAL exposes the largest functional, connected regions of protein-protein interaction (PPI) network similarity to date: surprisingly, it reveals that 77.7% of proteins in the baker's yeast high-confidence PPI network participate in such a subnetwork that is fully contained in the human high-confidence PPI network. This is the first demonstration that species as diverse as yeast and human contain so large, continuous regions of global network similarity. We apply MI-GRAAL's alignments to predict functions of un-annotated proteins in yeast, human and bacteria validating our predictions in the literature. Furthermore, using network alignment scores for PPI networks of different herpes viruses, we reconstruct their phylogenetic relationship. This is the first time that phylogeny is exactly reconstructed from purely topological alignments of PPI networks. Supplementary files and MI-GRAAL executables: http://bio-nets.doc.ic.ac.uk/MI-GRAAL/.
Neuwald, Andrew F
2009-08-01
The patterns of sequence similarity and divergence present within functionally diverse, evolutionarily related proteins contain implicit information about corresponding biochemical similarities and differences. A first step toward accessing such information is to statistically analyze these patterns, which, in turn, requires that one first identify and accurately align a very large set of protein sequences. Ideally, the set should include many distantly related, functionally divergent subgroups. Because it is extremely difficult, if not impossible for fully automated methods to align such sequences correctly, researchers often resort to manual curation based on detailed structural and biochemical information. However, multiply-aligning vast numbers of sequences in this way is clearly impractical. This problem is addressed using Multiply-Aligned Profiles for Global Alignment of Protein Sequences (MAPGAPS). The MAPGAPS program uses a set of multiply-aligned profiles both as a query to detect and classify related sequences and as a template to multiply-align the sequences. It relies on Karlin-Altschul statistics for sensitivity and on PSI-BLAST (and other) heuristics for speed. Using as input a carefully curated multiple-profile alignment for P-loop GTPases, MAPGAPS correctly aligned weakly conserved sequence motifs within 33 distantly related GTPases of known structure. By comparison, the sequence- and structurally based alignment methods hmmalign and PROMALS3D misaligned at least 11 and 23 of these regions, respectively. When applied to a dataset of 65 million protein sequences, MAPGAPS identified, classified and aligned (with comparable accuracy) nearly half a million putative P-loop GTPase sequences. A C++ implementation of MAPGAPS is available at http://mapgaps.igs.umaryland.edu. Supplementary data are available at Bioinformatics online.
Aligning the SEA's Compliance Responsibilities and Performance Objectives. Benchmark. No. 3
ERIC Educational Resources Information Center
Nafziger, D.; Jochim, A.
2013-01-01
Assuring that state and local agencies comply with the requirements of Federal and state programs has been a central feature of SEAs. The purposes of compliance requirements--ensuring both fiscal integrity and that targeted groups receive intended benefits--are important. While such requirements are often well-intentioned, too often they can act…
MANGO: a new approach to multiple sequence alignment.
Zhang, Zefeng; Lin, Hao; Li, Ming
2007-01-01
Multiple sequence alignment is a classical and challenging task for biological sequence analysis. The problem is NP-hard. The full dynamic programming takes too much time. The progressive alignment heuristics adopted by most state of the art multiple sequence alignment programs suffer from the 'once a gap, always a gap' phenomenon. Is there a radically new way to do multiple sequence alignment? This paper introduces a novel and orthogonal multiple sequence alignment method, using multiple optimized spaced seeds and new algorithms to handle these seeds efficiently. Our new algorithm processes information of all sequences as a whole, avoiding problems caused by the popular progressive approaches. Because the optimized spaced seeds are provably significantly more sensitive than the consecutive k-mers, the new approach promises to be more accurate and reliable. To validate our new approach, we have implemented MANGO: Multiple Alignment with N Gapped Oligos. Experiments were carried out on large 16S RNA benchmarks showing that MANGO compares favorably, in both accuracy and speed, against state-of-art multiple sequence alignment methods, including ClustalW 1.83, MUSCLE 3.6, MAFFT 5.861, Prob-ConsRNA 1.11, Dialign 2.2.1, DIALIGN-T 0.2.1, T-Coffee 4.85, POA 2.0 and Kalign 2.0.
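The difference between a consecutive k-mer and a spaced seed can be shown with a short sketch. A seed is written as a pattern of 1s (positions that must match) and 0s (positions that are ignored). The pattern `1101`, the toy sequences and the function names are hypothetical choices for illustration; MANGO's optimized seed sets and its algorithms for handling them are not reproduced here.

```python
def seed_key(seq, pos, pattern):
    """Characters of seq under the '1' positions of pattern, starting at pos."""
    return "".join(seq[pos + k] for k, c in enumerate(pattern) if c == "1")


def seed_hits(a, b, pattern):
    """All (i, j) such that a at i and b at j agree on every '1' position."""
    span = len(pattern)
    index = {}
    for i in range(len(a) - span + 1):
        index.setdefault(seed_key(a, i, pattern), []).append(i)
    hits = []
    for j in range(len(b) - span + 1):
        for i in index.get(seed_key(b, j, pattern), []):
            hits.append((i, j))
    return hits


# Two toy sequences that differ only at index 3 (T versus A).
seq_a = "ACGTTG"
seq_b = "ACGATG"
print(seed_hits(seq_a, seq_b, "1111"))  # consecutive 4-mer
print(seed_hits(seq_a, seq_b, "1101"))  # spaced seed of the same span
```

With these inputs the consecutive 4-mer finds no shared window, while the spaced seed reports a hit whose ignored position falls on the mismatch. Note that `1101` also has lower weight than `1111`, so this toy comparison shows only the mechanism, not the sensitivity result for seeds of equal weight that the abstract refers to.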
Pre-calculated protein structure alignments at the RCSB PDB website.
Prlic, Andreas; Bliven, Spencer; Rose, Peter W; Bluhm, Wolfgang F; Bizon, Chris; Godzik, Adam; Bourne, Philip E
2010-12-01
With the continuous growth of the RCSB Protein Data Bank (PDB), providing an up-to-date systematic structure comparison of all protein structures poses an ever growing challenge. Here, we present a comparison tool for calculating both 1D protein sequence and 3D protein structure alignments. This tool supports various applications at the RCSB PDB website. First, a structure alignment web service calculates pairwise alignments. Second, a stand-alone application runs alignments locally and visualizes the results. Third, pre-calculated 3D structure comparisons for the whole PDB are provided and updated on a weekly basis. These three applications allow users to discover novel relationships between proteins available either at the RCSB PDB or provided by the user. A web user interface is available at http://www.rcsb.org/pdb/workbench/workbench.do. The source code is available under the LGPL license from http://www.biojava.org. A source bundle, prepared for local execution, is available from http://source.rcsb.org.
Alfa, Michelle J; Fatima, Iram; Olson, Nancy
2013-03-01
The study objective was to verify that the adenosine triphosphate (ATP) benchmark of <200 relative light units (RLUs) was achievable in a busy endoscopy clinic that followed the manufacturer's manual cleaning instructions. All channels from patient-used colonoscopes (20) and duodenoscopes (20) in a tertiary care hospital endoscopy clinic were sampled after manual cleaning and tested for residual ATP. The ATP test benchmark for adequate manual cleaning was set at <200 RLUs. The benchmark for protein was <6.4 μg/cm(2), and, for bioburden, it was <4-log10 colony-forming units/cm(2). Our data demonstrated that 96% (115/120) of channels from 20 colonoscopes and 20 duodenoscopes evaluated met the ATP benchmark of <200 RLUs. The 5 channels that exceeded 200 RLUs were all elevator guide-wire channels. All 120 of the manually cleaned endoscopes tested had protein and bioburden levels that were compliant with accepted benchmarks for manual cleaning for suction-biopsy, air-water, and auxiliary water channels. Our data confirmed that, by following the endoscope manufacturer's manual cleaning recommendations, 96% of channels in gastrointestinal endoscopes would have <200 RLUs for the ATP test kit evaluated and would meet the accepted clean benchmarks for protein and bioburden.
Wang, Xu; Le, Anh-Thu; Yu, Chao; Lucchese, R. R.; Lin, C. D.
2016-01-01
We discuss a scheme to retrieve transient conformational molecular structure information using photoelectron angular distributions (PADs) that have averaged over partial alignments of isolated molecules. The photoelectron is pulled out from a localized inner-shell molecular orbital by an X-ray photon. We show that a transient change in the atomic positions from their equilibrium will lead to a sensitive change in the alignment-averaged PADs, which can be measured and used to retrieve the former. Exploiting the experimental convenience of changing the photon polarization direction, we show that it is advantageous to use PADs obtained from multiple photon polarization directions. A simple single-scattering model is proposed and benchmarked to describe the photoionization process and to do the retrieval using a multiple-parameter fitting method. PMID:27025410
NASA Astrophysics Data System (ADS)
Leonardi, Marcelo
The primary purpose of this study was to examine the impact of a scheduling change from a trimester 4x4 block schedule to a modified hybrid schedule on student achievement in ninth grade biology courses. Achievement was measured through teacher-created benchmark assessments in Genetics, DNA, and Evolution and through the California Standardized Test (CST) in Biology. The secondary purpose was to examine ninth grade biology teachers' perceptions of ninth grade biology student achievement. Using a mixed methods research approach, data were collected both quantitatively and qualitatively as aligned to the research questions. Quantitative methods included gathering data from departmental benchmark exams and the CST in Biology and conducting multiple analysis of covariance and analysis of covariance to determine significant differences. Qualitative methods included journal entry questions and focus group interviews. The results revealed a statistically significant increase in scores on both the DNA and Evolution benchmark exams following the change in scheduling format. The scheduling change was responsible for 1.5% of the increase in DNA benchmark scores and 2% of the increase in Evolution benchmark scores. The results revealed a statistically significant decrease in scores on the Genetics benchmark exam as a result of the scheduling change, which was responsible for 1% of the decrease in Genetics benchmark scores. The results also revealed a statistically significant increase in scores on the CST Biology exam; the scheduling change was responsible for 0.7% of that increase. Results of the focus group discussions indicated that all teachers preferred the modified hybrid schedule over the trimester schedule and that it improved student achievement.
SANA NetGO: a combinatorial approach to using Gene Ontology (GO) terms to score network alignments.
Hayes, Wayne B; Mamano, Nil
2018-04-15
Gene Ontology (GO) terms are frequently used to score alignments between protein-protein interaction (PPI) networks. Methods exist to measure GO similarity between proteins in isolation, but proteins in a network alignment are not isolated: each pairing is dependent on every other via the alignment itself. Existing measures fail to take into account the frequency of GO terms across networks, instead imposing arbitrary rules on when to allow GO terms. Here we develop NetGO, a new measure that naturally weighs infrequent, informative GO terms more heavily than frequent, less informative GO terms, without arbitrary cutoffs, instead downweighting GO terms according to their frequency in the networks being aligned. This is a global measure applicable only to alignments, independent of pairwise GO measures, in the same sense that the edge-based EC or S3 scores are global measures of topological similarity independent of pairwise topological similarities. We demonstrate the superiority of NetGO in alignments of predetermined quality and show that NetGO correlates with alignment quality better than any existing GO-based alignment measures. We also demonstrate that NetGO provides a measure of taxonomic similarity between species, consistent with existing taxonomic measures, a feature not shared with existing GO-based network alignment measures. Finally, we re-score alignments produced by almost a dozen aligners from a previous study and show that NetGO does a better job at separating good alignments from bad ones. Availability: available as part of SANA. Contact: whayes@uci.edu. Supplementary data are available at Bioinformatics online.
Parallel seed-based approach to multiple protein structure similarities detection
Chapuis, Guillaume; Le Boudic-Jamin, Mathilde; Andonov, Rumen; ...
2015-01-01
Finding similarities between protein structures is a crucial task in molecular biology. Most of the existing tools require proteins to be aligned in an order-preserving way and only find single alignments even when multiple similar regions exist. We propose a new seed-based approach that discovers multiple pairs of similar regions. Its computational complexity is polynomial and it comes with a quality guarantee: the returned alignments have both root mean squared deviations (coordinate-based as well as internal-distance-based) lower than a given threshold, if such alignments exist. We do not require the alignments to be order preserving (i.e., we consider nonsequential alignments), which makes our algorithm suitable for detecting similar domains when comparing multidomain proteins as well as for detecting structural repetitions within a single protein. Because the search space for nonsequential alignments is much larger than for sequential ones, the computational burden is addressed by extensive use of parallel computing techniques: a coarse-grain level parallelism making use of available CPU cores for computation and a fine-grain level parallelism exploiting bit-level concurrency as well as vector instructions.
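The internal-distance RMSD mentioned in the abstract above can be illustrated with a short sketch. This is not the authors' code; it assumes the common definition in which the intra-structure distances of every aligned position pair are compared, and the names `distance_rmsd` and `within_threshold` are hypothetical.

```python
import math
from itertools import combinations


def distance_rmsd(coords_a, coords_b):
    """Internal-distance RMSD between two equally long lists of 3D points.

    Position i of coords_a is assumed to be aligned with position i of
    coords_b. No superposition is needed, because only distances measured
    inside each structure are compared.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("aligned regions must have the same length")
    if len(coords_a) < 2:
        raise ValueError("need at least two aligned positions")
    squared_diffs = [
        (math.dist(coords_a[i], coords_a[j]) - math.dist(coords_b[i], coords_b[j])) ** 2
        for i, j in combinations(range(len(coords_a)), 2)
    ]
    return math.sqrt(sum(squared_diffs) / len(squared_diffs))


def within_threshold(coords_a, coords_b, threshold):
    """True if the internal-distance RMSD is lower than the given threshold."""
    return distance_rmsd(coords_a, coords_b) < threshold
```

Because only internal distances enter the score, a rigidly translated copy of a region scores zero, whereas the coordinate-based RMSD would additionally require an optimal superposition step that is not shown here.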
MutationAligner: a resource of recurrent mutation hotspots in protein domains in cancer
Gauthier, Nicholas Paul; Reznik, Ed; Gao, Jianjiong; Sumer, Selcuk Onur; Schultz, Nikolaus; Sander, Chris; Miller, Martin L.
2016-01-01
The MutationAligner web resource, available at http://www.mutationaligner.org, enables discovery and exploration of somatic mutation hotspots identified in protein domains in currently (mid-2015) more than 5000 cancer patient samples across 22 different tumor types. Using multiple sequence alignments of protein domains in the human genome, we extend the principle of recurrence analysis by aggregating mutations in homologous positions across sets of paralogous genes. Protein domain analysis enhances the statistical power to detect cancer-relevant mutations and links mutations to the specific biological functions encoded in domains. We illustrate how the MutationAligner database and interactive web tool can be used to explore, visualize and analyze mutation hotspots in protein domains across genes and tumor types. We believe that MutationAligner will be an important resource for the cancer research community by providing detailed clues for the functional importance of particular mutations, as well as for the design of functional genomics experiments and for decision support in precision medicine. MutationAligner is slated to be periodically updated to incorporate additional analyses and new data from cancer genomics projects. PMID:26590264
Statistical inference of protein structural alignments using information and compression.
Collier, James H; Allison, Lloyd; Lesk, Arthur M; Stuckey, Peter J; Garcia de la Banda, Maria; Konagurthu, Arun S
2017-04-01
Structural molecular biology depends crucially on computational techniques that compare protein three-dimensional structures and generate structural alignments (the assignment of one-to-one correspondences between subsets of amino acids based on atomic coordinates). Despite its importance, the structural alignment problem has not been formulated, much less solved, in a consistent and reliable way. To overcome these difficulties, we present here a statistical framework for the precise inference of structural alignments, built on the Bayesian and information-theoretic principle of Minimum Message Length (MML). The quality of any alignment is measured by its explanatory power: the amount of lossless compression achieved to explain the protein coordinates using that alignment. We have implemented this approach in MMLigner, the first program able to infer statistically significant structural alignments. We also demonstrate the reliability of MMLigner's alignment results when compared with the state of the art. Importantly, MMLigner can also discover different structural alignments of comparable quality, a challenging problem for oligomers and protein complexes. Source code, binaries and an interactive web version are available at http://lcb.infotech.monash.edu.au/mmligner. Contact: arun.konagurthu@monash.edu. Supplementary data are available at Bioinformatics online. © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
A statistical physics perspective on alignment-independent protein sequence comparison.
Chattopadhyay, Amit K; Nasiev, Diar; Flower, Darren R
2015-08-01
Within bioinformatics, the textual alignment of amino acid sequences has long dominated the determination of similarity between proteins, with all that implies for shared structure, function and evolutionary descent. Despite the relative success of modern-day sequence alignment algorithms, so-called alignment-free approaches offer a complementary means of determining and expressing similarity, with potential benefits in certain key applications, such as regression analysis of protein structure-function studies, where alignment-based similarity has performed poorly. Here, we offer a fresh, statistical physics-based perspective focusing on the question of alignment-free comparison, in the process adapting results from 'first passage probability distribution' to summarize statistics of ensemble averaged amino acid propensity values. In this article, we introduce and elaborate this approach. © The Author 2015. Published by Oxford University Press.
SeqLib: a C++ API for rapid BAM manipulation, sequence alignment and sequence assembly
Wala, Jeremiah; Beroukhim, Rameen
2017-01-01
We present SeqLib, a C++ API and command line tool that provides a rapid and user-friendly interface to BAM/SAM/CRAM files, global sequence alignment operations and sequence assembly. Four C libraries perform core operations in SeqLib: HTSlib for BAM access, BWA-MEM and BLAT for sequence alignment and Fermi for error correction and sequence assembly. Benchmarking indicates that SeqLib has lower CPU and memory requirements than leading C++ sequence analysis APIs. We demonstrate an example of how minimal SeqLib code can extract, error-correct and assemble reads from a CRAM file and then align with BWA-MEM. SeqLib also provides additional capabilities, including chromosome-aware interval queries and read plotting. Command line tools are available for performing integrated error correction, micro-assemblies and alignment. Availability and Implementation: SeqLib is available on Linux and OSX for the C++98 standard and later at github.com/walaj/SeqLib. SeqLib is released under the Apache2 license. Additional capabilities for BLAT alignment are available under the BLAT license. Contact: jwala@broadinstitue.org; rameen@broadinstitute.org PMID:28011768
SeqLib: a C++ API for rapid BAM manipulation, sequence alignment and sequence assembly.
Wala, Jeremiah; Beroukhim, Rameen
2017-03-01
We present SeqLib, a C++ API and command line tool that provides a rapid and user-friendly interface to BAM/SAM/CRAM files, global sequence alignment operations and sequence assembly. Four C libraries perform core operations in SeqLib: HTSlib for BAM access, BWA-MEM and BLAT for sequence alignment and Fermi for error correction and sequence assembly. Benchmarking indicates that SeqLib has lower CPU and memory requirements than leading C++ sequence analysis APIs. We demonstrate an example of how minimal SeqLib code can extract, error-correct and assemble reads from a CRAM file and then align with BWA-MEM. SeqLib also provides additional capabilities, including chromosome-aware interval queries and read plotting. Command line tools are available for performing integrated error correction, micro-assemblies and alignment. SeqLib is available on Linux and OSX for the C++98 standard and later at github.com/walaj/SeqLib. SeqLib is released under the Apache2 license. Additional capabilities for BLAT alignment are available under the BLAT license. Contact: jwala@broadinstitue.org; rameen@broadinstitute.org. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
The fractured landscape of RNA-seq alignment: the default in our STARs.
Ballouz, Sara; Dobin, Alexander; Gingeras, Thomas R; Gillis, Jesse
2018-06-01
Many tools are available for RNA-seq alignment and expression quantification, with comparative value being hard to establish. Benchmarking assessments often highlight methods' good performance, but are focused on either model data or fail to explain variation in performance. This leaves us to ask, what is the most meaningful way to assess different alignment choices? And importantly, where is there room for progress? In this work, we explore the answers to these two questions by performing an exhaustive assessment of the STAR aligner. We assess STAR's performance across a range of alignment parameters using common metrics, and then on biologically focused tasks. We find technical metrics such as fraction mapping or expression profile correlation to be uninformative, capturing properties unlikely to have any role in biological discovery. Surprisingly, we find that changes in alignment parameters within a wide range have little impact on both technical and biological performance. Yet, when performance finally does break, it happens in difficult regions, such as X-Y paralogs and MHC genes. We believe improved reporting by developers will help establish where results are likely to be robust or fragile, providing a better baseline to establish where methodological progress can still occur.
Cao, Hu; Lu, Yonggang
2017-01-01
With the rapid growth in the number of known protein 3D structures, how to efficiently compare protein structures has become an essential and challenging problem in computational structural biology. At present, many protein structure alignment methods have been developed. Among all these methods, flexible structure alignment methods are shown to be superior to rigid structure alignment methods in identifying structure similarities between proteins that have gone through conformational changes. It is also found that the methods based on aligned fragment pairs (AFPs) have a special advantage over other approaches in balancing global structure similarities and local structure similarities. Accordingly, we propose a new flexible protein structure alignment method based on variable-length AFPs. Compared with other methods, the proposed method possesses three main advantages. First, it is based on variable-length AFPs. The length of each AFP is separately determined to maximally represent a local similar structure fragment, which reduces the number of AFPs. Second, it uses local coordinate systems, which simplify the computation at each step of the expansion of AFPs during the AFP identification. Third, it decreases the number of twists by rewarding the situation where nonconsecutive AFPs share the same transformation in the alignment, which is realized by dynamic programming with an improved transition function. The experimental data show that, compared with FlexProt, FATCAT, and FlexSnap, the proposed method can achieve comparable results while introducing fewer twists. Meanwhile, it can generate results similar to those of the FATCAT method in much less running time due to the reduced number of AFPs.
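The idea of chaining AFPs by dynamic programming while discouraging twists can be sketched as follows. This is a deliberately simplified illustration, not the transition function of the paper: it only charges a fixed penalty when two consecutive AFPs in the chain carry different transformations, and the `AFP` record, its integer `transform` label, the `twist_penalty` value and the function name `chain_afps` are all assumptions made for the example.

```python
from typing import NamedTuple


class AFP(NamedTuple):
    start_a: int    # first residue of the fragment in protein A
    start_b: int    # first residue of the fragment in protein B
    length: int     # number of aligned residues (variable per AFP)
    transform: int  # hypothetical label identifying the rigid transformation


def chain_afps(afps, twist_penalty=3.0):
    """Return (score, chain) for the best order-preserving chain of AFPs.

    Score = total aligned length minus twist_penalty for every pair of
    consecutive AFPs in the chain whose transformations differ.
    """
    order = sorted(afps, key=lambda f: (f.start_a, f.start_b))
    if not order:
        return 0.0, []
    best = [float(f.length) for f in order]  # best chain score ending at each AFP
    prev = [None] * len(order)               # back-pointers for traceback
    for j, fj in enumerate(order):
        for i in range(j):
            fi = order[i]
            # fi may precede fj only if it ends before fj starts in both proteins
            if fi.start_a + fi.length > fj.start_a or fi.start_b + fi.length > fj.start_b:
                continue
            cost = 0.0 if fi.transform == fj.transform else twist_penalty
            candidate = best[i] + fj.length - cost
            if candidate > best[j]:
                best[j], prev[j] = candidate, i
    end = max(range(len(order)), key=best.__getitem__)
    chain, k = [], end
    while k is not None:
        chain.append(order[k])
        k = prev[k]
    chain.reverse()
    return best[end], chain
```

Raising `twist_penalty` makes the chain prefer fragments that share a transformation even when a longer fragment with a different transformation is available, which is the trade-off the abstract describes in qualitative terms.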
ERIC Educational Resources Information Center
Achieve, Inc., 2010
2010-01-01
In response to concerns over the need for a scientifically literate workforce, increasing the STEM pipeline, and aging science standards documents, the scientific and science education communities are embarking on the development of a new conceptual framework for science, led by the National Research Council (NRC), and aligned next generation…
Surflex-Dock: Docking benchmarks and real-world application
NASA Astrophysics Data System (ADS)
Spitzer, Russell; Jain, Ajay N.
2012-06-01
Benchmarks for molecular docking have historically focused on re-docking the cognate ligand of a well-determined protein-ligand complex to measure geometric pose prediction accuracy, and measurement of virtual screening performance has been focused on increasingly large and diverse sets of target protein structures, cognate ligands, and various types of decoy sets. Here, pose prediction is reported on the Astex Diverse set of 85 protein ligand complexes, and virtual screening performance is reported on the DUD set of 40 protein targets. In both cases, prepared structures of targets and ligands were provided by symposium organizers. The re-prepared data sets yielded results not significantly different from previous reports of Surflex-Dock on the two benchmarks. Minor changes to protein coordinates resulting from complex pre-optimization had large effects on observed performance, highlighting the limitations of cognate ligand re-docking for pose prediction assessment. Docking protocols developed for cross-docking, which address protein flexibility and produce discrete families of predicted poses, produced substantially better performance for pose prediction. Virtual screening performance was shown to benefit from employing and combining multiple screening methods: docking, 2D molecular similarity, and 3D molecular similarity. In addition, use of multiple protein conformations significantly improved screening enrichment.
VANLO - Interactive visual exploration of aligned biological networks
Brasch, Steffen; Linsen, Lars; Fuellen, Georg
2009-01-01
Background Protein-protein interaction (PPI) is fundamental to many biological processes. In the course of evolution, biological networks such as protein-protein interaction networks have developed. Biological networks of different species can be aligned by finding instances (e.g. proteins) with the same common ancestor in the evolutionary process, so-called orthologs. For a better understanding of the evolution of biological networks, such aligned networks have to be explored. Visualization can play a key role in making the various relationships transparent. Results We present a novel visualization system for aligned biological networks in 3D space that naturally embeds existing 2D layouts. In addition to displaying the intra-network connectivities, we also provide insight into how the individual networks relate to each other by placing aligned entities on top of each other in separate layers. We optimize the layout of the entire alignment graph in a global fashion that takes into account inter- as well as intra-network relationships. The layout algorithm includes a step of merging aligned networks into one graph, laying out the graph with respect to application-specific requirements, splitting the merged graph again into individual networks, and displaying the network alignment in layers. In addition to representing the data in a static way, we also provide different interaction techniques to explore the data with respect to application-specific tasks. Conclusion Our system provides an intuitive global understanding of aligned PPI networks and it allows the investigation of key biological questions. We evaluate our system by applying it to real-world examples documenting how our system can be used to investigate the data with respect to these key questions. Our tool VANLO (Visualization of Aligned Networks with Layout Optimization) can be accessed at . PMID:19821976
Lagarde, Nathalie; Zagury, Jean-François; Montes, Matthieu
2015-07-27
Virtual screening methods are commonly used nowadays in drug discovery processes. However, to ensure their reliability, they have to be carefully evaluated. The evaluation of these methods is often realized in a retrospective way, notably by studying the enrichment of benchmarking data sets. To this purpose, numerous benchmarking data sets were developed over the years, and the resulting improvements led to the availability of high quality benchmarking data sets. However, some points still have to be considered in the selection of the active compounds, decoys, and protein structures to obtain optimal benchmarking data sets.
Wang, Xu; Le, Anh -Thu; Yu, Chao; ...
2016-03-30
We discuss a scheme to retrieve transient conformational molecular structure information using photoelectron angular distributions (PADs) that have averaged over partial alignments of isolated molecules. The photoelectron is pulled out from a localized inner-shell molecular orbital by an X-ray photon. We show that a transient change in the atomic positions from their equilibrium will lead to a sensitive change in the alignment-averaged PADs, which can be measured and used to retrieve the former. Exploiting the experimental convenience of changing the photon polarization direction, we show that it is advantageous to use PADs obtained from multiple photon polarization directions. Lastly, a simple single-scattering model is proposed and benchmarked to describe the photoionization process and to do the retrieval using a multiple-parameter fitting method.
MutationAligner: a resource of recurrent mutation hotspots in protein domains in cancer.
Gauthier, Nicholas Paul; Reznik, Ed; Gao, Jianjiong; Sumer, Selcuk Onur; Schultz, Nikolaus; Sander, Chris; Miller, Martin L
2016-01-04
The MutationAligner web resource, available at http://www.mutationaligner.org, enables discovery and exploration of somatic mutation hotspots identified in protein domains in currently (mid-2015) more than 5000 cancer patient samples across 22 different tumor types. Using multiple sequence alignments of protein domains in the human genome, we extend the principle of recurrence analysis by aggregating mutations in homologous positions across sets of paralogous genes. Protein domain analysis enhances the statistical power to detect cancer-relevant mutations and links mutations to the specific biological functions encoded in domains. We illustrate how the MutationAligner database and interactive web tool can be used to explore, visualize and analyze mutation hotspots in protein domains across genes and tumor types. We believe that MutationAligner will be an important resource for the cancer research community by providing detailed clues for the functional importance of particular mutations, as well as for the design of functional genomics experiments and for decision support in precision medicine. MutationAligner is slated to be periodically updated to incorporate additional analyses and new data from cancer genomics projects. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
Face Alignment via Regressing Local Binary Features.
Ren, Shaoqing; Cao, Xudong; Wei, Yichen; Sun, Jian
2016-03-01
This paper presents a highly efficient and accurate regression approach for face alignment. Our approach has two novel components: 1) a set of local binary features and 2) a locality principle for learning those features. The locality principle guides us to learn a set of highly discriminative local binary features for each facial landmark independently. The obtained local binary features are used to jointly learn a linear regression for the final output. This approach achieves state-of-the-art results when tested on the most challenging benchmarks to date. Furthermore, because extracting and regressing local binary features are computationally very cheap, our system is much faster than previous methods. It achieves over 3000 frames per second (FPS) on a desktop or 300 FPS on a mobile phone for locating a few dozen landmarks. We also study a key issue that is important but has received little attention in previous research: the face detector used to initialize alignment. We investigate several face detectors and perform a quantitative evaluation of how they affect alignment accuracy. We find that an alignment-friendly detector can further boost the accuracy of our alignment method, reducing the error by up to 16% in relative terms. To facilitate practical usage of face detection/alignment methods, we also propose a convenient metric to measure how good a detector is for alignment initialization.
HDOCK: a web server for protein-protein and protein-DNA/RNA docking based on a hybrid strategy.
Yan, Yumeng; Zhang, Di; Zhou, Pei; Li, Botong; Huang, Sheng-You
2017-07-03
Protein-protein and protein-DNA/RNA interactions play a fundamental role in a variety of biological processes. Determining the complex structures of these interactions is valuable, and molecular docking has played an important role in doing so. To automatically make use of the binding information from the PDB in docking, here we have presented HDOCK, a novel web server of our hybrid docking algorithm of template-based modeling and free docking, in which cases with misleading templates can be rescued by the free docking protocol. The server supports protein-protein and protein-DNA/RNA docking and accepts both sequence and structure inputs for proteins. The docking process is fast and takes about 10-20 min for a docking run. Tested on the cases with weakly homologous complexes of <30% sequence identity from five docking benchmarks, the HDOCK pipeline tied with template-based modeling on the protein-protein and protein-DNA benchmarks and performed better than template-based modeling on the three protein-RNA benchmarks when the top 10 predictions were considered. The performance of HDOCK became better when more predictions were considered. Combining the results of HDOCK and template-based modeling, with the template-based model ranked first, further improved the predictive power of the server. The HDOCK web server is available at http://hdock.phys.hust.edu.cn/. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
Global Alignment of Pairwise Protein Interaction Networks for Maximal Common Conserved Patterns
Tian, Wenhong; Samatova, Nagiza F.
2013-01-01
A number of tools for the alignment of protein-protein interaction (PPI) networks have laid the foundation for PPI network analysis. Most alignment tools focus on finding conserved interaction regions across the PPI networks through either local or global mapping of similar sequences. Researchers are still trying to improve the speed, scalability, and accuracy of network alignment. In view of this, we introduce a connected-components based fast algorithm, HopeMap, for network alignment. Observing that the number of true orthologs across species is small compared to the total number of proteins in all species, we take a different approach based on a precompiled list of homologs identified by KO terms. Applying this approach to S. cerevisiae (yeast) and D. melanogaster (fly), E. coli K12 and S. typhimurium, and E. coli K12 and C. crescentus, we analyze all clusters identified in the alignment. The results are evaluated through up-to-date known gene annotations, gene ontology (GO), and KEGG ortholog groups (KO). Compared to existing tools, our approach is fast with linear computational cost, highly accurate in terms of KO and GO terms specificity and sensitivity, and can be extended to multiple alignments easily.
L-GRAAL: Lagrangian graphlet-based network aligner.
Malod-Dognin, Noël; Pržulj, Nataša
2015-07-01
Discovering and understanding patterns in networks of protein-protein interactions (PPIs) is a central problem in systems biology. Alignments between these networks aid functional understanding as they uncover important information, such as evolutionary conserved pathways, protein complexes and functional orthologs. A few methods have been proposed for global PPI network alignments, but because of NP-completeness of underlying sub-graph isomorphism problem, producing topologically and biologically accurate alignments remains a challenge. We introduce a novel global network alignment tool, Lagrangian GRAphlet-based ALigner (L-GRAAL), which directly optimizes both the protein and the interaction functional conservations, using a novel alignment search heuristic based on integer programming and Lagrangian relaxation. We compare L-GRAAL with the state-of-the-art network aligners on the largest available PPI networks from BioGRID and observe that L-GRAAL uncovers the largest common sub-graphs between the networks, as measured by edge-correctness and symmetric sub-structures scores, which allow transferring more functional information across networks. We assess the biological quality of the protein mappings using the semantic similarity of their Gene Ontology annotations and observe that L-GRAAL best uncovers functionally conserved proteins. Furthermore, we introduce for the first time a measure of the semantic similarity of the mapped interactions and show that L-GRAAL also uncovers best functionally conserved interactions. In addition, we illustrate on the PPI networks of baker's yeast and human the ability of L-GRAAL to predict new PPIs. Finally, L-GRAAL's results are the first to show that topological information is more important than sequence information for uncovering functionally conserved interactions. L-GRAAL is coded in C++. Software is available at: http://bio-nets.doc.ic.ac.uk/L-GRAAL/. 
Contact: n.malod-dognin@imperial.ac.uk. Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press.
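The edge-correctness measure named in the L-GRAAL abstract can be illustrated with a small sketch. It assumes the usual definition (the fraction of edges of the first network that the alignment maps onto edges of the second network) and does not reproduce L-GRAAL's own objective or search heuristic; the function name `edge_correctness` and the toy networks in the usage note are hypothetical.

```python
def edge_correctness(edges_1, edges_2, mapping):
    """Fraction of edges in network 1 that are conserved under the alignment.

    edges_1, edges_2: iterables of (node, node) pairs, treated as undirected.
    mapping: dict sending nodes of network 1 to nodes of network 2.
    """
    edges_1 = list(edges_1)
    if not edges_1:
        raise ValueError("network 1 has no edges")
    # frozenset makes (x, y) and (y, x) compare equal
    target_edges = {frozenset(edge) for edge in edges_2}
    conserved = sum(
        1
        for u, v in edges_1
        if u in mapping and v in mapping
        and frozenset((mapping[u], mapping[v])) in target_edges
    )
    return conserved / len(edges_1)
```

For a triangle a-b-c aligned to a path x-y-z with a→x, b→y, c→z, two of the three source edges land on target edges, so the score is 2/3.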
TMFoldWeb: a web server for predicting transmembrane protein fold class.
Kozma, Dániel; Tusnády, Gábor E
2015-09-17
Here we present TMFoldWeb, the web server implementation of TMFoldRec, a transmembrane protein fold recognition algorithm. TMFoldRec uses statistical potentials and utilizes topology filtering and a gapless threading algorithm. It ranks template structures, selects the most likely candidates and estimates the reliability of the obtained lowest energy model. The statistical potential was developed in a maximum likelihood framework on a representative set of the PDBTM database. According to the benchmark test, TMFoldRec correctly predicts the fold class for about 77% of the given transmembrane protein sequences. An intuitive web interface has been developed for the recently published TMFoldRec algorithm. The query sequence goes through a pipeline of topology prediction and a systematic sequence to structure alignment (threading). Resulting templates are ordered by energy and reliability values and are colored according to their significance level. Besides the graphical interface, programmatic access is available as well, via a direct interface for developers or for submitting genome-wide data sets. The TMFoldWeb web server is currently the only web server that is able to predict the fold class of transmembrane proteins while assigning reliability scores for the prediction. This method is prepared for genome-wide analysis with its easy-to-use interface, informative result page and programmatic access. Considering the info-communication evolution in the last few years, the developed web server, as well as the molecule viewer, is responsive and fully compatible with the prevalent tablets and mobile devices.
MIPS bacterial genomes functional annotation benchmark dataset.
Tetko, Igor V; Brauner, Barbara; Dunger-Kaltenbach, Irmtraud; Frishman, Goar; Montrone, Corinna; Fobo, Gisela; Ruepp, Andreas; Antonov, Alexey V; Surmeli, Dimitrij; Mewes, Hans-Werner
2005-05-15
Any development of new methods for automatic functional annotation of proteins according to their sequences requires high-quality data (as benchmark) as well as tedious preparatory work to generate sequence parameters required as input data for the machine learning methods. Different program settings and incompatible protocols make a comparison of the analyzed methods difficult. The MIPS Bacterial Functional Annotation Benchmark dataset (MIPS-BFAB) is a new, high-quality resource comprising four bacterial genomes manually annotated according to the MIPS functional catalogue (FunCat). These resources include precalculated sequence parameters, such as sequence similarity scores, InterPro domain composition and other parameters that could be used to develop and benchmark methods for functional annotation of bacterial protein sequences. These data are provided in XML format and can be used by scientists who are not necessarily experts in genome annotation. BFAB is available at http://mips.gsf.de/proj/bfab
ERIC Educational Resources Information Center
Serva, Mark A.; Fuller, Mark A.
2004-01-01
Current methods of evaluating learning and instruction have not kept pace with changes in learning theory, or with the transformed technological infrastructure of the modern business school classroom. Without reliable and valid instructional measurement systems, it is virtually impossible to benchmark new pedagogical techniques, assess the value…
Antibody-protein interactions: benchmark datasets and prediction tools evaluation
Ponomarenko, Julia V; Bourne, Philip E
2007-01-01
Background The ability to predict antibody binding sites (aka antigenic determinants or B-cell epitopes) for a given protein is a precursor to new vaccine design and diagnostics. Among the various methods of B-cell epitope identification X-ray crystallography is one of the most reliable methods. Using these experimental data computational methods exist for B-cell epitope prediction. As the number of structures of antibody-protein complexes grows, further interest in prediction methods using 3D structure is anticipated. This work aims to establish a benchmark for 3D structure-based epitope prediction methods. Results Two B-cell epitope benchmark datasets inferred from the 3D structures of antibody-protein complexes were defined. The first is a dataset of 62 representative 3D structures of protein antigens with inferred structural epitopes. The second is a dataset of 82 structures of antibody-protein complexes containing different structural epitopes. Using these datasets, eight web-servers developed for antibody and protein binding sites prediction have been evaluated. In no method did performance exceed a 40% precision and 46% recall. The values of the area under the receiver operating characteristic curve for the evaluated methods were about 0.6 for ConSurf, DiscoTope, and PPI-PRED methods and above 0.65 but not exceeding 0.70 for protein-protein docking methods when the best of the top ten models for the bound docking were considered; the remaining methods performed close to random. The benchmark datasets are included as a supplement to this paper. Conclusion It may be possible to improve epitope prediction methods through training on datasets which include only immune epitopes and through utilizing more features characterizing epitopes, for example, the evolutionary conservation score. Notwithstanding, overall poor performance may reflect the generality of antigenicity and hence the inability to decipher B-cell epitopes as an intrinsic feature of the protein. 
It is an open question as to whether ultimately discriminatory features can be found. PMID:17910770
Samudrala, Ram
2015-01-01
We have examined the effect of eight different protein classes (channels, GPCRs, kinases, ligases, nuclear receptors, proteases, phosphatases, transporters) on the benchmarking performance of the CANDO drug discovery and repurposing platform (http://protinfo.org/cando). The first version of the CANDO platform utilizes a matrix of predicted interactions between 48278 proteins and 3733 human ingestible compounds (including FDA approved drugs and supplements) that map to 2030 indications/diseases using a hierarchical chem and bio-informatic fragment based docking with dynamics protocol (> one billion predicted interactions considered). The platform uses similarity of compound-proteome interaction signatures as indicative of similar functional behavior and benchmarking accuracy is calculated across 1439 indications/diseases with more than one approved drug. The CANDO platform yields a significant correlation (0.99, p-value < 0.0001) between the number of proteins considered and benchmarking accuracy obtained indicating the importance of multitargeting for drug discovery. Average benchmarking accuracies range from 6.2 % to 7.6 % for the eight classes when the top 10 ranked compounds are considered, in contrast to a range of 5.5 % to 11.7 % obtained for the comparison/control sets consisting of 10, 100, 1000, and 10000 single best performing proteins. These results are generally two orders of magnitude better than the average accuracy of 0.2% obtained when randomly generated (fully scrambled) matrices are used. Different indications perform well when different classes are used but the best accuracies (up to 11.7% for the top 10 ranked compounds) are achieved when a combination of classes are used containing the broadest distribution of protein folds. Our results illustrate the utility of the CANDO approach and the consideration of different protein classes for devising indication specific protocols for drug repurposing as well as drug discovery. PMID:25694071
Integrated crystal mounting and alignment system for high-throughput biological crystallography
Nordmeyer, Robert A.; Snell, Gyorgy P.; Cornell, Earl W.; Kolbe, William F.; Yegian, Derek T.; Earnest, Thomas N.; Jaklevic, Joseph M.; Cork, Carl W.; Santarsiero, Bernard D.; Stevens, Raymond C.
2007-09-25
A method and apparatus for the transportation, remote and unattended mounting, and visual alignment and monitoring of protein crystals for synchrotron generated x-ray diffraction analysis. The protein samples are maintained at liquid nitrogen temperatures at all times: during shipment, before mounting, mounting, alignment, data acquisition and following removal. The samples must additionally be stably aligned to within a few microns at a point in space. The ability to accurately perform these tasks remotely and automatically leads to a significant increase in sample throughput and reliability for high-volume protein characterization efforts. Since the protein samples are placed in a shipping-compatible layered stack of sample cassettes each holding many samples, a large number of samples can be shipped in a single cryogenic shipping container.
Integrated crystal mounting and alignment system for high-throughput biological crystallography
Nordmeyer, Robert A.; Snell, Gyorgy P.; Cornell, Earl W.; Kolbe, William; Yegian, Derek; Earnest, Thomas N.; Jaklevic, Joseph M.; Cork, Carl W.; Santarsiero, Bernard D.; Stevens, Raymond C.
2005-07-19
A method and apparatus for the transportation, remote and unattended mounting, and visual alignment and monitoring of protein crystals for synchrotron generated x-ray diffraction analysis. The protein samples are maintained at liquid nitrogen temperatures at all times: during shipment, before mounting, mounting, alignment, data acquisition and following removal. The samples must additionally be stably aligned to within a few microns at a point in space. The ability to accurately perform these tasks remotely and automatically leads to a significant increase in sample throughput and reliability for high-volume protein characterization efforts. Since the protein samples are placed in a shipping-compatible layered stack of sample cassettes each holding many samples, a large number of samples can be shipped in a single cryogenic shipping container.
LenVarDB: database of length-variant protein domains.
Mutt, Eshita; Mathew, Oommen K; Sowdhamini, Ramanathan
2014-01-01
Protein domains are functionally and structurally independent modules, which add to the functional variety of proteins. This array of functional diversity has been enabled by evolutionary changes, such as amino acid substitutions or insertions or deletions, occurring in these protein domains. Length variations (indels) can introduce changes at structural, functional and interaction levels. LenVarDB (freely available at http://caps.ncbs.res.in/lenvardb/) traces these length variations, starting from structure-based sequence alignments in our Protein Alignments organized as Structural Superfamilies (PASS2) database, across 731 structural classification of proteins (SCOP)-based protein domain superfamilies connected to 2 730 625 sequence homologues. Alignment of sequence homologues corresponding to a structural domain is available, starting from a structure-based sequence alignment of the superfamily. Orientation of the length-variant (indel) regions in protein domains can be visualized by mapping them on the structure and on the alignment. Knowledge about location of length variations within protein domains and their visual representation will be useful in predicting changes within structurally or functionally relevant sites, which may ultimately regulate protein function. Non-technical summary: Evolutionary changes bring about natural changes to proteins that may be found in many organisms. Such changes could be reflected as amino acid substitutions or insertions-deletions (indels) in protein sequences. LenVarDB is a database that provides an early overview of observed length variations that were set among 731 protein families and after examining >2 million sequences. Indels are followed up to observe if they are close to the active site such that they can affect the activity of proteins. Inclusion of such information can aid the design of bioengineering experiments.
Graph wavelet alignment kernels for drug virtual screening.
Smalter, Aaron; Huan, Jun; Lushington, Gerald
2009-06-01
In this paper, we introduce a novel statistical modeling technique for target property prediction, with applications to virtual screening and drug design. In our method, we use graphs to model chemical structures and apply a wavelet analysis of graphs to summarize features capturing graph local topology. We design a novel graph kernel function to utilize the topology features to build predictive models for chemicals via Support Vector Machine classifier. We call the new graph kernel a graph wavelet-alignment kernel. We have evaluated the efficacy of the wavelet-alignment kernel using a set of chemical structure-activity prediction benchmarks. Our results indicate that the use of the kernel function yields performance profiles comparable to, and sometimes exceeding that of the existing state-of-the-art chemical classification approaches. In addition, our results also show that the use of wavelet functions significantly decreases the computational costs for graph kernel computation with more than ten fold speedup.
A Protein Standard That Emulates Homology for the Characterization of Protein Inference Algorithms.
The, Matthew; Edfors, Fredrik; Perez-Riverol, Yasset; Payne, Samuel H; Hoopmann, Michael R; Palmblad, Magnus; Forsström, Björn; Käll, Lukas
2018-05-04
A natural way to benchmark the performance of an analytical experimental setup is to use samples of known composition and see to what degree one can correctly infer the content of such a sample from the data. For shotgun proteomics, one of the inherent problems of interpreting data is that the measured analytes are peptides and not the actual proteins themselves. As some proteins share proteolytic peptides, there might be more than one possible causative set of proteins resulting in a given set of peptides and there is a need for mechanisms that infer proteins from lists of detected peptides. A weakness of commercially available samples of known content is that they consist of proteins that are deliberately selected for producing tryptic peptides that are unique to a single protein. Unfortunately, such samples do not expose any complications in protein inference. Hence, for a realistic benchmark of protein inference procedures, there is a need for samples of known content where the present proteins share peptides with known absent proteins. Here, we present such a standard, that is based on E. coli expressed human protein fragments. To illustrate the application of this standard, we benchmark a set of different protein inference procedures on the data. We observe that inference procedures excluding shared peptides provide more accurate estimates of errors compared to methods that include information from shared peptides, while still giving a reasonable performance in terms of the number of identified proteins. We also demonstrate that using a sample of known protein content without proteins with shared tryptic peptides can give a false sense of accuracy for many protein inference methods.
Web-Beagle: a web server for the alignment of RNA secondary structures.
Mattei, Eugenio; Pietrosanto, Marco; Ferrè, Fabrizio; Helmer-Citterich, Manuela
2015-07-01
Web-Beagle (http://beagle.bio.uniroma2.it) is a web server for the pairwise global or local alignment of RNA secondary structures. The server exploits a new encoding for RNA secondary structure and a substitution matrix of RNA structural elements to perform RNA structural alignments. The web server allows the user to compute up to 10 000 alignments in a single run, taking as input sets of RNA sequences and structures or primary sequences alone. In the latter case, the server computes the secondary structure prediction for the RNAs on-the-fly using RNAfold (free energy minimization). The user can also compare a set of input RNAs to one of five pre-compiled RNA datasets including lncRNAs and 3' UTRs. All types of comparison produce in output the pairwise alignments along with structural similarity and statistical significance measures for each resulting alignment. A graphical color-coded representation of the alignments allows the user to easily identify structural similarities between RNAs. Web-Beagle can be used for finding structurally related regions in two or more RNAs, for the identification of homologous regions or for functional annotation. Benchmark tests show that Web-Beagle has lower computational complexity, running time and better performances than other available methods. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
Sela, Itamar; Ashkenazy, Haim; Katoh, Kazutaka; Pupko, Tal
2015-07-01
Inference of multiple sequence alignments (MSAs) is a critical part of phylogenetic and comparative genomics studies. However, from the same set of sequences different MSAs are often inferred, depending on the methodologies used and the assumed parameters. Much effort has recently been devoted to improving the ability to identify unreliable alignment regions. Detecting such unreliable regions was previously shown to be important for downstream analyses relying on MSAs, such as the detection of positive selection. Here we developed GUIDANCE2, a new integrative methodology that accounts for: (i) uncertainty in the process of indel formation, (ii) uncertainty in the assumed guide tree and (iii) co-optimal solutions in the pairwise alignments, used as building blocks in progressive alignment algorithms. We compared GUIDANCE2 with seven methodologies to detect unreliable MSA regions using extensive simulations and empirical benchmarks. We show that GUIDANCE2 outperforms all previously developed methodologies. Furthermore, GUIDANCE2 also provides a set of alternative MSAs which can be useful for downstream analyses. The novel algorithm is implemented as a web-server, available at: http://guidance.tau.ac.il. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
Krissinel, E; Henrick, K
2004-12-01
The present paper describes the SSM algorithm of protein structure comparison in three dimensions, which includes an original procedure of matching graphs built on the protein's secondary-structure elements, followed by an iterative three-dimensional alignment of protein backbone Cα atoms. The SSM results are compared with those obtained from other protein comparison servers, and the advantages and disadvantages of different scores that are used for structure recognition are discussed. A new score, balancing the r.m.s.d. and alignment length Nalign, is proposed. It is found that different servers agree reasonably well on the new score, while showing considerable differences in r.m.s.d. and Nalign.
Mango: multiple alignment with N gapped oligos.
Zhang, Zefeng; Lin, Hao; Li, Ming
2008-06-01
Multiple sequence alignment is a classical and challenging task. The problem is NP-hard. The full dynamic programming takes too much time. The progressive alignment heuristics adopted by most state-of-the-art works suffer from the "once a gap, always a gap" phenomenon. Is there a radically new way to do multiple sequence alignment? In this paper, we introduce a novel and orthogonal multiple sequence alignment method, using both multiple optimized spaced seeds and new algorithms to handle these seeds efficiently. Our new algorithm processes information of all sequences as a whole and tries to build the alignment vertically, avoiding problems caused by the popular progressive approaches. Because the optimized spaced seeds have proved significantly more sensitive than the consecutive k-mers, the new approach promises to be more accurate and reliable. To validate our new approach, we have implemented MANGO: Multiple Alignment with N Gapped Oligos. Experiments were carried out on large 16S RNA benchmarks, showing that MANGO compares favorably, in both accuracy and speed, against state-of-the-art multiple sequence alignment methods, including ClustalW 1.83, MUSCLE 3.6, MAFFT 5.861, ProbConsRNA 1.11, Dialign 2.2.1, DIALIGN-T 0.2.1, T-Coffee 4.85, POA 2.0, and Kalign 2.0. We have further demonstrated the scalability of MANGO on very large datasets of repeat elements. MANGO can be downloaded at http://www.bioinfo.org.cn/mango/ and is free for academic usage.
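To make the contrast between spaced seeds and consecutive k-mers concrete, here is a minimal sketch of spaced-seed hit detection between two sequences. The seed pattern "1101" (1 = position must match, 0 = position is ignored) and the function name are hypothetical choices for illustration; MANGO's optimized seeds and its seed-handling algorithms are not reproduced here.

```python
def spaced_seed_hits(a, b, seed="1101"):
    """Return sorted (i, j) pairs where a[i:] and b[j:] agree on every '1' position of `seed`."""
    care = [k for k, c in enumerate(seed) if c == "1"]  # positions that must match
    span = len(seed)
    # Index every window of `a` by the characters at the "care" positions.
    index = {}
    for i in range(len(a) - span + 1):
        key = "".join(a[i + k] for k in care)
        index.setdefault(key, []).append(i)
    # Look up every window of `b` in that index.
    hits = []
    for j in range(len(b) - span + 1):
        key = "".join(b[j + k] for k in care)
        hits.extend((i, j) for i in index.get(key, []))
    return sorted(hits)


# One substitution at the ignored position: the spaced seed still hits,
# while the contiguous 4-mer seed "1111" does not.
print(spaced_seed_hits("ACGT", "ACTT", "1101"))
print(spaced_seed_hits("ACGT", "ACTT", "1111"))
```

The example shows why a seed with an ignored position can tolerate a mismatch that breaks an exact k-mer match of the same span.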
Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments
DOE Office of Scientific and Technical Information (OSTI.GOV)
Daily, Jeffrey A.
Sequence alignment algorithms are a key component of many bioinformatics applications. Though various fast Smith-Waterman local sequence alignment implementations have been developed for x86 CPUs, most are embedded into larger database search tools. In addition, fast implementations of Needleman-Wunsch global sequence alignment and its semi-global variants are not as widespread. This article presents the first software library for local, global, and semi-global pairwise intra-sequence alignments and improves the performance of previous intra-sequence implementations. As a result, a faster intra-sequence pairwise alignment implementation is described and benchmarked. Using a 375 residue query sequence a speed of 136 billion cell updates per second (GCUPS) was achieved on a dual Intel Xeon E5-2670 12-core processor system, the highest reported for an implementation based on Farrar's 'striped' approach. When using only a single thread, parasail was 1.7 times faster than Rognes's SWIPE. For many score matrices, parasail is faster than BLAST. The software library is designed for 64 bit Linux, OS X, or Windows on processors with SSE2, SSE4.1, or AVX2. Source code is available from https://github.com/jeffdaily/parasail under the Battelle BSD-style license. In conclusion, applications that require optimal alignment scores could benefit from the improved performance. For the first time, SIMD global, semi-global, and local alignments are available in a stand-alone C library.
Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments
Daily, Jeffrey A.
2016-02-10
Sequence alignment algorithms are a key component of many bioinformatics applications. Though various fast Smith-Waterman local sequence alignment implementations have been developed for x86 CPUs, most are embedded into larger database search tools. In addition, fast implementations of Needleman-Wunsch global sequence alignment and its semi-global variants are not as widespread. This article presents the first software library for local, global, and semi-global pairwise intra-sequence alignments and improves the performance of previous intra-sequence implementations. As a result, a faster intra-sequence pairwise alignment implementation is described and benchmarked. Using a 375 residue query sequence a speed of 136 billion cell updates per second (GCUPS) was achieved on a dual Intel Xeon E5-2670 12-core processor system, the highest reported for an implementation based on Farrar's 'striped' approach. When using only a single thread, parasail was 1.7 times faster than Rognes's SWIPE. For many score matrices, parasail is faster than BLAST. The software library is designed for 64 bit Linux, OS X, or Windows on processors with SSE2, SSE4.1, or AVX2. Source code is available from https://github.com/jeffdaily/parasail under the Battelle BSD-style license. In conclusion, applications that require optimal alignment scores could benefit from the improved performance. For the first time, SIMD global, semi-global, and local alignments are available in a stand-alone C library.
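For readers unfamiliar with the quantities in the abstract above, the following is a scalar reference sketch of the Smith-Waterman local alignment score, together with the cell-updates-per-second arithmetic behind the GCUPS unit. It is not parasail's API and uses no SIMD; the linear gap penalty, the scoring values and both function names are simplifying assumptions made for illustration.

```python
def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-2):
    """Best local alignment score under a linear gap penalty (plain scalar loop)."""
    prev = [0] * (len(target) + 1)  # previous row of the dynamic-programming table
    best = 0
    for q in query:
        curr = [0]  # column 0 is always 0 for local alignment
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (match if q == t else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def gcups(query_len, target_residues, seconds):
    """Billions of table cells updated per second: (query length x database residues) / time / 1e9."""
    return query_len * target_residues / seconds / 1e9


print(smith_waterman_score("TTACGTTT", "GGACGGG"))  # shared core "ACG"
print(gcups(100, 1_000_000_000, 10.0))              # made-up numbers, only to show the unit
```

Each table cell depends on its left, upper and upper-left neighbours, which is the dependency pattern that vectorized implementations have to work around.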
Wu, Laying; Lee, L Andrew; Niu, Zhongwei; Ghoshroy, Soumitra; Wang, Qian
2011-08-02
Topographical features ranging from micro- to nanometers can affect cell orientation and migratory pathways, which are important factors in tissue engineering and tumor migration. In our previous study, a convective assembly of bacteriophage M13 resulted in thin films which could be used to control the alignment of cells. However, several questions regarding its underlying reasons to dictate cell alignment remained unanswered. Here, we further study the nanometer topographical features generated by the bacteriophage M13 crystalline film, which results in the alignment of the cells and extracellular matrix (ECM) proteins. Sequential imaging analyses at micro- and nanoscale levels of aligned cells and fibrillar matrix proteins were documented using scanning electron microscopy and immunofluorescence microscopy. As a result, we observed baby hamster kidney cells with higher degree of alignment on the ordered M13 substrates than NIH-3T3 fibroblasts, a difference which could be attributed to the intrinsic nature of the cells' production of ECM proteins. The results from this study provide a crucial insight into the topographical features of a biological thin film, which can be utilized to control the orientation of cells and surrounding ECM proteins.
Vamparys, Lydie; Laurent, Benoist; Carbone, Alessandra; Sacquin-Mora, Sophie
2016-10-01
Protein-protein interactions play a key part in most biological processes and understanding their mechanism is a fundamental problem leading to numerous practical applications. The prediction of protein binding sites in particular is of paramount importance since proteins now represent a major class of therapeutic targets. Among other methods, docking simulations between two proteins known to interact can be a useful tool for the prediction of likely binding patches on a protein surface. From the analysis of the protein interfaces generated by a massive cross-docking experiment using the 168 proteins of the Docking Benchmark 2.0, where all possible protein pairs, and not only experimental ones, have been docked together, we show that it is also possible to predict a protein's binding residues without having any prior knowledge regarding its potential interaction partners. Evaluating the performance of cross-docking predictions using the area under the specificity-sensitivity ROC curve (AUC) leads to an AUC value of 0.77 for the complete benchmark (compared to the 0.5 AUC value obtained for random predictions). Furthermore, a new clustering analysis performed on the binding patches that are scattered on the protein surface shows that their distribution and growth will depend on the protein's functional group. Finally, in several cases, the binding-site predictions resulting from the cross-docking simulations will lead to the identification of an alternate interface, which corresponds to the interaction with a biomolecular partner that is not included in the original benchmark. Proteins 2016; 84:1408-1421. © 2016 The Authors Proteins: Structure, Function, and Bioinformatics Published by Wiley Periodicals, Inc.
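The AUC figures quoted above (0.77 against 0.5 for random predictions) can be read through the rank interpretation of the ROC area: the probability that a randomly chosen true binding residue receives a higher score than a randomly chosen non-binding residue. The sketch below computes that quantity directly; the function name and the toy scores are hypothetical and are not data from the study.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as half a win)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy per-residue scores: label 1 = interface residue, 0 = non-interface residue.
print(roc_auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))
```

The pairwise form is quadratic in the number of residues, which is fine for a sketch; a sort-based computation would be preferable for large benchmarks.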
Protein local structure alignment under the discrete Fréchet distance.
Zhu, Binhai
2007-12-01
Protein structure alignment is a fundamental problem in computational and structural biology. While there have been many experimental/heuristic methods and empirical results, very few results are known regarding the algorithmic/complexity aspects of the problem, especially on protein local structure alignment. A well-known measure to characterize the similarity of two polygonal chains is the Fréchet distance, and with the application of protein-related research, a related discrete Fréchet distance has been used recently. In this paper, following the recent work of Jiang et al., we investigate the protein local structural alignment problem using bounded discrete Fréchet distance. Given m proteins (or protein backbones, which are 3D polygonal chains), each of length O(n), our main results are summarized as follows:
* If the number of proteins, m, is not part of the input, then the problem is NP-complete; moreover, under bounded discrete Fréchet distance it is NP-hard to approximate the maximum size common local structure within a factor of $n^{1-\epsilon}$. These results hold both when all the proteins are static and when translation/rotation are allowed.
* If the number of proteins, m, is a constant, then there is a polynomial time solution for the problem.
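As background for the distance measure used above, the discrete Fréchet distance between two fixed point sequences can be computed with a standard quadratic dynamic program. The sketch below covers only that pairwise, static case (no translation or rotation and no search for a common local structure), so it is not the algorithm of the paper; the function name and the coordinates are illustrative assumptions.

```python
import math


def discrete_frechet(p, q):
    """Discrete Fréchet distance between two non-empty sequences of 3D points."""
    n, m = len(p), len(q)
    if n == 0 or m == 0:
        raise ValueError("both chains must be non-empty")
    # ca[i][j] = distance for the prefixes p[:i+1] and q[:j+1]
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(p[i], q[j])  # Euclidean distance (Python 3.8+)
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                # Advance along one chain or along both, whichever keeps the leash shortest.
                ca[i][j] = max(min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1]), d)
    return ca[-1][-1]


# Two parallel backbone-like chains, one unit apart.
chain_a = [(0, 0, 0), (1, 0, 0), (2, 0, 0)]
chain_b = [(0, 1, 0), (1, 1, 0), (2, 1, 0)]
print(discrete_frechet(chain_a, chain_b))
```

The table has n x m entries, so the pairwise computation is polynomial; the hardness results in the abstract concern the alignment problem over many proteins, not this basic distance.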
DNA Nanotubes for NMR Structure Determination of Membrane Proteins
Bellot, Gaëtan; McClintock, Mark A.; Chou, James J; Shih, William M.
2013-01-01
Structure determination of integral membrane proteins by solution NMR represents one of the most important challenges of structural biology. A Residual-Dipolar-Coupling-based refinement approach can be used to solve the structure of membrane proteins up to 40 kDa in size, however, a weak-alignment medium that is detergent-resistant is required. Previously, availability of media suitable for weak alignment of membrane proteins was severely limited. We describe here a protocol for robust, large-scale synthesis of detergent-resistant DNA nanotubes that can be assembled into dilute liquid crystals for application as weak-alignment media in solution NMR structure determination of membrane proteins in detergent micelles. The DNA nanotubes are heterodimers of 400nm-long six-helix bundles each self-assembled from a M13-based p7308 scaffold strand and >170 short oligonucleotide staple strands. Compatibility with proteins bearing considerable positive charge as well as modulation of molecular alignment, towards collection of linearly independent restraints, can be introduced by reducing the negative charge of DNA nanotubes via counter ions and small DNA binding molecules. This detergent-resistant liquid-crystal media offers a number of properties conducive for membrane protein alignment, including high-yield production, thermal stability, buffer compatibility, and structural programmability. Production of sufficient nanotubes for 4–5 NMR experiments can be completed in one week by a single individual. PMID:23518667
DNA nanotubes for NMR structure determination of membrane proteins.
Bellot, Gaëtan; McClintock, Mark A; Chou, James J; Shih, William M
2013-04-01
Finding a way to determine the structures of integral membrane proteins using solution nuclear magnetic resonance (NMR) spectroscopy has proved to be challenging. A residual-dipolar-coupling-based refinement approach can be used to resolve the structure of membrane proteins up to 40 kDa in size, but to do this you need a weak-alignment medium that is detergent-resistant and it has thus far been difficult to obtain such a medium suitable for weak alignment of membrane proteins. We describe here a protocol for robust, large-scale synthesis of detergent-resistant DNA nanotubes that can be assembled into dilute liquid crystals for application as weak-alignment media in solution NMR structure determination of membrane proteins in detergent micelles. The DNA nanotubes are heterodimers of 400-nm-long six-helix bundles, each self-assembled from a M13-based p7308 scaffold strand and >170 short oligonucleotide staple strands. Compatibility with proteins bearing considerable positive charge as well as modulation of molecular alignment, toward collection of linearly independent restraints, can be introduced by reducing the negative charge of DNA nanotubes using counter ions and small DNA-binding molecules. This detergent-resistant liquid-crystal medium offers a number of properties conducive for membrane protein alignment, including high-yield production, thermal stability, buffer compatibility and structural programmability. Production of sufficient nanotubes for four or five NMR experiments can be completed in 1 week by a single individual.
Schmidt, Thomas H; Kandt, Christian
2012-10-22
At the beginning of each molecular dynamics membrane simulation stands the generation of a suitable starting structure which includes the working steps of aligning membrane and protein and seamlessly accommodating the protein in the membrane. Here we introduce two efficient and complementary methods based on pre-equilibrated membrane patches, automating these steps. Using a voxel-based cast of the coarse-grained protein, LAMBADA computes a hydrophilicity profile-derived scoring function based on which the optimal rotation and translation operations are determined to align protein and membrane. Employing an entirely geometrical approach, LAMBADA is independent from any precalculated data and aligns even large membrane proteins within minutes on a regular workstation. LAMBADA is the first tool performing the entire alignment process automatically while providing the user with the explicit 3D coordinates of the aligned protein and membrane. The second tool is an extension of the InflateGRO method addressing the shortcomings of its predecessor in a fully automated workflow. Determining the exact number of overlapping lipids based on the area occupied by the protein and restricting expansion, compression and energy minimization steps to a subset of relevant lipids through automatically calculated and system-optimized operation parameters, InflateGRO2 yields optimal lipid packing and reduces lipid vacuum exposure to a minimum preserving as much of the equilibrated membrane structure as possible. Applicable to atomistic and coarse grain structures in MARTINI format, InflateGRO2 offers high accuracy, fast performance, and increased application flexibility permitting the easy preparation of systems exhibiting heterogeneous lipid composition as well as embedding proteins into multiple membranes. 
Both tools can be used separately, in combination with other methods, or in tandem permitting a fully automated workflow while retaining a maximum level of usage control and flexibility. To assess the performance of both methods, we carried out test runs using 22 membrane proteins of different size and transmembrane structure.
Projected power iteration for network alignment
NASA Astrophysics Data System (ADS)
Onaran, Efe; Villar, Soledad
2017-08-01
The network alignment problem asks for the best correspondence between two given graphs, so that the largest possible number of edges are matched. This problem appears in many scientific problems (like the study of protein-protein interactions) and it is very closely related to the quadratic assignment problem, which has graph isomorphism, traveling salesman and minimum bisection problems as particular cases. The graph matching problem is NP-hard in general. However, under some restrictive models for the graphs, algorithms can approximate the alignment efficiently. In that spirit, the recent work by Feizi and collaborators introduces EigenAlign, a fast spectral method with convergence guarantees for Erdős-Rényi graphs. In this work we propose the algorithm Projected Power Alignment, which is a projected power iteration version of EigenAlign. We numerically show it improves the recovery rates of EigenAlign and we describe the theory that may be used to provide performance guarantees for Projected Power Alignment.
G protein-coupled odorant receptors: From sequence to structure.
de March, Claire A; Kim, Soo-Kyung; Antonczak, Serge; Goddard, William A; Golebiowski, Jérôme
2015-09-01
Odorant receptors (ORs) are the largest subfamily within class A G protein-coupled receptors (GPCRs). No experimental structural data of any OR is available to date and atomic-level insights are likely to be obtained by means of molecular modeling. In this article, we critically align sequences of ORs with those GPCRs for which a structure is available. Here, an alignment consistent with available site-directed mutagenesis data on various ORs is proposed. Using this alignment, the choice of the template is deemed rather minor for identifying residues that constitute the wall of the binding cavity or those involved in G protein recognition. © 2015 The Protein Society.
Alignment limit of the NMSSM Higgs sector
Carena, Marcela; Haber, Howard E.; Low, Ian; ...
2016-02-17
The Next-to-Minimal Supersymmetric extension of the Standard Model (NMSSM) with a Higgs boson of mass 125 GeV can be compatible with stop masses of order of the electroweak scale, thereby reducing the degree of fine-tuning necessary to achieve electroweak symmetry breaking. Moreover, in an attractive region of the NMSSM parameter space, corresponding to the "alignment limit" in which one of the neutral Higgs fields lies approximately in the same direction in field space as the doublet Higgs vacuum expectation value, the observed Higgs boson is predicted to have Standard-Model-like properties. We derive analytical expressions for the alignment conditions and show that they point toward a more natural region of parameter space for electroweak symmetry breaking, while allowing for perturbativity of the theory up to the Planck scale. Additionally, the alignment limit in the NMSSM leads to a well defined spectrum in the Higgs and Higgsino sectors, and yields a rich and interesting Higgs boson phenomenology that can be tested at the LHC. Here, we discuss the most promising channels for discovery and present several benchmark points for further study.
The ConSurf-DB: pre-calculated evolutionary conservation profiles of protein structures
Goldenberg, Ofir; Erez, Elana; Nimrod, Guy; Ben-Tal, Nir
2009-01-01
ConSurf-DB is a repository for evolutionary conservation analysis of the proteins of known structures in the Protein Data Bank (PDB). Sequence homologues of each of the PDB entries were collected and aligned using standard methods. The evolutionary conservation of each amino acid position in the alignment was calculated using the Rate4Site algorithm, implemented in the ConSurf web server. The algorithm takes into account the phylogenetic relations between the aligned proteins and the stochastic nature of the evolutionary process explicitly. Rate4Site assigns a conservation level for each position in the multiple sequence alignment using an empirical Bayesian inference. Visual inspection of the conservation patterns on the 3D structure often enables the identification of key residues that comprise the functionally important regions of the protein. The repository is updated with the latest PDB entries on a monthly basis and will be rebuilt annually. ConSurf-DB is available online at http://consurfdb.tau.ac.il/ PMID:18971256
Using structure to explore the sequence alignment space of remote homologs.
Kuziemko, Andrew; Honig, Barry; Petrey, Donald
2011-10-01
Protein structure modeling by homology requires an accurate sequence alignment between the query protein and its structural template. However, sequence alignment methods based on dynamic programming (DP) are typically unable to generate accurate alignments for remote sequence homologs, thus limiting the applicability of modeling methods. A central problem is that the alignment that is "optimal" in terms of the DP score does not necessarily correspond to the alignment that produces the most accurate structural model. That is, the correct alignment based on structural superposition will generally have a lower score than the optimal alignment obtained from sequence. Variations of the DP algorithm have been developed that generate alternative alignments that are "suboptimal" in terms of the DP score, but these still encounter difficulties in detecting the correct structural alignment. We present here a new alternative sequence alignment method that relies heavily on the structure of the template. By initially aligning the query sequence to individual fragments in secondary structure elements and combining high-scoring fragments that pass basic tests for "modelability", we can generate accurate alignments within a small ensemble. Our results suggest that the set of sequences that can currently be modeled by homology can be greatly extended.
Vamparys, Lydie; Laurent, Benoist; Carbone, Alessandra
2016-01-01
Protein–protein interactions play a key part in most biological processes and understanding their mechanism is a fundamental problem leading to numerous practical applications. The prediction of protein binding sites in particular is of paramount importance since proteins now represent a major class of therapeutic targets. Among other methods, docking simulations between two proteins known to interact can be a useful tool for the prediction of likely binding patches on a protein surface. From the analysis of the protein interfaces generated by a massive cross‐docking experiment using the 168 proteins of the Docking Benchmark 2.0, where all possible protein pairs, and not only experimental ones, have been docked together, we show that it is also possible to predict a protein's binding residues without having any prior knowledge regarding its potential interaction partners. Evaluating the performance of cross‐docking predictions using the area under the specificity‐sensitivity ROC curve (AUC) leads to an AUC value of 0.77 for the complete benchmark (compared to the 0.5 AUC value obtained for random predictions). Furthermore, a new clustering analysis performed on the binding patches that are scattered on the protein surface shows that their distribution and growth depend on the protein's functional group. Finally, in several cases, the binding‐site predictions resulting from the cross‐docking simulations lead to the identification of an alternate interface, which corresponds to the interaction with a biomolecular partner that is not included in the original benchmark. Proteins 2016; 84:1408–1421. © 2016 The Authors Proteins: Structure, Function, and Bioinformatics Published by Wiley Periodicals, Inc. PMID:27287388
Xu, Qifang; Dunbrack, Roland L
2012-11-01
Automating the assignment of existing domain and protein family classifications to new sets of sequences is an important task. Current methods often miss assignments because remote relationships fail to achieve statistical significance. Some assignments are not as long as the actual domain definitions because local alignment methods often cut alignments short. Long insertions in query sequences often erroneously result in two copies of the domain assigned to the query. Divergent repeat sequences in proteins are often missed. We have developed a multilevel procedure to produce nearly complete assignments of protein families of an existing classification system to a large set of sequences. We apply this to the task of assigning Pfam domains to sequences and structures in the Protein Data Bank (PDB). We found that HHsearch alignments frequently scored more remotely related Pfams in Pfam clans higher than closely related Pfams, thus, leading to erroneous assignment at the Pfam family level. A greedy algorithm allowing for partial overlaps was, thus, applied first to sequence/HMM alignments, then HMM-HMM alignments and then structure alignments, taking care to join partial alignments split by large insertions into single-domain assignments. Additional assignment of repeat Pfams with weaker E-values was allowed after stronger assignments of the repeat HMM. Our database of assignments, presented in a database called PDBfam, contains Pfams for 99.4% of chains >50 residues. The Pfam assignment data in PDBfam are available at http://dunbrack2.fccc.edu/ProtCid/PDBfam, which can be searched by PDB codes and Pfam identifiers. They will be updated regularly.
QuickProbs 2: Towards rapid construction of high-quality alignments of large protein families
Gudyś, Adam; Deorowicz, Sebastian
2017-01-01
The ever-increasing size of sequence databases caused by the development of high throughput sequencing poses one of the greatest challenges yet to multiple alignment algorithms. As we show, well-established techniques employed for increasing alignment quality, i.e., refinement and consistency, are ineffective when large protein families are investigated. We present QuickProbs 2, an algorithm for multiple sequence alignment. Based on probabilistic models, equipped with novel column-oriented refinement and selective consistency, it offers outstanding accuracy. When analysing hundreds of sequences, QuickProbs 2 is noticeably better than ClustalΩ and MAFFT, the previous leaders for processing numerous protein families. In the case of smaller sets, for which consistency-based methods are the best performing, QuickProbs 2 is also superior to the competitors. Due to the low computational requirements of selective consistency and the utilization of massively parallel architectures, the presented algorithm has similar execution times to ClustalΩ, and is orders of magnitude faster than full consistency approaches like MSAProbs or PicXAA. All these make QuickProbs 2 an excellent tool for aligning families ranging from a few to hundreds of proteins. PMID:28139687
A CPU benchmark for protein crystallographic refinement.
Bourne, P E; Hendrickson, W A
1990-01-01
The CPU time required to complete a cycle of restrained least-squares refinement of a protein structure from X-ray crystallographic data using the FORTRAN codes PROTIN and PROLSQ is reported for 48 different processors, ranging from single-user workstations to supercomputers. Sequential, vector, VLIW, multiprocessor, and RISC hardware architectures are compared using both a small and a large protein structure. Representative compile times for each hardware type are also given, and the improvement in run-time when coding for a specific hardware architecture is considered. The benchmarks involve scalar integer and vector floating point arithmetic and are representative of the calculations performed in many scientific disciplines.
Ajawatanawong, Pravech; Atkinson, Gemma C; Watson-Haigh, Nathan S; Mackenzie, Bryony; Baldauf, Sandra L
2012-07-01
Analyses of multiple sequence alignments generally focus on well-defined conserved sequence blocks, while the rest of the alignment is largely ignored or discarded. This is especially true in phylogenomics, where large multigene datasets are produced through automated pipelines. However, some of the most powerful phylogenetic markers have been found in the variable length regions of multiple alignments, particularly insertions/deletions (indels) in protein sequences. We have developed Sequence Feature and Indel Region Extractor (SeqFIRE) to enable the automated identification and extraction of indels from protein sequence alignments. The program can also extract conserved blocks and identify fast evolving sites using a combination of conservation and entropy. All major variables can be adjusted by the user, allowing them to identify the sets of variables most suited to a particular analysis or dataset. Thus, all major tasks in preparing an alignment for further analysis are combined in a single flexible and user-friendly program. The output includes a numbered list of indels, alignments in NEXUS format with indels annotated or removed and indel-only matrices. SeqFIRE is a user-friendly web application, freely available online at www.seqfire.org/.
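The two ideas mentioned in this abstract, locating indel regions in an alignment and scoring column variability by entropy, can be illustrated with a minimal sketch. This is not SeqFIRE's code or its actual criteria: the function names `column_entropy` and `indel_regions`, the use of `-` as the gap character, and the rule "any column containing a gap belongs to an indel region" are assumptions made only for illustration.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the residues in one column, ignoring gaps."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def indel_regions(alignment):
    """Return half-open (start, end) column ranges in which any sequence has a gap.

    `alignment` is a list of equal-length strings (hypothetical input format).
    """
    ncols = len(alignment[0])
    regions, start = [], None
    for i in range(ncols):
        has_gap = any(seq[i] == "-" for seq in alignment)
        if has_gap and start is None:
            start = i                      # a gapped run begins here
        elif not has_gap and start is not None:
            regions.append((start, i))     # the run ended at the previous column
            start = None
    if start is not None:
        regions.append((start, ncols))     # run extends to the alignment end
    return regions


alignment = ["MKT-AL", "MKTGAL", "MRT--L"]
print(indel_regions(alignment))
print([round(column_entropy(col), 3) for col in zip(*alignment)])
```

A fully conserved column has entropy 0, and a column split evenly between two residues has entropy 1 bit, so low-entropy columns outside the gapped ranges are candidates for conserved blocks.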
From concepts to clinical reality: an essay on the benchmarking of biomedical terminologies.
Smith, Barry
2006-06-01
It is only by fixing on agreed meanings of terms in biomedical terminologies that we will be in a position to achieve that accumulation and integration of knowledge that is indispensable to progress at the frontiers of biomedicine. Standardly, the goal of fixing meanings is seen as being realized through the alignment of terms on what are called 'concepts.' Part I addresses three versions of the concept-based approach--by Cimino, by Wüster, and by Campbell and associates--and surveys some of the problems to which they give rise, all of which have to do with a failure to anchor the terms in terminologies to corresponding referents in reality. Part II outlines a new, realist solution to this anchorage problem, which sees terminology construction as being motivated by the goal of alignment not on concepts but on the universals (kinds, types) in reality and thereby also on the corresponding instances (individuals, tokens). We outline the realist approach and show how on its basis we can provide a benchmark of correctness for terminologies which will at the same time allow a new type of integration of terminologies and electronic health records. We conclude by outlining ways in which the framework thus defined might be exploited for purposes of diagnostic decision-support.
Nema, Vijay; Pal, Sudhir Kumar
2013-01-01
This study was conducted to find the best-suited freely available software for protein modelling using a small set of sample proteins. The proteins ranged from small to large and had available crystal structures for benchmarking. Phyre2, Swiss-Model, CPHmodels-3.0, Homer, (PS)2, (PS)(2)-V(2), and Modweb were used for the comparison and model generation. Benchmarking was carried out for four proteins, Icl, InhA, and KatG of Mycobacterium tuberculosis and RpoB of Thermus thermophilus, to identify the most suitable software. The parameters compared during analysis gave relatively better values for Phyre2 and Swiss-Model. This comparative study indicates that Phyre2 and Swiss-Model produce good models of both small and large proteins compared with the other software screened. The other software also performed well but was often less effective at providing full-length and properly folded structures.
A low-complexity add-on score for protein remote homology search with COMER.
Margelevicius, Mindaugas
2018-06-15
Protein sequence alignment forms the basis for comparative modeling, the most reliable approach to protein structure prediction, among many other applications. Alignment between sequence families, or profile-profile alignment, represents one of the most, if not the most, sensitive means for homology detection but still necessitates improvement. We aim at improving the quality of profile-profile alignments and the sensitivity induced by them by refining profile-profile substitution scores. We have developed a new score that represents an additional component of profile-profile substitution scores. A comprehensive evaluation shows that the new add-on score statistically significantly improves both the sensitivity and the alignment quality of the COMER method. We discuss why the score leads to the improvement and its almost optimal computational complexity that makes it easily implementable in any profile-profile alignment method. An implementation of the add-on score in the open-source COMER software and data are available at https://sourceforge.net/projects/comer. The COMER software is also available on Github at https://github.com/minmarg/comer and as a Docker image (minmar/comer). Supplementary data are available at Bioinformatics online.
GuiTope: an application for mapping random-sequence peptides to protein sequences.
Halperin, Rebecca F; Stafford, Phillip; Emery, Jack S; Navalkar, Krupa Arun; Johnston, Stephen Albert
2012-01-03
Random-sequence peptide libraries are a commonly used tool to identify novel ligands for binding antibodies, other proteins, and small molecules. It is often of interest to compare the selected peptide sequences to the natural protein binding partners to infer the exact binding site or the importance of particular residues. The ability to search a set of sequences for similarity to a set of peptides may sometimes enable the prediction of an antibody epitope or a novel binding partner. We have developed a software application designed specifically for this task. GuiTope provides a graphical user interface for aligning peptide sequences to protein sequences. All alignment parameters are accessible to the user including the ability to specify the amino acid frequency in the peptide library; these frequencies often differ significantly from those assumed by popular alignment programs. It also includes a novel feature to align di-peptide inversions, which we have found improves the accuracy of antibody epitope prediction from peptide microarray data and shows utility in analyzing phage display datasets. Finally, GuiTope can randomly select peptides from a given library to estimate a null distribution of scores and calculate statistical significance. GuiTope provides a convenient method for comparing selected peptide sequences to protein sequences, including flexible alignment parameters, novel alignment features, ability to search a database, and statistical significance of results. The software is available as an executable (for PC) at http://www.immunosignature.com/software and ongoing updates and source code will be available at sourceforge.net.
Local-global alignment for finding 3D similarities in protein structures
Zemla, Adam T [Brentwood, CA]
2011-09-20
A method of finding 3D similarities in protein structures of a first molecule and a second molecule. The method comprises providing preselected information regarding the first molecule and the second molecule; comparing the first molecule and the second molecule using Longest Continuous Segments (LCS) analysis; comparing them using Global Distance Test (GDT) analysis; comparing them using Local Global Alignment Scoring function (LGA_S) analysis; and verifying the constructed alignment and repeating the steps to find the regions of 3D similarities in protein structures.
Chekmenev, Eduard Y; Hu, Jun; Gor'kov, Peter L; Brey, William W; Cross, Timothy A; Ruuge, Andres; Smirnov, Alex I
2005-04-01
This communication reports the first example of a high resolution solid-state 15N 2D PISEMA NMR spectrum of a transmembrane peptide aligned using hydrated cylindrical lipid bilayers formed inside nanoporous anodic aluminum oxide (AAO) substrates. The transmembrane domain SSDPLVVA(A-15N)SIIGILHLILWILDRL of M2 protein from influenza A virus was reconstituted in hydrated 1,2-dimyristoyl-sn-glycero-3-phosphatidylcholine bilayers that were macroscopically aligned by a conventional micro slide glass support or by the AAO nanoporous substrate. 15N and 31P NMR spectra demonstrate that both the phospholipids and the protein transmembrane domain are uniformly aligned in the nanopores. Importantly, nanoporous AAO substrates may offer several advantages for membrane protein alignment in solid-state NMR studies compared to conventional methods. Specifically, higher thermal conductivity of aluminum oxide is expected to suppress thermal gradients associated with inhomogeneous radio frequency heating. Another important advantage of the nanoporous AAO substrate is its excellent accessibility to the bilayer surface for exposure to solute molecules. Such high accessibility achieved through the substrate nanochannel network could facilitate a wide range of structure-function studies of membrane proteins by solid-state NMR.
Dunbrack, Roland L.
2012-01-01
Motivation: Automating the assignment of existing domain and protein family classifications to new sets of sequences is an important task. Current methods often miss assignments because remote relationships fail to achieve statistical significance. Some assignments are not as long as the actual domain definitions because local alignment methods often cut alignments short. Long insertions in query sequences often erroneously result in two copies of the domain assigned to the query. Divergent repeat sequences in proteins are often missed. Results: We have developed a multilevel procedure to produce nearly complete assignments of protein families of an existing classification system to a large set of sequences. We apply this to the task of assigning Pfam domains to sequences and structures in the Protein Data Bank (PDB). We found that HHsearch alignments frequently scored more remotely related Pfams in Pfam clans higher than closely related Pfams, thus, leading to erroneous assignment at the Pfam family level. A greedy algorithm allowing for partial overlaps was, thus, applied first to sequence/HMM alignments, then HMM–HMM alignments and then structure alignments, taking care to join partial alignments split by large insertions into single-domain assignments. Additional assignment of repeat Pfams with weaker E-values was allowed after stronger assignments of the repeat HMM. Our database of assignments, presented in a database called PDBfam, contains Pfams for 99.4% of chains >50 residues. Availability: The Pfam assignment data in PDBfam are available at http://dunbrack2.fccc.edu/ProtCid/PDBfam, which can be searched by PDB codes and Pfam identifiers. They will be updated regularly. Contact: Roland.Dunbracks@fccc.edu PMID:22942020
Multiple sequence alignment in HTML: colored, possibly hyperlinked, compact representations.
Campagne, F; Maigret, B
1998-02-01
Protein sequence alignments are widely used in protein structure prediction, protein engineering, modeling of proteins, etc. This type of representation is useful at different stages of scientific activity: looking at previous results, working on a research project, and presenting the results. There is a need to make it available through a network (intranet or WWW), in a way that allows biologists, chemists, and noncomputer specialists to look at the data and carry on research, possibly collaboratively. Previous methods (text-based, Java-based) are reported and their advantages are discussed. We have developed two novel approaches to represent the alignments as colored, hyper-linked HTML pages. The first method creates an HTML page that makes efficient use of the image cache mechanism of a WWW browser, so that the user waits for images to be loaded through the network only for the first alignment viewed and can then browse other alignments without delay. The generated pages can be browsed with any HTML2.0-compliant browser. The second method that we propose uses W3C-CSS1-style sheets to render alignments. This new method generates pages that require recent browsers to be viewed. We implemented these methods in the Viseur program and made a WWW service available that allows a user to convert an MSF alignment file to HTML for WWW publishing. The latter service is available at http://www.lctn.u-nancy.fr/viseur/services.html.
Kawabata, Takeshi; Nakamura, Haruki
2014-07-28
A protein-bound conformation of a target molecule can be predicted by aligning the target molecule on the reference molecule obtained from the 3D structure of the compound-protein complex. This strategy is called "similarity-based docking". For this purpose, we develop the flexible alignment program fkcombu, which aligns the target molecule based on atomic correspondences with the reference molecule. The correspondences are obtained by the maximum common substructure (MCS) of 2D chemical structures, using our program kcombu. The prediction performance was evaluated using many target-reference pairs of superimposed ligand 3D structures on the same protein in the PDB, with different ranges of chemical similarity. The details of atomic correspondence largely affected the prediction success. We found that topologically constrained disconnected MCS (TD-MCS) with the simple element-based atomic classification provides the best prediction. The crashing potential energy with the receptor protein improved the performance. We also found that the RMSD between the predicted and correct target conformations significantly correlates with the chemical similarities between target-reference molecules. Generally speaking, if the reference and target compounds have more than 70% chemical similarity, then the average RMSD of 3D conformations is <2.0 Å. We compared the performance with a rigid-body molecular alignment program based on volume-overlap scores (ShaEP). Our MCS-based flexible alignment program performed better than the rigid-body alignment program, especially when the target and reference molecules were sufficiently similar.
Heuristics for multiobjective multiple sequence alignment.
Abbasi, Maryam; Paquete, Luís; Pereira, Francisco B
2016-07-15
Aligning multiple sequences arises in many tasks in Bioinformatics. However, the alignments produced by the current software packages are highly dependent on the parameters setting, such as the relative importance of opening gaps with respect to the increase of similarity. Choosing only one parameter setting may provide an undesirable bias in further steps of the analysis and give too simplistic interpretations. In this work, we reformulate multiple sequence alignment from a multiobjective point of view. The goal is to generate several sequence alignments that represent a trade-off between maximizing the substitution score and minimizing the number of indels/gaps in the sum-of-pairs score function. This trade-off gives to the practitioner further information about the similarity of the sequences, from which she could analyse and choose the most plausible alignment. We introduce several heuristic approaches, based on local search procedures, that compute a set of sequence alignments, which are representative of the trade-off between the two objectives (substitution score and indels). Several algorithm design options are discussed and analysed, with particular emphasis on the influence of the starting alignment and neighborhood search definitions on the overall performance. A perturbation technique is proposed to improve the local search, which provides a wide range of high-quality alignments. The proposed approach is tested experimentally on a wide range of instances. We performed several experiments with sequences obtained from the benchmark database BAliBASE 3.0. To evaluate the quality of the results, we calculate the hypervolume indicator of the set of score vectors returned by the algorithms. The results obtained allow us to identify reasonably good choices of parameters for our approach. Further, we compared our method in terms of correctly aligned pairs ratio and columns correctly aligned ratio with respect to reference alignments. 
Experimental results show that our approaches can obtain better results than TCoffee and Clustal Omega in terms of the first ratio.
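The trade-off described in this abstract can be made concrete with a small sketch that scores an alignment on two objectives and keeps only the non-dominated score vectors. It is an illustration under stated assumptions, not the authors' heuristics: the names `score_vector` and `pareto_front`, the toy match/mismatch values, and the choice to count one indel for every residue-versus-gap column in each sequence pair are all hypothetical.

```python
from itertools import combinations


def score_vector(alignment, match=1, mismatch=-1):
    """Return (substitution_score, indel_count) summed over all sequence pairs.

    `alignment` is a list of equal-length strings with '-' as the gap symbol.
    Columns where both sequences of a pair have a gap are ignored.
    """
    subst, indels = 0, 0
    for a, b in combinations(alignment, 2):
        for x, y in zip(a, b):
            if x == "-" and y == "-":
                continue
            if x == "-" or y == "-":
                indels += 1
            else:
                subst += match if x == y else mismatch
    return subst, indels


def pareto_front(vectors):
    """Keep vectors that no other vector dominates.

    A vector w dominates v when it has a substitution score at least as high
    and an indel count at least as low, and differs from v.
    """
    front = []
    for v in vectors:
        dominated = any(
            w[0] >= v[0] and w[1] <= v[1] and w != v for w in vectors
        )
        if not dominated:
            front.append(v)
    return front


candidates = [["AC-T", "ACGT"], ["ACT-", "ACGT"]]
vectors = [score_vector(aln) for aln in candidates]
print(vectors)
print(pareto_front(vectors))
```

Returning the whole front, rather than a single weighted sum of the two objectives, is what lets a practitioner inspect several plausible alignments instead of committing to one gap-penalty setting.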
HIA: a genome mapper using hybrid index-based sequence alignment.
Choi, Jongpill; Park, Kiejung; Cho, Seong Beom; Chung, Myungguen
2015-01-01
A number of alignment tools have been developed to align sequencing reads to the human reference genome. The scale of information from next-generation sequencing (NGS) experiments, however, is increasing rapidly. Recent studies based on NGS technology have routinely produced exome or whole-genome sequences from several hundreds or thousands of samples. To accommodate the increasing need to analyze very large NGS data sets, it is necessary to develop faster, more sensitive and accurate mapping tools. HIA uses two indices, a hash table index and a suffix array index. The hash table performs direct lookup of a q-gram, and the suffix array performs very fast lookup of variable-length strings by exploiting binary search. We observed that combining the hash table and the suffix array (hybrid index) is much faster than the suffix array method alone for finding a substring in the reference sequence. Here, we define a matching region (MR) as a longest common substring between a reference and a read, and a candidate alignment region (CAR) as a list of MRs that are close to each other. The hybrid index is used to find CARs between a reference and a read. We found that aligning only the unmatched regions in the CAR is much faster than aligning the whole CAR. In benchmark analysis, HIA outperformed the other aligners in mapping speed, without significant loss of mapping accuracy. Our experiments show that the hybrid of hash table and suffix array is useful in terms of speed for mapping NGS sequencing reads to the human reference genome sequence. In conclusion, our tool is appropriate for aligning massive data sets generated by NGS sequencing.
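The two index types named in this abstract can be sketched in a few lines: a hash table giving direct lookup of fixed-length q-grams, and a suffix array searched by binary search for strings of any length. This is a toy illustration, not HIA's implementation; the function names, the naive suffix-array construction, and the dispatch rule in `hybrid_find` (hash table for patterns of length exactly q, suffix array otherwise) are assumptions.

```python
from collections import defaultdict


def build_qgram_index(ref, q):
    """Hash table mapping every q-gram of `ref` to its start positions."""
    index = defaultdict(list)
    for i in range(len(ref) - q + 1):
        index[ref[i:i + q]].append(i)
    return index


def build_suffix_array(ref):
    """Naive suffix array: suffix start positions in lexicographic order."""
    return sorted(range(len(ref)), key=lambda i: ref[i:])


def suffix_array_find(ref, sa, pattern):
    """Binary search the suffix array for all occurrences of `pattern`."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                       # first suffix whose prefix >= pattern
        mid = (lo + hi) // 2
        if ref[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                       # first suffix whose prefix > pattern
        mid = (lo + hi) // 2
        if ref[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


def hybrid_find(ref, qindex, q, sa, pattern):
    """Use the hash table for q-length patterns, the suffix array otherwise."""
    if len(pattern) == q:
        return list(qindex.get(pattern, []))
    return suffix_array_find(ref, sa, pattern)


ref, q = "ACGTACGTGA", 4
qindex, sa = build_qgram_index(ref, q), build_suffix_array(ref)
print(hybrid_find(ref, qindex, q, sa, "ACGT"))
print(hybrid_find(ref, qindex, q, sa, "GT"))
```

The hash lookup costs one dictionary access regardless of reference length, while the suffix array needs a logarithmic number of string comparisons, which is one plausible reading of why the abstract reports the hybrid as faster than the suffix array alone.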
Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments.
Daily, Jeff
2016-02-10
Sequence alignment algorithms are a key component of many bioinformatics applications. Though various fast Smith-Waterman local sequence alignment implementations have been developed for x86 CPUs, most are embedded into larger database search tools. In addition, fast implementations of Needleman-Wunsch global sequence alignment and its semi-global variants are not as widespread. This article presents the first software library for local, global, and semi-global pairwise intra-sequence alignments and improves the performance of previous intra-sequence implementations. A faster intra-sequence local pairwise alignment implementation is described and benchmarked, including new global and semi-global variants. Using a 375 residue query sequence, a speed of 136 billion cell updates per second (GCUPS) was achieved on a dual Intel Xeon E5-2670 24-core processor system, the highest reported for an implementation based on Farrar's 'striped' approach. Rognes's SWIPE optimal database search application is still generally the fastest available, at 1.2 to at best 2.4 times faster than Parasail for sequences shorter than 500 amino acids. However, Parasail was faster for longer sequences. For global alignments, Parasail's prefix scan implementation is generally the fastest, faster even than Farrar's 'striped' approach; however, the opal library is faster for single-threaded applications. The software library is designed for 64 bit Linux, OS X, or Windows on processors with SSE2, SSE4.1, or AVX2. Source code is available from https://github.com/jeffdaily/parasail under the Battelle BSD-style license. Applications that require optimal alignment scores could benefit from the improved performance. For the first time, SIMD global, semi-global, and local alignments are available in a stand-alone C library.
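For readers unfamiliar with the recurrence that libraries such as Parasail accelerate, the following is a plain scalar sketch of Smith-Waterman local alignment scoring. It is not Parasail's API and uses no SIMD; the function name `smith_waterman_score`, the linear gap penalty, and the toy match/mismatch values are assumptions chosen for brevity (production tools typically use substitution matrices and affine gaps).

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of strings `a` and `b` (linear gap penalty).

    Only two rows of the dynamic programming table are kept in memory.
    Each inner-loop iteration is one "cell update", the unit behind GCUPS.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Computing a score for sequences of lengths m and n takes m × n cell updates, so the 136 GCUPS figure quoted above corresponds to 136 billion executions of the inner-loop body per second.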
7TMRmine: a Web server for hierarchical mining of 7TMR proteins
Lu, Guoqing; Wang, Zhifang; Jones, Alan M; Moriyama, Etsuko N
2009-01-01
Background Seven-transmembrane region-containing receptors (7TMRs) play central roles in eukaryotic signal transduction. Due to their biomedical importance, thorough mining of 7TMRs from diverse genomes has been an active target of bioinformatics and pharmacogenomics research. The need for new and accurate 7TMR/GPCR prediction tools is paramount with the accelerated rate of acquisition of diverse sequence information. Currently available and often used protein classification methods (e.g., profile hidden Markov Models) are highly accurate for identifying their membership information among already known 7TMR subfamilies. However, these alignment-based methods are less effective for identifying remote similarities, e.g., identifying proteins from highly divergent or possibly new 7TMR families. In this regard, more sensitive (e.g., alignment-free) methods are needed to complement the existing protein classification methods. A better strategy would be to combine different classifiers, from more specific to more sensitive methods, to identify a broader spectrum of 7TMR protein candidates. Description We developed a Web server, 7TMRmine, by integrating alignment-free and alignment-based classifiers specifically trained to identify candidate 7TMR proteins as well as transmembrane (TM) prediction methods. This new tool enables researchers to easily assess the distribution of GPCR functionality in diverse genomes or individual newly-discovered proteins. 7TMRmine is easily customized and facilitates exploratory analysis of diverse genomes. Users can integrate various alignment-based, alignment-free, and TM-prediction methods in any combination and in any hierarchical order. Sixteen classifiers (including two TM-prediction methods) are available on the 7TMRmine Web server. Not only can the 7TMRmine tool be used for 7TMR mining, but also for general TM-protein analysis. Users can submit protein sequences for analysis, or explore pre-analyzed results for multiple genomes. 
The server currently includes prediction results and the summary statistics for 68 genomes. Conclusion 7TMRmine facilitates the discovery of 7TMR proteins. By combining prediction results from different classifiers in a multi-level filtering process, prioritized sets of 7TMR candidates can be obtained for further investigation. 7TMRmine can be also used as a general TM-protein classifier. Comparisons of TM and 7TMR protein distributions among 68 genomes revealed interesting differences in evolution of these protein families among major eukaryotic phyla. PMID:19538753
KISS for STRAP: user extensions for a protein alignment editor.
Gille, Christoph; Lorenzen, Stephan; Michalsky, Elke; Frömmel, Cornelius
2003-12-12
The Structural Alignment Program STRAP is a comfortable comprehensive editor and analyzing tool for protein alignments. A wide range of functions related to protein sequences and protein structures are accessible with an intuitive graphical interface. Recent features include mapping of mutations and polymorphisms onto structures and production of high quality figures for publication. Here we address the general problem of multi-purpose program packages to keep up with the rapid development of bioinformatical methods and the demand for specific program functions. STRAP was remade implementing a novel design which aims at Keeping Interfaces in STRAP Simple (KISS). KISS renders STRAP extendable to bio-scientists as well as to bio-informaticians. Scientists with basic computer skills are capable of implementing statistical methods or embedding existing bioinformatical tools in STRAP themselves. For bio-informaticians STRAP may serve as an environment for rapid prototyping and testing of complex algorithms such as automatic alignment algorithms or phylogenetic methods. Further, STRAP can be applied as an interactive web applet to present data related to a particular protein family and as a teaching tool. JAVA-1.4 or higher. http://www.charite.de/bioinf/strap/
Overcoming Sequence Misalignments with Weighted Structural Superposition
Khazanov, Nickolay A.; Damm-Ganamet, Kelly L.; Quang, Daniel X.; Carlson, Heather A.
2012-01-01
An appropriate structural superposition identifies similarities and differences between homologous proteins that are not evident from sequence alignments alone. We have coupled our Gaussian-weighted RMSD (wRMSD) tool with a sequence aligner and seed extension (SE) algorithm to create a robust technique for overlaying structures and aligning sequences of homologous proteins (HwRMSD). HwRMSD overcomes errors in the initial sequence alignment that would normally propagate into a standard RMSD overlay. SE can generate a corrected sequence alignment from the improved structural superposition obtained by wRMSD. HwRMSD’s robust performance and its superiority over standard RMSD are demonstrated over a range of homologous proteins. Its better overlay results in corrected sequence alignments with good agreement to HOMSTRAD. Finally, HwRMSD is compared to established structural alignment methods: FATCAT, SSM, CE, and Dalilite. Most methods are comparable at placing residue pairs within 2 Å, but HwRMSD places many more residue pairs within 1 Å, providing a clear advantage. Such high accuracy is essential in drug design, where small distances can have a large impact on computational predictions. This level of accuracy is also needed to correct sequence alignments in an automated fashion, especially for omics-scale analysis. HwRMSD can align homologs with low sequence identity and large conformational differences, cases where both sequence-based and structural-based methods may fail. The HwRMSD pipeline overcomes the dependency of structural overlays on initial sequence pairing and removes the need to determine the best sequence-alignment method, substitution matrix, and gap parameters for each unique pair of homologs. PMID:22733542
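The idea behind a Gaussian-weighted RMSD can be illustrated with a short sketch. The abstract does not give the weighting formula, so the form used here, w_i = exp(-d_i^2 / c) with a scaling constant c = 5 Å^2, is an assumption for illustration; the sketch also assumes the two structures are already superimposed and skips the iterative refitting a real tool would perform. All function names are hypothetical.

```python
import math


def pair_distances(coords_a, coords_b):
    """Distances between paired atoms of two already-superimposed structures."""
    return [math.dist(p, q) for p, q in zip(coords_a, coords_b)]


def rmsd(distances):
    """Standard RMSD: every pair contributes equally."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))


def gaussian_weighted_rmsd(distances, c=5.0):
    """RMSD in which each pair is weighted by exp(-d^2 / c) (assumed form),
    so that distant, flexible regions contribute little."""
    weights = [math.exp(-d * d / c) for d in distances]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all weights underflowed; structures are too far apart")
    return math.sqrt(sum(w * d * d for w, d in zip(weights, distances)) / total)


def pairs_within(distances, cutoff):
    """Number of residue pairs placed within cutoff (in Angstroms)."""
    return sum(1 for d in distances if d <= cutoff)


# Toy coordinates: two well-matched atoms and one 10 Angstrom outlier
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
b = [(0.0, 0.0, 0.0), (1.0, 0.5, 0.0), (2.0, 0.0, 10.0)]
d = pair_distances(a, b)
```

In this toy case the single outlier dominates the standard RMSD, while the weighted value stays close to the deviation of the well-matched pairs, which is the behavior the weighting is meant to provide.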
Cardon, Thomas B; Tiburu, Elvis K; Lorigan, Gary A
2003-03-01
Our lab is developing a spin-labeled EPR spectroscopic technique complementary to solid-state NMR studies to study the structure, orientation, and dynamics of uniaxially aligned integral membrane proteins inserted into magnetically aligned discotic phospholipid bilayers, or bicelles. The focus of this study is to optimize and understand the mechanisms involved in the magnetic alignment process of bicelle disks in weak magnetic fields. Developing experimental conditions for optimized magnetic alignment of bicelles in low magnetic fields may prove useful to study the dynamics of membrane proteins and their interactions with lipids, drugs, steroids, signaling events, other proteins, etc. In weak magnetic fields, the magnetic alignment of Tm(3+)-doped bicelle disks was thermodynamically and kinetically very sensitive to experimental conditions. Tm(3+)-doped bicelles were magnetically aligned using the following optimized procedure: the temperature was slowly raised at a rate of 1.9 K/min from an initial temperature between 298 and 307 K to a final temperature of 318 K in the presence of a static magnetic field of 6300 G. The spin probe 3beta-doxyl-5alpha-cholestane (cholestane) was inserted into the bicelle disks and utilized to monitor bicelle alignment by analyzing the anisotropic hyperfine splitting for the corresponding EPR spectra. The phases of the bicelles were determined using solid-state 2H NMR spectroscopy and compared with the corresponding EPR spectra. Macroscopic alignment commenced in the liquid crystalline nematic phase (307 K) and continued to increase upon slowly raising the temperature, and the bicelles were well aligned in the liquid crystalline lamellar smectic phase (318 K).
Walther, Dirk; Bartha, Gábor; Morris, Macdonald
2001-01-01
A pivotal step in electrophoresis sequencing is the conversion of the raw, continuous chromatogram data into the actual sequence of discrete nucleotides, a process referred to as basecalling. We describe a novel algorithm for basecalling implemented in the program LifeTrace. Like Phred, currently the most widely used basecalling software program, LifeTrace takes processed trace data as input. It was designed to be tolerant to variable peak spacing by means of an improved peak-detection algorithm that emphasizes local chromatogram information over global properties. LifeTrace is shown to generate high-quality basecalls and reliable quality scores. It proved particularly effective when applied to MegaBACE capillary sequencing machines. In a benchmark test of 8372 dye-primer MegaBACE chromatograms, LifeTrace generated 17% fewer substitution errors, 16% fewer insertion/deletion errors, and 2.4% more aligned bases to the finished sequence than did Phred. For two sets totaling 6624 dye-terminator chromatograms, the performance improvement was 15% fewer substitution errors, 10% fewer insertion/deletion errors, and 2.1% more aligned bases. The processing time required by LifeTrace is comparable to that of Phred. The predicted quality scores were in line with observed quality scores, permitting direct use for quality clipping and in silico single nucleotide polymorphism (SNP) detection. Furthermore, we introduce a new type of quality score associated with every basecall: the gap-quality. It estimates the probability of a deletion error between the current and the following basecall. This additional quality score improves detection of single basepair deletions when used for locating potential basecalling errors during the alignment. We also describe a new protocol for benchmarking that we believe better discerns basecaller performance differences than methods previously published. PMID:11337481
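The quality scores discussed above are conventionally reported on the Phred scale, Q = -10 log10(P), where P is the estimated error probability of a basecall. The sketch below shows that conversion and a very simple end-clipping rule. Whether LifeTrace's gap-quality uses the same logarithmic scale is not stated in the abstract, and the clipping rule and the threshold of 20 are assumptions for illustration rather than either program's actual procedure.

```python
import math


def error_prob_to_quality(p):
    """Phred-scaled quality for an error probability p (0 < p <= 1)."""
    return -10.0 * math.log10(p)


def quality_to_error_prob(q):
    """Error probability implied by a Phred-scaled quality q."""
    return 10.0 ** (-q / 10.0)


def expected_errors(qualities):
    """Expected number of miscalled bases in a read."""
    return sum(quality_to_error_prob(q) for q in qualities)


def clip_ends(qualities, min_quality=20):
    """Return (start, end) of the read after trimming low-quality bases
    from both ends; a deliberately simple, hypothetical clipping rule."""
    start, end = 0, len(qualities)
    while start < end and qualities[start] < min_quality:
        start += 1
    while end > start and qualities[end - 1] < min_quality:
        end -= 1
    return start, end
```

For example, a quality of 20 corresponds to an estimated error probability of 1 in 100, and a quality of 30 to 1 in 1000.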
Reactor Pressure Vessel Fracture Analysis Capabilities in Grizzly
DOE Office of Scientific and Technical Information (OSTI.GOV)
Spencer, Benjamin; Backman, Marie; Chakraborty, Pritam
2015-03-01
Efforts have been underway to develop fracture mechanics capabilities in the Grizzly code to enable it to be used to perform deterministic fracture assessments of degraded reactor pressure vessels (RPVs). Development in prior years has resulted in a capability to calculate J-integrals. For this application, these are used to calculate stress intensity factors for cracks to be used in deterministic linear elastic fracture mechanics (LEFM) assessments of fracture in degraded RPVs. The J-integral can only be used to evaluate stress intensity factors for axis-aligned flaws because it can only be used to obtain the stress intensity factor for pure Mode I loading. Off-axis flaws will be subjected to mixed-mode loading. For this reason, work has continued to expand the set of fracture mechanics capabilities to permit it to evaluate off-axis flaws. This report documents the following work to enhance Grizzly’s engineering fracture mechanics capabilities for RPVs: • Interaction Integral and T-stress: To obtain mixed-mode stress intensity factors, a capability to evaluate interaction integrals for 2D or 3D flaws has been developed. A T-stress evaluation capability has been developed to evaluate the constraint at crack tips in 2D or 3D. Initial verification testing of these capabilities is documented here. • Benchmarking for axis-aligned flaws: Grizzly’s capabilities to evaluate stress intensity factors for axis-aligned flaws have been benchmarked against calculations for the same conditions in FAVOR. • Off-axis flaw demonstration: The newly-developed interaction integral capabilities are demonstrated in an application to calculate the mixed-mode stress intensity factors for off-axis flaws. • Other code enhancements: Other enhancements to the thermomechanics capabilities that relate to the solution of the engineering RPV fracture problem are documented here.
Rigorous electromagnetic simulation applied to alignment systems
NASA Astrophysics Data System (ADS)
Deng, Yunfei; Pistor, Thomas V.; Neureuther, Andrew R.
2001-09-01
Rigorous electromagnetic simulation with TEMPEST is used to provide benchmark data and understanding of key parameters in the design of topographical features of alignment marks. Periodic large silicon trenches are analyzed as a function of wavelength (530-800 nm), duty cycle, depth, slope and angle of incidence. The signals are well behaved except when the trench width becomes about 1 micrometer or smaller. Segmentation of the trenches to form 3D marks shows that a segmentation period of 2-5 wavelengths makes the diffraction in the (1,1) direction about 1/3 to 1/2 of that in the main first order (1,0). Transmission alignment marks for nanoimprint lithography using the difference between the +1 and -1 reflected orders showed a sensitivity of the difference signal to misalignment of 0.7%/nm for rigorous simulation and 0.5%/nm for simple ray-tracing. The sensitivity to a slanted substrate indentation was 10 nm offset per degree of tilt from horizontal.
On the orientation of the backbone dipoles in native folds
Ripoll, Daniel R.; Vila, Jorge A.; Scheraga, Harold A.
2005-01-01
The role of electrostatic interactions in determining the native fold of proteins has been investigated by analyzing the alignment of peptide bond dipole moments with the local electrostatic field generated by the rest of the molecule with and without solvent effects. This alignment was calculated for a set of 112 native proteins by using charges from a gas phase potential. Most of the peptide dipoles in this set of proteins are on average aligned with the electrostatic field. The dipole moments associated with α-helical conformations show the best alignment with the electrostatic field, followed by residues in β-strand conformations. The dipole moments associated with other secondary structure elements are on average better aligned than in randomly generated conformations. The alignment of a dipole with the local electrostatic field depends on both the topology of the native fold and the charge distribution assumed for all of the residues. The influences of (i) solvent effects, (ii) different sets of charges, and (iii) the charge distribution assumed for the whole molecule were examined with a subset of 22 proteins each of which contains <30 ionizable groups. The results show that alternative charge distribution models lead to significant differences among the associated electrostatic fields, whereas the electrostatic field is less sensitive to the particular set of the adopted charges themselves (empirical conformational energy program for peptides or parameters for solvation energy). PMID:15894608
Abriata, Luciano A; Bovigny, Christophe; Dal Peraro, Matteo
2016-06-17
Protein variability can now be studied by measuring high-resolution tolerance-to-substitution maps and fitness landscapes in saturated mutational libraries. But these rich and expensive datasets are typically interpreted coarsely, restricting detailed analyses to positions of extremely high or low variability or dubbed important beforehand based on existing knowledge about active sites, interaction surfaces, (de)stabilizing mutations, etc. Our new webserver PsychoProt (freely available without registration at http://psychoprot.epfl.ch or at http://lucianoabriata.altervista.org/psychoprot/index.html ) helps to detect, quantify, and sequence/structure map the biophysical and biochemical traits that shape amino acid preferences throughout a protein as determined by deep-sequencing of saturated mutational libraries or from large alignments of naturally occurring variants. We exemplify how PsychoProt helps to (i) unveil protein structure-function relationships from experiments and from alignments that are consistent with structures according to coevolution analysis, (ii) recall global information about structural and functional features and identify hitherto unknown constraints to variation in alignments, and (iii) point at different sources of variation among related experimental datasets or between experimental and alignment-based data. Remarkably, metabolic costs of the amino acids pose strong constraints to variability at protein surfaces in nature but not in the laboratory. This and other differences call for caution when extrapolating results from in vitro experiments to natural scenarios in, for example, studies of protein evolution. We show through examples how PsychoProt can be a useful tool for the broad communities of structural biology and molecular evolution, particularly for studies about protein modeling, evolution and design.
G protein-coupled odorant receptors: From sequence to structure
de March, Claire A; Kim, Soo-Kyung; Antonczak, Serge; Goddard, William A; Golebiowski, Jérôme
2015-01-01
Odorant receptors (ORs) are the largest subfamily within class A G protein-coupled receptors (GPCRs). No experimental structural data of any OR is available to date and atomic-level insights are likely to be obtained by means of molecular modeling. In this article, we critically align sequences of ORs with those GPCRs for which a structure is available. Here, an alignment consistent with available site-directed mutagenesis data on various ORs is proposed. Using this alignment, the choice of the template is deemed rather minor for identifying residues that constitute the wall of the binding cavity or those involved in G protein recognition. PMID:26044705
A protein block based fold recognition method for the annotation of twilight zone sequences.
Suresh, V; Ganesan, K; Parthasarathy, S
2013-03-01
The description of the protein backbone was recently improved with a group of structural fragments called Structural Alphabets, replacing the regular three-state (Helix, Sheet and Coil) secondary structure description. Protein Blocks is one of the Structural Alphabets used to describe each and every region of the protein backbone, including the coil. According to de Brevern (2000), the Protein Blocks alphabet has 16 structural fragments, each 5 residues in length. Protein Blocks fragments are highly informative among the available Structural Alphabets and have been used for many applications. Here, we present a protein fold recognition method based on Protein Blocks for the annotation of twilight zone sequences. In our method, we align the predicted Protein Blocks of a query amino acid sequence with a library of assigned Protein Blocks of 953 known folds using local pair-wise alignment. The alignment results with z-value ≥ 2.5 and P-value ≤ 0.08 are predicted as possible folds. Our method is able to recognize the possible folds for nearly 35.5% of the twilight zone sequences with their predicted Protein Block sequence obtained by pb_prediction, which is available at Protein Block Export server.
2013-01-01
Background While a large body of work exists on comparing and benchmarking descriptors of molecular structures, a similar comparison of protein descriptor sets is lacking. Hence, in the current work a total of 13 amino acid descriptor sets have been benchmarked with respect to their ability to establish bioactivity models. The descriptor sets included in the study are Z-scales (3 variants), VHSE, T-scales, ST-scales, MS-WHIM, FASGAI, BLOSUM, a novel protein descriptor set (termed ProtFP (4 variants)), and in addition we created and benchmarked three pairs of descriptor combinations. Prediction performance was evaluated in seven structure-activity benchmarks which comprise Angiotensin Converting Enzyme (ACE) dipeptidic inhibitor data, and three proteochemometric data sets, namely (1) GPCR ligands modeled against a GPCR panel, (2) enzyme inhibitors (NNRTIs) with associated bioactivities against a set of HIV enzyme mutants, and (3) enzyme inhibitors (PIs) with associated bioactivities on a large set of HIV enzyme mutants. Results The amino acid descriptor sets compared here show similar performance (<0.1 log units RMSE difference and <0.1 difference in MCC), while errors for individual proteins were in some cases found to be larger than those resulting from descriptor set differences (>0.3 log units RMSE difference and >0.7 difference in MCC). Combining different descriptor sets generally leads to better modeling performance than utilizing individual sets. The best performers were Z-scales (3) combined with ProtFP (Feature), or Z-Scales (3) combined with an average Z-Scale value for each target, while ProtFP (PCA8), ST-Scales, and ProtFP (Feature) rank last. Conclusions While amino acid descriptor sets capture different aspects of amino acids, their ability to be used for bioactivity modeling is still, on average, surprisingly similar. 
Still, combining sets describing complementary information leads to a small but consistent improvement in modeling performance (average MCC 0.01 better, average RMSE 0.01 log units lower). Finally, performance differences exist between the targets compared, thereby underlining that choosing an appropriate descriptor set is of fundamental importance for bioactivity modeling, both from the ligand as well as the protein side. PMID:24059743
HomPPI: a class of sequence homology based protein-protein interface prediction methods
2011-01-01
Background Although homology-based methods are among the most widely used methods for predicting the structure and function of proteins, the question as to whether interface sequence conservation can be effectively exploited in predicting protein-protein interfaces has been a subject of debate. Results We studied more than 300,000 pair-wise alignments of protein sequences from structurally characterized protein complexes, including both obligate and transient complexes. We identified sequence similarity criteria required for accurate homology-based inference of interface residues in a query protein sequence. Based on these analyses, we developed HomPPI, a class of sequence homology-based methods for predicting protein-protein interface residues. We present two variants of HomPPI: (i) NPS-HomPPI (Non partner-specific HomPPI), which can be used to predict interface residues of a query protein in the absence of knowledge of the interaction partner; and (ii) PS-HomPPI (Partner-specific HomPPI), which can be used to predict the interface residues of a query protein with a specific target protein. Our experiments on a benchmark dataset of obligate homodimeric complexes show that NPS-HomPPI can reliably predict protein-protein interface residues in a given protein, with an average correlation coefficient (CC) of 0.76, sensitivity of 0.83, and specificity of 0.78, when sequence homologs of the query protein can be reliably identified. NPS-HomPPI also reliably predicts the interface residues of intrinsically disordered proteins. Our experiments suggest that NPS-HomPPI is competitive with several state-of-the-art interface prediction servers including those that exploit the structure of the query proteins. 
The partner-specific classifier, PS-HomPPI can, on a large dataset of transient complexes, predict the interface residues of a query protein with a specific target, with a CC of 0.65, sensitivity of 0.69, and specificity of 0.70, when homologs of both the query and the target can be reliably identified. The HomPPI web server is available at http://homppi.cs.iastate.edu/. Conclusions Sequence homology-based methods offer a class of computationally efficient and reliable approaches for predicting the protein-protein interface residues that participate in either obligate or transient interactions. For query proteins involved in transient interactions, the reliability of interface residue prediction can be improved by exploiting knowledge of putative interaction partners. PMID:21682895
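The performance figures quoted above (correlation coefficient, sensitivity, specificity) are all derived from a per-residue confusion matrix. The sketch below uses the textbook definitions, with CC taken to be the Matthews correlation coefficient; that reading is an assumption, since the abstract does not spell out its formulas. Note that some interface-prediction papers use "specificity" for what is elsewhere called precision, TP/(TP+FP), so both quantities are returned. The function name and the toy labels are hypothetical.

```python
import math


def interface_metrics(actual, predicted):
    """Per-residue metrics from binary labels (1 = interface residue)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)

    def ratio(num, den):
        return num / den if den else 0.0

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),   # fraction of true interface residues found
        "specificity": ratio(tn, tn + fp),   # textbook definition
        "precision": ratio(tp, tp + fp),     # sometimes reported as "specificity"
        "cc": ratio(tp * tn - fp * fn, denom),  # Matthews correlation coefficient
    }


# Toy example: 8 residues, 3 of them true interface residues
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = interface_metrics(actual, predicted)
```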
A novel approach to multiple sequence alignment using hadoop data grids.
Sudha Sadasivam, G; Baktavatchalam, G
2010-01-01
Multiple alignment of protein sequences helps to determine evolutionary linkage and to predict molecular structures. The factors to be considered while aligning multiple sequences are speed and accuracy of alignment. Although dynamic programming algorithms produce accurate alignments, they are computation intensive. In this paper we propose a time efficient approach to sequence alignment that also produces quality alignment. The dynamic nature of the algorithm coupled with data and computational parallelism of hadoop data grids improves the accuracy and speed of sequence alignment. The principle of block splitting in hadoop coupled with its scalability facilitates alignment of very large sequences.
fRMSDPred: Predicting Local RMSD Between Structural Fragments Using Sequence Information
2007-04-04
machine learning approaches for estimating the RMSD value of a pair of protein fragments. These estimated fragment-level RMSD values can be used to construct the alignment, assess the quality of an alignment, and identify high-quality alignment segments. We present algorithms to solve this fragment-level RMSD prediction problem using a supervised learning framework based on support vector regression and classification that incorporates protein profiles, predicted secondary structure, effective information encoding schemes, and novel second-order pairwise exponential kernel
Multiple sequence alignment using multi-objective based bacterial foraging optimization algorithm.
Rani, R Ranjani; Ramyachitra, D
2016-12-01
Multiple sequence alignment (MSA) is a widespread approach in computational biology and bioinformatics. MSA deals with how sequences of nucleotides and amino acids are aligned with the minimum number of gaps between them, which points to the functional, evolutionary and structural relationships among the sequences. Still, computing an MSA that is both accurate and statistically significant remains a challenging task. In this work, the Bacterial Foraging Optimization Algorithm was employed to align the biological sequences, which resulted in a non-dominated optimal solution. It employs multiple objectives: maximization of similarity, non-gap percentage and conserved blocks, and minimization of gap penalty. The BAliBASE 3.0 benchmark database was utilized to examine the proposed algorithm against other methods. In this paper, two algorithms have been proposed: Hybrid Genetic Algorithm with Artificial Bee Colony (GA-ABC) and Bacterial Foraging Optimization Algorithm. It was found that Hybrid Genetic Algorithm with Artificial Bee Colony performed better than the existing optimization algorithms, but the conserved blocks were still not obtained using GA-ABC. BFO was then used for the alignment, and the conserved blocks were obtained. The proposed Multi-Objective Bacterial Foraging Optimization Algorithm (MO-BFO) was compared with the widely used MSA methods Clustal Omega, Kalign, MUSCLE, MAFFT, Genetic Algorithm (GA), Ant Colony Optimization (ACO), Artificial Bee Colony (ABC), Particle Swarm Optimization (PSO) and Hybrid Genetic Algorithm with Artificial Bee Colony (GA-ABC). The final results show that the proposed MO-BFO algorithm yields better alignment than most widely used methods. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
A survey and evaluations of histogram-based statistics in alignment-free sequence comparison.
Luczak, Brian B; James, Benjamin T; Girgis, Hani Z
2017-12-06
Since the dawn of the bioinformatics field, sequence alignment scores have been the main method for comparing sequences. However, alignment algorithms are quadratic, requiring long execution time. As alternatives, scientists have developed tens of alignment-free statistics for measuring the similarity between two sequences. We surveyed tens of alignment-free k-mer statistics. Additionally, we evaluated 33 statistics and multiplicative combinations between the statistics and/or their squares. These statistics are calculated on two k-mer histograms representing two sequences. Our evaluations using global alignment scores revealed that the majority of the statistics are sensitive and capable of finding similar sequences to a query sequence. Therefore, any of these statistics can filter out dissimilar sequences quickly. Further, we observed that multiplicative combinations of the statistics are highly correlated with the identity score. Furthermore, combinations involving sequence length difference or Earth Mover's distance, which takes the length difference into account, are always among the highest correlated paired statistics with identity scores. Similarly, paired statistics including length difference or Earth Mover's distance are among the best performers in finding the K-closest sequences. Interestingly, similar performance can be obtained using histograms of shorter words, resulting in reducing the memory requirement and increasing the speed remarkably. Moreover, we found that simple single statistics are sufficient for processing next-generation sequencing reads and for applications relying on local alignment. Finally, we measured the time requirement of each statistic. The survey and the evaluations will help scientists with identifying efficient alternatives to the costly alignment algorithm, saving thousands of computational hours. The source code of the benchmarking tool is available as Supplementary Materials. © The Author 2017. 
Published by Oxford University Press.
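To make the notion of a histogram-based statistic concrete, the sketch below builds k-mer histograms for two sequences and computes two simple statistics on them, Manhattan distance and cosine similarity. These two were picked as familiar examples and are not claimed to be among the best performers in the survey; the function names are hypothetical. The cost is linear in sequence length, in contrast to the quadratic cost of alignment.

```python
import math
from collections import Counter


def kmer_histogram(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def manhattan_distance(h1, h2):
    """Sum of absolute count differences over all k-mers seen in either histogram."""
    return sum(abs(h1[w] - h2[w]) for w in set(h1) | set(h2))


def cosine_similarity(h1, h2):
    """Cosine of the angle between the two count vectors (1.0 = same direction)."""
    dot = sum(h1[w] * h2[w] for w in set(h1) & set(h2))
    norm1 = math.sqrt(sum(v * v for v in h1.values()))
    norm2 = math.sqrt(sum(v * v for v in h2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


# Toy usage with k = 2
h_query = kmer_histogram("ACGTAC", 2)
h_same = kmer_histogram("ACGTAC", 2)
h_other = kmer_histogram("CCCCCC", 2)
```

Shorter words give smaller histograms, which is the memory and speed trade-off the abstract refers to.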
Bayesian comparison of protein structures using partial Procrustes distance.
Ejlali, Nasim; Faghihi, Mohammad Reza; Sadeghi, Mehdi
2017-09-26
Protein structure alignment is an important topic in bioinformatics. Some statistical methods have been proposed for this problem, but most of them align two protein structures based on the global geometric information without considering the effect of neighbourhood in the structures. In this paper, we provide a Bayesian model to align protein structures by considering the effect of both local and global geometric information of protein structures. Local geometric information is incorporated into the model through the partial Procrustes distance of small substructures. These substructures are composed of β-carbon atoms from the side chains. Parameters are estimated using a Markov chain Monte Carlo (MCMC) approach. We evaluate the performance of our model through some simulation studies. Furthermore, we apply our model to a real dataset and assess the accuracy and convergence rate. Results show that our model is much more efficient than previous approaches.
GeneSilico protein structure prediction meta-server.
Kurowski, Michal A; Bujnicki, Janusz M
2003-07-01
Rigorous assessments of protein structure prediction have demonstrated that fold recognition methods can identify remote similarities between proteins when standard sequence search methods fail. It has been shown that the accuracy of predictions is improved when refined multiple sequence alignments are used instead of single sequences and if different methods are combined to generate a consensus model. There are several meta-servers available that integrate protein structure predictions performed by various methods, but they do not allow for submission of user-defined multiple sequence alignments and they seldom offer confidentiality of the results. We developed a novel WWW gateway for protein structure prediction, which combines the useful features of other meta-servers available, but with much greater flexibility of the input. The user may submit an amino acid sequence or a multiple sequence alignment to a set of methods for primary, secondary and tertiary structure prediction. Fold-recognition results (target-template alignments) are converted into full-atom 3D models and the quality of these models is uniformly assessed. A consensus between different FR methods is also inferred. The results are conveniently presented on-line on a single web page over a secure, password-protected connection. The GeneSilico protein structure prediction meta-server is freely available for academic users at http://genesilico.pl/meta. PMID:12824313
Chen, Jonathan S.; Reddy, Vamsee; Chen, Joshua H.; Shlykov, Maksim A.; Zheng, Wei Hao; Cho, Jaehoon; Yen, Ming Ren; Saier, Milton H.
2012-01-01
Transport proteins function in the translocation of ions, solutes and macromolecules across cellular and organellar membranes. These integral membrane proteins fall into >600 families as tabulated in the Transporter Classification Database (www.tcdb.org). Recent studies, some of which are reported here, define distant phylogenetic relationships between families with the creation of superfamilies. Several of these are analyzed using a novel set of programs designed to allow reliable prediction of phylogenetic trees when sequence divergence is too great to allow the use of multiple alignments. These new programs, called SuperfamilyTree1 and 2 (SFT1 and 2), allow display of protein and family relationships, respectively, based on thousands of comparative BLAST scores rather than multiple alignments. Superfamilies analyzed include: (1) Aerolysins, (2) RTX Toxins, (3) Defensins, (4) Ion Transporters, (5) Bile/Arsenite/Riboflavin Transporters, (6) Cation: Proton Antiporters, and (7) the Glucose/Fructose/Lactose superfamily within the prokaryotic phosphoenol pyruvate-dependent Phosphotransferase System. In addition to defining the phylogenetic relationships of the proteins and families within these seven superfamilies, evidence is provided showing that the SFT programs outperform programs that are based on multiple alignments whenever sequence divergence of superfamily members is extensive. The SFT programs should be applicable to virtually any superfamily of proteins or nucleic acids. PMID:22286036
HDOCK: a web server for protein–protein and protein–DNA/RNA docking based on a hybrid strategy
Yan, Yumeng; Zhang, Di; Zhou, Pei; Li, Botong
2017-01-01
Protein–protein and protein–DNA/RNA interactions play a fundamental role in a variety of biological processes. Determining the complex structures of these interactions is valuable, and molecular docking has played an important role in this task. To automatically make use of the binding information from the PDB in docking, we present HDOCK, a novel web server for our hybrid docking algorithm, which combines template-based modeling and free docking so that cases with misleading templates can be rescued by the free docking protocol. The server supports protein–protein and protein–DNA/RNA docking and accepts both sequence and structure inputs for proteins. A docking run is fast, taking about 10–20 min. Tested on the cases with weakly homologous complexes of <30% sequence identity from five docking benchmarks, the HDOCK pipeline tied with template-based modeling on the protein–protein and protein–DNA benchmarks and performed better than template-based modeling on the three protein–RNA benchmarks when the top 10 predictions were considered. The performance of HDOCK improved when more predictions were considered. Combining the results of HDOCK and template-based modeling by ranking the template-based model first further improved the predictive power of the server. The HDOCK web server is available at http://hdock.phys.hust.edu.cn/. PMID:28521030
Knowledge-Guided Docking of WW Domain Proteins and Flexible Ligands
NASA Astrophysics Data System (ADS)
Lu, Haiyun; Li, Hao; Banu Bte Sm Rashid, Shamima; Leow, Wee Kheng; Liou, Yih-Cherng
Studies of interactions between protein domains and ligands are important in many aspects such as cellular signaling. We present a knowledge-guided approach for docking protein domains and flexible ligands. The approach is applied to the WW domain, a small protein module mediating signaling complexes which have been implicated in diseases such as muscular dystrophy and Liddle’s syndrome. The first stage of the approach employs a substring search for two binding grooves of WW domains and possible binding motifs of peptide ligands based on known features. The second stage aligns the ligand’s peptide backbone to the two binding grooves using a quasi-Newton constrained optimization algorithm. The backbone-aligned ligands produced serve as good starting points to the third stage which uses any flexible docking algorithm to perform the docking. The experimental results demonstrate that the backbone alignment method in the second stage performs better than conventional rigid superposition given two binding constraints. It is also shown that using the backbone-aligned ligands as initial configurations improves the flexible docking in the third stage. The presented approach can also be applied to other protein domains that involve binding of flexible ligand to two or more binding sites.
Nema, Vijay; Pal, Sudhir Kumar
2013-01-01
Aim: This study was conducted to find the best-suited freely available software for protein modelling, using a few sample proteins. The proteins ranged from small to large and had available crystal structures for benchmarking. Key players such as Phyre2, Swiss-Model, CPHmodels-3.0, Homer, (PS)2, (PS)2-V2 and Modweb were used for model generation and comparison. Results: Benchmarking was done for four proteins, Icl, InhA, and KatG of Mycobacterium tuberculosis and RpoB of Thermus thermophilus, to identify the most suitable software. The parameters compared during analysis gave relatively better values for Phyre2 and Swiss-Model. Conclusion: This comparative study indicates that Phyre2 and Swiss-Model produce good models of both small and large proteins compared with the other software screened. The other software was also good but was often not very efficient at providing full-length and properly folded structures. PMID:24023424
Chaturvedi, Palak; Doerfler, Hannes; Jegadeesan, Sridharan; Ghatak, Arindam; Pressman, Etan; Castillejo, Maria Angeles; Wienkoop, Stefanie; Egelhofer, Volker; Firon, Nurit; Weckwerth, Wolfram
2015-11-06
Recently, we have developed a quantitative shotgun proteomics strategy called mass accuracy precursor alignment (MAPA). The MAPA algorithm uses high mass accuracy to bin mass-to-charge (m/z) ratios of precursor ions from LC-MS analyses, determines their intensities, and extracts a quantitative sample versus m/z ratio data alignment matrix from a multitude of samples. Here, we introduce a novel feature of this algorithm that allows the extraction and alignment of proteotypic peptide precursor ions or any other target peptide from complex shotgun proteomics data for accurate quantification of unique proteins. This strategy circumvents the problem of confusing the quantification of proteins due to indistinguishable protein isoforms by a typical shotgun proteomics approach. We applied this strategy to a comparison of control and heat-treated tomato pollen grains at two developmental stages, post-meiotic and mature. Pollen is a temperature-sensitive tissue involved in the reproductive cycle of plants and plays a major role in fruit setting and yield. By LC-MS-based shotgun proteomics, we identified more than 2000 proteins in total for all different tissues. By applying the targeted MAPA data-processing strategy, 51 unique proteins were identified as heat-treatment-responsive protein candidates. The potential function of the identified candidates in a specific developmental stage is discussed.
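The binning step described in the MAPA abstract above can be illustrated with a minimal sketch. It assumes fixed-width m/z bins and summed intensities per bin; the function name `build_alignment_matrix`, the `bin_width` value, and the sample data are hypothetical and are not taken from the MAPA software itself.

```python
import math
from collections import defaultdict


def build_alignment_matrix(samples, bin_width=0.01):
    """Bin precursor m/z values and build a sample-versus-bin intensity matrix.

    samples: dict mapping sample name -> list of (mz, intensity) pairs.
    Returns (bins, matrix) where bins is a sorted list of integer bin indices
    and matrix maps each sample name to a list of summed intensities per bin.
    Fixed-width binning is an assumption made for this illustration.
    """
    binned = {}
    all_bins = set()
    for name, peaks in samples.items():
        sums = defaultdict(float)
        for mz, intensity in peaks:
            # Precursors whose m/z falls in the same bin are treated as aligned.
            sums[math.floor(mz / bin_width)] += intensity
        binned[name] = sums
        all_bins.update(sums)
    bins = sorted(all_bins)
    matrix = {name: [binned[name].get(b, 0.0) for b in bins] for name in samples}
    return bins, matrix


# Hypothetical example data: two samples, three precursor ions.
example = {
    "control": [(500.123, 10.0), (500.127, 5.0), (612.454, 3.0)],
    "heat": [(500.125, 8.0)],
}
bins, matrix = build_alignment_matrix(example)
```

Each row of `matrix` can then be compared across samples; a bin that is populated in one condition and empty in another shows up as a zero entry rather than a missing value.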
De novo identification of highly diverged protein repeats by probabilistic consistency.
Biegert, A; Söding, J
2008-03-15
An estimated 25% of all eukaryotic proteins contain repeats, which underlines the importance of duplication for evolving new protein functions. Internal repeats often correspond to structural or functional units in proteins. Methods capable of identifying diverged repeated segments or domains at the sequence level can therefore assist in predicting domain structures, inferring hypotheses about function and mechanism, and investigating the evolution of proteins from smaller fragments. We present HHrepID, a method for the de novo identification of repeats in protein sequences. It is able to detect the sequence signature of structural repeats in many proteins that have not yet been known to possess internal sequence symmetry, such as outer membrane beta-barrels. HHrepID uses HMM-HMM comparison to exploit evolutionary information in the form of multiple sequence alignments of homologs. In contrast to a previous method, the new method (1) generates a multiple alignment of repeats; (2) utilizes the transitive nature of homology through a novel merging procedure with fully probabilistic treatment of alignments; (3) improves alignment quality through an algorithm that maximizes the expected accuracy; (4) is able to identify different kinds of repeats within complex architectures by a probabilistic domain boundary detection method and (5) improves sensitivity through a new approach to assess statistical significance. Server: http://toolkit.tuebingen.mpg.de/hhrepid; Executables: ftp://ftp.tuebingen.mpg.de/pub/protevo/HHrepID
NASA Astrophysics Data System (ADS)
Lindner, Robert; Lou, Xinghua; Reinstein, Jochen; Shoeman, Robert L.; Hamprecht, Fred A.; Winkler, Andreas
2014-06-01
Hydrogen-deuterium exchange (HDX) experiments analyzed by mass spectrometry (MS) provide information about the dynamics and the solvent accessibility of protein backbone amide hydrogen atoms. Continuous improvement of MS instrumentation has contributed to the increasing popularity of this method; however, comprehensive automated data analysis is only beginning to mature. We present Hexicon 2, an automated pipeline for data analysis and visualization based on the previously published program Hexicon (Lou et al. 2010). Hexicon 2 employs the sensitive NITPICK peak detection algorithm of its predecessor in a divide-and-conquer strategy and adds new features, such as chromatogram alignment and improved peptide sequence assignment. The unique feature of deuteration distribution estimation was retained in Hexicon 2 and improved using an iterative deconvolution algorithm that is robust even to noisy data. In addition, Hexicon 2 provides a data browser that facilitates quality control and provides convenient access to common data visualization tasks. Analysis of a benchmark dataset demonstrates superior performance of Hexicon 2 compared with its predecessor in terms of deuteration centroid recovery and deuteration distribution estimation. Hexicon 2 greatly reduces data analysis time compared with manual analysis, whereas the increased number of peptides provides redundant coverage of the entire protein sequence. Hexicon 2 is a standalone application available free of charge under http://hx2.mpimf-heidelberg.mpg.de.
An automated benchmarking platform for MHC class II binding prediction methods.
Andreatta, Massimo; Trolle, Thomas; Yan, Zhen; Greenbaum, Jason A; Peters, Bjoern; Nielsen, Morten
2018-05-01
Computational methods for the prediction of peptide-MHC binding have become an integral and essential component for candidate selection in experimental T cell epitope discovery studies. The sheer number of published prediction methods, and the often discordant reports on their performance, pose a considerable quandary for the experimentalist who needs to choose the best tool for their research. With the goal of providing an unbiased, transparent evaluation of the state of the art in the field, we created an automated platform to benchmark peptide-MHC class II binding prediction tools. The platform evaluates the absolute and relative predictive performance of all participating tools on data newly entered into the Immune Epitope Database (IEDB) before they are made public, thereby providing a frequent, unbiased assessment of available prediction tools. The benchmark runs on a weekly basis, is fully automated, and displays up-to-date results on a publicly accessible website. The initial benchmark described here included six commonly used prediction servers, but other tools are encouraged to join with a simple sign-up procedure. Performance evaluation on 59 data sets composed of over 10 000 binding affinity measurements suggested that NetMHCIIpan is currently the most accurate tool, followed by NN-align and the IEDB consensus method. Weekly reports on the participating methods can be found online at: http://tools.iedb.org/auto_bench/mhcii/weekly/. Supplementary data are available at Bioinformatics online.
Developing a molecular dynamics force field for both folded and disordered protein states.
Robustelli, Paul; Piana, Stefano; Shaw, David E
2018-05-07
Molecular dynamics (MD) simulation is a valuable tool for characterizing the structural dynamics of folded proteins and should be similarly applicable to disordered proteins and proteins with both folded and disordered regions. It has been unclear, however, whether any physical model (force field) used in MD simulations accurately describes both folded and disordered proteins. Here, we select a benchmark set of 21 systems, including folded and disordered proteins, simulate these systems with six state-of-the-art force fields, and compare the results to over 9,000 available experimental data points. We find that none of the tested force fields simultaneously provided accurate descriptions of folded proteins, of the dimensions of disordered proteins, and of the secondary structure propensities of disordered proteins. Guided by simulation results on a subset of our benchmark, however, we modified parameters of one force field, achieving excellent agreement with experiment for disordered proteins, while maintaining state-of-the-art accuracy for folded proteins. The resulting force field, a99SB-disp, should thus greatly expand the range of biological systems amenable to MD simulation. A similar approach could be taken to improve other force fields. Copyright © 2018 the Author(s). Published by PNAS.
Structure based alignment and clustering of proteins (STRALCP)
Zemla, Adam T.; Zhou, Carol E.; Smith, Jason R.; Lam, Marisa W.
2013-06-18
Disclosed are computational methods of clustering a set of protein structures based on local and pair-wise global similarity values. Pair-wise local and global similarity values are generated based on pair-wise structural alignments for each protein in the set of protein structures. Initially, the protein structures are clustered based on pair-wise local similarity values. The protein structures are then clustered based on pair-wise global similarity values. For each given cluster both a representative structure and spans of conserved residues are identified. The representative protein structure is used to assign newly-solved protein structures to a group. The spans are used to characterize conservation and assign a "structural footprint" to the cluster.
Efficient pairwise RNA structure prediction using probabilistic alignment constraints in Dynalign
2007-01-01
Background Joint alignment and secondary structure prediction of two RNA sequences can significantly improve the accuracy of the structural predictions. Methods addressing this problem, however, are forced to employ constraints that reduce computation by restricting the alignments and/or structures (i.e. folds) that are permissible. In this paper, a new methodology is presented for establishing alignment constraints based on nucleotide alignment and insertion posterior probabilities. Using a hidden Markov model, posterior probabilities of alignment and insertion are computed for all possible pairings of nucleotide positions from the two sequences. These alignment and insertion posterior probabilities are additively combined to obtain probabilities of co-incidence for nucleotide position pairs. A suitable alignment constraint is obtained by thresholding the co-incidence probabilities. The constraint is integrated with Dynalign, a free energy minimization algorithm for joint alignment and secondary structure prediction. The resulting method is benchmarked against the previous version of Dynalign and against other programs for pairwise RNA structure prediction. Results The proposed technique eliminates manual parameter selection in Dynalign and provides significant computational time savings in comparison to prior constraints in Dynalign while simultaneously providing a small improvement in the structural prediction accuracy. Savings are also realized in memory. In experiments over a 5S RNA dataset with average sequence length of approximately 120 nucleotides, the method reduces computation by a factor of 2. The method performs favorably in comparison to other programs for pairwise RNA structure prediction, yielding better accuracy, on average, and requiring significantly fewer computational resources.
Conclusion Probabilistic analysis can be utilized in order to automate the determination of alignment constraints for pairwise RNA structure prediction methods in a principled fashion. These constraints can reduce the computational and memory requirements of these methods while maintaining or improving their accuracy of structural prediction. This extends the practical reach of these methods to longer length sequences. The revised Dynalign code is freely available for download. PMID:17445273
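The constraint construction described in the Dynalign abstract above (posteriors added, then thresholded) can be sketched as follows. This is only an illustration under stated assumptions: the three posterior matrices are taken as given inputs (the hidden Markov model that would produce them is not shown), and the function name, argument names, and the threshold value in the example are hypothetical rather than taken from the Dynalign code.

```python
def coincidence_constraint(p_align, p_ins_seq1, p_ins_seq2, threshold):
    """Return a boolean matrix of nucleotide position pairs that stay permissible.

    p_align[i][j]    : posterior that position i (sequence 1) aligns to j (sequence 2)
    p_ins_seq1[i][j] : posterior of an insertion in sequence 1 at pair (i, j)
    p_ins_seq2[i][j] : posterior of an insertion in sequence 2 at pair (i, j)
    The three posteriors are summed into a co-incidence probability, and a pair
    is kept only if that probability reaches the threshold.
    """
    n, m = len(p_align), len(p_align[0])
    allowed = []
    for i in range(n):
        row = []
        for j in range(m):
            coincidence = p_align[i][j] + p_ins_seq1[i][j] + p_ins_seq2[i][j]
            row.append(coincidence >= threshold)
        allowed.append(row)
    return allowed


# Hypothetical 2 x 2 posteriors, chosen only to show the thresholding step.
allowed = coincidence_constraint(
    p_align=[[0.9, 0.0], [0.05, 0.7]],
    p_ins_seq1=[[0.05, 0.01], [0.0, 0.2]],
    p_ins_seq2=[[0.0, 0.0], [0.0, 0.05]],
    threshold=0.1,
)
```

A downstream dynamic programming routine would then skip every pair marked `False`, which is where the savings in time and memory would come from.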
A Stochastic Evolutionary Model for Protein Structure Alignment and Phylogeny
Challis, Christopher J.; Schmidler, Scott C.
2012-01-01
We present a stochastic process model for the joint evolution of protein primary and tertiary structure, suitable for use in alignment and estimation of phylogeny. Indels arise from a classic Links model, and mutations follow a standard substitution matrix, whereas backbone atoms diffuse in three-dimensional space according to an Ornstein–Uhlenbeck process. The model allows for simultaneous estimation of evolutionary distances, indel rates, structural drift rates, and alignments, while fully accounting for uncertainty. The inclusion of structural information enables phylogenetic inference on time scales not previously attainable with sequence evolution models. The model also provides a tool for testing evolutionary hypotheses and improving our understanding of protein structural evolution. PMID:22723302
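For readers unfamiliar with the Ornstein–Uhlenbeck process mentioned above, the sketch below uses the standard closed-form transition distribution of a one-dimensional OU process, applied independently to each coordinate of an atom. The independence across coordinates, the parameter names `theta` and `sigma`, and the zero long-run mean are assumptions made for illustration and may differ from the parametrization used by the authors.

```python
import math
import random


def ou_transition(x0, t, theta, sigma, mu=0.0):
    """Mean and variance of an OU process at time t, started at x0.

    For dX = -theta * (X - mu) dt + sigma dW the transition is Gaussian with
    mean     mu + (x0 - mu) * exp(-theta * t)
    variance sigma^2 / (2 * theta) * (1 - exp(-2 * theta * t)).
    """
    decay = math.exp(-theta * t)
    mean = mu + (x0 - mu) * decay
    var = sigma ** 2 / (2.0 * theta) * (1.0 - decay ** 2)
    return mean, var


def sample_atom(position, t, theta, sigma, rng):
    """Draw a new 3D position by sampling each coordinate independently."""
    new_position = []
    for x in position:
        mean, var = ou_transition(x, t, theta, sigma)
        new_position.append(rng.gauss(mean, math.sqrt(var)))
    return tuple(new_position)


# Hypothetical usage: drift one atom over an evolutionary distance of 0.5.
rng = random.Random(0)
drifted = sample_atom((1.0, 2.0, 3.0), t=0.5, theta=1.0, sigma=1.0, rng=rng)
```

The variance saturates at `sigma**2 / (2 * theta)` for large `t`, which is what distinguishes this kind of structural drift from unbounded Brownian motion.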
A novel approach to identifying regulatory motifs in distantly related genomes
Van Hellemont, Ruth; Monsieurs, Pieter; Thijs, Gert; De Moor, Bart; Van de Peer, Yves; Marchal, Kathleen
2005-01-01
Although proven successful in the identification of regulatory motifs, phylogenetic footprinting methods still show some shortcomings. To assess these difficulties, most apparent when applying phylogenetic footprinting to distantly related organisms, we developed a two-step procedure that combines the advantages of sequence alignment and motif detection approaches. The results on well-studied benchmark datasets indicate that the presented method outperforms other methods when the sequences become either too long or too heterogeneous in size. PMID:16420672
Zemla, Adam T; Lang, Dorothy M; Kostova, Tanya; Andino, Raul; Ecale Zhou, Carol L
2011-06-02
Most of the currently used methods for protein function prediction rely on sequence-based comparisons between a query protein and those for which a functional annotation is provided. A serious limitation of sequence similarity-based approaches for identifying residue conservation among proteins is the low confidence in assigning residue-residue correspondences among proteins when the level of sequence identity between the compared proteins is poor. Multiple sequence alignment methods are more satisfactory--still, they cannot provide reliable results at low levels of sequence identity. Our goal in the current work was to develop an algorithm that could help overcome these difficulties by facilitating the identification of structurally (and possibly functionally) relevant residue-residue correspondences between compared protein structures. Here we present StralSV (structure-alignment sequence variability), a new algorithm for detecting closely related structure fragments and quantifying residue frequency from tight local structure alignments. We apply StralSV in a study of the RNA-dependent RNA polymerase of poliovirus, and we demonstrate that the algorithm can be used to determine regions of the protein that are relatively unique, or that share structural similarity with proteins that would be considered distantly related. By quantifying residue frequencies among many residue-residue pairs extracted from local structural alignments, one can infer potential structural or functional importance of specific residues that are determined to be highly conserved or that deviate from a consensus. We further demonstrate that considerable detailed structural and phylogenetic information can be derived from StralSV analyses. StralSV is a new structure-based algorithm for identifying and aligning structure fragments that have similarity to a reference protein. 
StralSV analysis can be used to quantify residue-residue correspondences and identify residues that may be of particular structural or functional importance, as well as unusual or unexpected residues at a given sequence position. StralSV is provided as a web service at http://proteinmodel.org/AS2TS/STRALSV/.
Structural re-alignment in an immunologic surface region of ricin A chain
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zemla, A T; Zhou, C E
2007-07-24
We compared structure alignments generated by several protein structure comparison programs to determine whether existing methods would satisfactorily align residues at a highly conserved position within an immunogenic loop in ribosome inactivating proteins (RIPs). Using default settings, structure alignments generated by several programs (CE, DaliLite, FATCAT, LGA, MAMMOTH, MATRAS, SHEBA, SSM) failed to align the respective conserved residues, although LGA reported correct residue-residue (R-R) correspondences when the beta-carbon (Cb) position was used as the point of reference in the alignment calculations. Further tests using variable points of reference indicated that points distal from the beta carbon along a vector connecting the alpha and beta carbons yielded rigid structural alignments in which residues known to be highly conserved in RIPs were reported as corresponding residues in structural comparisons between ricin A chain, abrin-A, and other RIPs. Results suggest that approaches to structure alignment employing alternate point representations corresponding to side chain position may yield structure alignments that are more consistent with observed conservation of functional surface residues than do standard alignment programs, which apply uniform criteria for alignment (i.e., alpha carbon (Ca) as point of reference) along the entirety of the peptide chain. We present the results of tests that suggest the utility of allowing user-specified points of reference in generating alternate structural alignments, and we present a web server for automatically generating such alignments.
Efficient alignment-free DNA barcode analytics.
Kuksa, Pavel; Pavlovic, Vladimir
2009-11-10
In this work we consider barcode DNA analysis problems and address them using alternative, alignment-free methods and representations which model sequences as collections of short sequence fragments (features). The methods use fixed-length representations (spectrum) for barcode sequences to measure similarities or dissimilarities between sequences coming from the same or different species. The spectrum-based representation not only allows for accurate and computationally efficient species classification, but also opens the possibility of accurate clustering analysis of putative species barcodes and identification of critical within-barcode loci distinguishing barcodes of different sample groups. New alignment-free methods provide highly accurate and fast DNA barcode-based identification and classification of species with substantial improvements in accuracy and speed over state-of-the-art barcode analysis methods. We evaluate our methods on problems of species classification and identification using barcodes, important and relevant analytical tasks in many practical applications (adverse species movement monitoring, sampling surveys for unknown or pathogenic species identification, biodiversity assessment, etc.). On several benchmark barcode datasets, including ACG, Astraptes, Hesperiidae, Fish larvae, and Birds of North America, the proposed alignment-free methods considerably improve prediction accuracy compared to prior results. We also observe significant running time improvements over the state-of-the-art methods. Our results show that newly developed alignment-free methods for DNA barcoding can efficiently and with high accuracy identify specimens by examining only a few barcode features, resulting in increased scalability and interpretability of current computational approaches to barcoding.
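A spectrum representation of the kind described above can be sketched as a table of k-mer counts, with similarity measured by an inner product of two such tables. The sketch below is a generic k-mer spectrum comparison, not the authors' implementation; the choice `k=3`, the cosine normalisation, and the function names are assumptions for illustration.

```python
import math
from collections import Counter


def spectrum(seq, k=3):
    """Count every overlapping length-k fragment (k-mer) in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k=3):
    """Inner product of the k-mer count vectors of two sequences."""
    sa, sb = spectrum(a, k), spectrum(b, k)
    return sum(count * sb[kmer] for kmer, count in sa.items() if kmer in sb)


def cosine_similarity(a, b, k=3):
    """Length-normalised spectrum similarity in the range [0, 1]."""
    denom = math.sqrt(spectrum_kernel(a, a, k) * spectrum_kernel(b, b, k))
    return spectrum_kernel(a, b, k) / denom if denom else 0.0


# Hypothetical barcode fragments, not taken from any benchmark dataset.
similarity = cosine_similarity("ACGTACGTTG", "ACGTACGATG")
```

Because no alignment is computed, the cost is linear in sequence length, and the k-mers with the largest weight in a trained classifier could point to the distinguishing features the abstract mentions.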
Towards accurate modeling of noncovalent interactions for protein rigidity analysis.
Fox, Naomi; Streinu, Ileana
2013-01-01
Protein rigidity analysis is an efficient computational method for extracting flexibility information from static, X-ray crystallography protein data. Atoms and bonds are modeled as a mechanical structure and analyzed with a fast graph-based algorithm, producing a decomposition of the flexible molecule into interconnected rigid clusters. The result depends critically on noncovalent atomic interactions, primarily on how hydrogen bonds and hydrophobic interactions are computed and modeled. Ongoing research points to the stringent need for benchmarking rigidity analysis software systems, towards the goal of increasing their accuracy and validating their results, both against each other and against biologically relevant (functional) parameters. We propose two new methods for modeling hydrogen bonds and hydrophobic interactions that more accurately reflect a mechanical model, without being computationally more intensive. We evaluate them using a novel scoring method, based on the B-cubed score from the information retrieval literature, which measures how well two cluster decompositions match. To evaluate the modeling accuracy of KINARI, our pebble-game rigidity analysis system, we use a benchmark data set of 20 proteins, each with multiple distinct conformations deposited in the Protein Data Bank. Cluster decompositions for them were previously determined with the RigidFinder method from Gerstein's lab and validated against experimental data. When KINARI's default tuning parameters are used, an improvement of the B-cubed score over a crude baseline is observed in 30% of this data. With our new modeling options, improvements were observed in over 70% of the proteins in this data set. We investigate the sensitivity of the cluster decomposition score with case studies on pyruvate phosphate dikinase and calmodulin.
To substantially improve the accuracy of protein rigidity analysis systems, thorough benchmarking must be performed on all current systems and future extensions. We have measured the gain in performance by comparing different modeling methods for noncovalent interactions. We showed that new criteria for modeling hydrogen bonds and hydrophobic interactions can significantly improve the results. The two new methods proposed here have been implemented and made publicly available in the current version of KINARI (v1.3), together with the benchmarking tools, which can be downloaded from our software's website, http://kinari.cs.umass.edu.
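The abstract above scores how well two cluster decompositions match using a method based on B-cubed. The sketch below implements the standard B-cubed precision, recall, and F-score for two labelings of the same items; it is not the KINARI scoring code, and any adaptation the authors made to the standard definition is not reflected here. The atom names and cluster labels are hypothetical.

```python
def b_cubed(predicted, reference):
    """Standard B-cubed precision, recall and F-score for two clusterings.

    predicted, reference: dicts mapping each item (e.g. an atom id) to a
    cluster label. Both dicts must cover the same items.
    """
    items = list(predicted)
    pred_clusters, ref_clusters = {}, {}
    for item in items:
        pred_clusters.setdefault(predicted[item], set()).add(item)
        ref_clusters.setdefault(reference[item], set()).add(item)

    precision = recall = 0.0
    for item in items:
        p = pred_clusters[predicted[item]]
        r = ref_clusters[reference[item]]
        overlap = len(p & r)  # always >= 1 because the item itself is in both
        precision += overlap / len(p)
        recall += overlap / len(r)
    precision /= len(items)
    recall /= len(items)
    f_score = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical decomposition: one predicted rigid cluster versus two reference clusters.
predicted = {"a1": 0, "a2": 0, "a3": 0, "a4": 0}
reference = {"a1": "A", "a2": "A", "a3": "B", "a4": "B"}
precision, recall, f_score = b_cubed(predicted, reference)
```

Merging everything into one rigid cluster keeps recall at 1.0 but halves precision in this example, which is why a per-item score of this kind penalises both over-rigid and over-flexible decompositions.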
Towards accurate modeling of noncovalent interactions for protein rigidity analysis
2013-01-01
Background Protein rigidity analysis is an efficient computational method for extracting flexibility information from static, X-ray crystallography protein data. Atoms and bonds are modeled as a mechanical structure and analyzed with a fast graph-based algorithm, producing a decomposition of the flexible molecule into interconnected rigid clusters. The result depends critically on noncovalent atomic interactions, primarily on how hydrogen bonds and hydrophobic interactions are computed and modeled. Ongoing research points to the stringent need for benchmarking rigidity analysis software systems, towards the goal of increasing their accuracy and validating their results, both against each other and against biologically relevant (functional) parameters. We propose two new methods for modeling hydrogen bonds and hydrophobic interactions that more accurately reflect a mechanical model, without being computationally more intensive. We evaluate them using a novel scoring method, based on the B-cubed score from the information retrieval literature, which measures how well two cluster decompositions match. Results To evaluate the modeling accuracy of KINARI, our pebble-game rigidity analysis system, we use a benchmark data set of 20 proteins, each with multiple distinct conformations deposited in the Protein Data Bank. Cluster decompositions for them were previously determined with the RigidFinder method from Gerstein's lab and validated against experimental data. When KINARI's default tuning parameters are used, an improvement of the B-cubed score over a crude baseline is observed in 30% of this data. With our new modeling options, improvements were observed in over 70% of the proteins in this data set. We investigate the sensitivity of the cluster decomposition score with case studies on pyruvate phosphate dikinase and calmodulin.
Conclusion To substantially improve the accuracy of protein rigidity analysis systems, thorough benchmarking must be performed on all current systems and future extensions. We have measured the gain in performance by comparing different modeling methods for noncovalent interactions. We showed that new criteria for modeling hydrogen bonds and hydrophobic interactions can significantly improve the results. The two new methods proposed here have been implemented and made publicly available in the current version of KINARI (v1.3), together with the benchmarking tools, which can be downloaded from our software's website, http://kinari.cs.umass.edu. PMID:24564209
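The abstract above scores how well two cluster decompositions match using a measure based on B-cubed. As a rough illustration only, the sketch below implements the standard B-cubed precision, recall and F1 from the information retrieval literature; it is not KINARI's scoring code, and the function name `bcubed` and the dictionary-based input format are assumptions made for this example.

```python
from collections import defaultdict


def bcubed(predicted, reference):
    """Return (precision, recall, F1) of the standard B-cubed score.

    predicted, reference: dicts mapping each item (for example an atom id)
    to a cluster label. Both dicts must label exactly the same items.
    """
    if not predicted or set(predicted) != set(reference):
        raise ValueError("both decompositions must label the same non-empty item set")

    # Group the items by cluster label in each decomposition.
    pred_members = defaultdict(set)
    ref_members = defaultdict(set)
    for item, label in predicted.items():
        pred_members[label].add(item)
    for item, label in reference.items():
        ref_members[label].add(item)

    precision = 0.0
    recall = 0.0
    for item in predicted:
        p = pred_members[predicted[item]]   # predicted cluster containing the item
        r = ref_members[reference[item]]    # reference cluster containing the item
        overlap = len(p & r)                # always >= 1: the item itself
        precision += overlap / len(p)
        recall += overlap / len(r)

    n = len(predicted)
    precision /= n
    recall /= n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    pred = {"a": 1, "b": 1, "c": 2, "d": 2}
    ref = {"a": "x", "b": "x", "c": "x", "d": "y"}
    print(bcubed(pred, ref))
```

Per-item averaging means that a large rigid cluster split into many small pieces lowers recall, while merged clusters lower precision.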
RAMTaB: Robust Alignment of Multi-Tag Bioimages
Raza, Shan-e-Ahmed; Humayun, Ahmad; Abouna, Sylvie; Nattkemper, Tim W.; Epstein, David B. A.; Khan, Michael; Rajpoot, Nasir M.
2012-01-01
Background In recent years, new microscopic imaging techniques have evolved to allow us to visualize several different proteins (or other biomolecules) in a visual field. Analysis of protein co-localization becomes viable because molecules can interact only when they are located close to each other. We present a novel approach to align images in a multi-tag fluorescence image stack. The proposed approach is applicable to multi-tag bioimaging systems which (a) acquire fluorescence images by sequential staining and (b) simultaneously capture a phase contrast image corresponding to each of the fluorescence images. To the best of our knowledge, there is no existing method in the literature, which addresses simultaneous registration of multi-tag bioimages and selection of the reference image in order to maximize the overall overlap between the images. Methodology/Principal Findings We employ a block-based method for registration, which yields a confidence measure to indicate the accuracy of our registration results. We derive a shift metric in order to select the Reference Image with Maximal Overlap (RIMO), in turn minimizing the total amount of non-overlapping signal for a given number of tags. Experimental results show that the Robust Alignment of Multi-Tag Bioimages (RAMTaB) framework is robust to variations in contrast and illumination, yields sub-pixel accuracy, and successfully selects the reference image resulting in maximum overlap. The registration results are also shown to significantly improve any follow-up protein co-localization studies. Conclusions For the discovery of protein complexes and of functional protein networks within a cell, alignment of the tag images in a multi-tag fluorescence image stack is a key pre-processing step. The proposed framework is shown to produce accurate alignment results on both real and synthetic data. 
Our future work will use the aligned multi-channel fluorescence image data for normal and diseased tissue specimens to analyze molecular co-expression patterns and functional protein networks. PMID:22363510
EFICAz2: enzyme function inference by a combined approach enhanced by machine learning.
Arakaki, Adrian K; Huang, Ying; Skolnick, Jeffrey
2009-04-13
We previously developed EFICAz, an enzyme function inference approach that combines predictions from non-completely overlapping component methods. Two of the four components in the original EFICAz are based on the detection of functionally discriminating residues (FDRs). FDRs distinguish between members of an enzyme family that are homofunctional (classified under the EC number of interest) or heterofunctional (annotated with another EC number or lacking enzymatic activity). Each of the two FDR-based components is associated with one of two specific kinds of enzyme families. EFICAz exhibits high precision performance, except when the maximal test to training sequence identity (MTTSI) is lower than 30%. To improve EFICAz's performance in this regime, we: i) increased the number of predictive components and ii) took advantage of consensual information from the different components to make the final EC number assignment. We have developed two new EFICAz components, analogous to the two FDR-based components, where the discrimination between homo- and heterofunctional members is based on the evaluation, via Support Vector Machine models, of all the aligned positions between the query sequence and the multiple sequence alignments associated with the enzyme families. Benchmark results indicate that: i) the new SVM-based components outperform their FDR-based counterparts, and ii) both SVM-based and FDR-based components generate unique predictions. We developed classification tree models to optimally combine the results from the six EFICAz components into a final EC number prediction. The new implementation of our approach, EFICAz2, exhibits a highly improved prediction precision at MTTSI < 30% compared to the original EFICAz, with only a slight decrease in prediction recall.
A comparative analysis of enzyme function annotation of the human proteome by EFICAz2 and KEGG shows that: i) when both sources make EC number assignments for the same protein sequence, the assignments tend to be consistent and ii) EFICAz2 generates considerably more unique assignments than KEGG. Performance benchmarks and the comparison with KEGG demonstrate that EFICAz2 is a powerful and precise tool for enzyme function annotation, with multiple applications in genome analysis and metabolic pathway reconstruction. The EFICAz2 web service is available at: http://cssb.biology.gatech.edu/skolnick/webservice/EFICAz2/index.html.
Ahadian, Samad; Ramón-Azcón, Javier; Estili, Mehdi; Liang, Xiaobin; Ostrovidov, Serge; Shiku, Hitoshi; Ramalingam, Murugan; Nakajima, Ken; Sakka, Yoshio; Bae, Hojae; Matsue, Tomokazu; Khademhosseini, Ali
2014-03-19
Biological scaffolds with tunable electrical and mechanical properties are of great interest in many different fields, such as regenerative medicine, biorobotics, and biosensing. In this study, dielectrophoresis (DEP) was used to vertically align carbon nanotubes (CNTs) within methacrylated gelatin (GelMA) hydrogels in a robust, simple, and rapid manner. GelMA-aligned CNT hydrogels showed anisotropic electrical conductivity and superior mechanical properties compared with pristine GelMA hydrogels and GelMA hydrogels containing randomly distributed CNTs. Skeletal muscle cells grown on vertically aligned CNTs in GelMA hydrogels yielded a higher number of functional myofibers than cells that were cultured on hydrogels with randomly distributed CNTs and horizontally aligned CNTs, as confirmed by the expression of myogenic genes and proteins. In addition, the myogenic gene and protein expression increased more profoundly after applying electrical stimulation along the direction of the aligned CNTs due to the anisotropic conductivity of the hybrid GelMA-vertically aligned CNT hydrogels. We believe that this platform could attract great attention in other biomedical applications, such as biosensing, bioelectronics, and creating functional biomedical devices.
Ahadian, Samad; Ramón-Azcón, Javier; Estili, Mehdi; Liang, Xiaobin; Ostrovidov, Serge; Shiku, Hitoshi; Ramalingam, Murugan; Nakajima, Ken; Sakka, Yoshio; Bae, Hojae; Matsue, Tomokazu; Khademhosseini, Ali
2014-01-01
Biological scaffolds with tunable electrical and mechanical properties are of great interest in many different fields, such as regenerative medicine, biorobotics, and biosensing. In this study, dielectrophoresis (DEP) was used to vertically align carbon nanotubes (CNTs) within methacrylated gelatin (GelMA) hydrogels in a robust, simple, and rapid manner. GelMA-aligned CNT hydrogels showed anisotropic electrical conductivity and superior mechanical properties compared with pristine GelMA hydrogels and GelMA hydrogels containing randomly distributed CNTs. Skeletal muscle cells grown on vertically aligned CNTs in GelMA hydrogels yielded a higher number of functional myofibers than cells that were cultured on hydrogels with randomly distributed CNTs and horizontally aligned CNTs, as confirmed by the expression of myogenic genes and proteins. In addition, the myogenic gene and protein expression increased more profoundly after applying electrical stimulation along the direction of the aligned CNTs due to the anisotropic conductivity of the hybrid GelMA-vertically aligned CNT hydrogels. We believe that this platform could attract great attention in other biomedical applications, such as biosensing, bioelectronics, and creating functional biomedical devices. PMID:24642903
NASA Astrophysics Data System (ADS)
Aldrin, John C.; Hopkins, Deborah; Datuin, Marvin; Warchol, Mark; Warchol, Lyudmila; Forsyth, David S.; Buynak, Charlie; Lindgren, Eric A.
2017-02-01
For model benchmark studies, the accuracy of the model is typically evaluated based on the change in response relative to a selected reference signal. The use of a side drilled hole (SDH) in a plate was investigated as a reference signal for angled beam shear wave inspection for aircraft structure inspections of fastener sites. Systematic studies were performed with varying SDH depth and size, and varying the ultrasonic probe frequency, focal depth, and probe height. Increased error was observed with the simulation of angled shear wave beams in the near-field. Even more significant, asymmetry in real probes and the inherent sensitivity of signals in the near-field to subtle test conditions were found to provide a greater challenge with achieving model agreement. To achieve quality model benchmark results for this problem, it is critical to carefully align the probe with the part geometry, to verify symmetry in probe response, and ideally avoid using reference signals from the near-field response. Suggested reference signals for angled beam shear wave inspections include using the 'through hole' corner specular reflection signal and the 'full skip' signal off of the far wall from the side drilled hole.
Modified Mahalanobis Taguchi System for Imbalance Data Classification
2017-01-01
The Mahalanobis Taguchi System (MTS) is considered one of the most promising binary classification algorithms for handling imbalanced data. Unfortunately, MTS lacks a method for determining an efficient threshold for the binary classification. In this paper, a nonlinear optimization model, named the Modified Mahalanobis Taguchi System (MMTS), is formulated based on minimizing the distance between the MTS Receiver Operating Characteristic (ROC) curve and the theoretical optimal point. To validate the MMTS classification efficacy, it has been benchmarked against Support Vector Machines (SVMs), Naive Bayes (NB), Probabilistic Mahalanobis Taguchi Systems (PTM), Synthetic Minority Oversampling Technique (SMOTE), Adaptive Conformal Transformation (ACT), Kernel Boundary Alignment (KBA), Hidden Naive Bayes (HNB), and other improved Naive Bayes algorithms. MMTS outperforms the benchmarked algorithms, especially when the imbalance ratio is greater than 400. A real-life case study from the manufacturing sector is used to demonstrate the applicability of the proposed model and to compare its performance with the Mahalanobis Genetic Algorithm (MGA). PMID:28811820
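The threshold criterion described above, choosing the operating point whose ROC coordinates lie closest to the ideal corner (false positive rate 0, true positive rate 1), can be illustrated with a small sketch. This is not the nonlinear optimization model from the paper: it swaps in an exhaustive search over the observed score values, and the function names and the convention that larger scores indicate the positive class are assumptions made for this example.

```python
import math


def roc_point(scores, labels, threshold):
    """Return (FPR, TPR) when items with score >= threshold are called positive."""
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        called_positive = score >= threshold
        if label and called_positive:
            tp += 1
        elif label:
            fn += 1
        elif called_positive:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("both classes must be present")
    return fp / (fp + tn), tp / (tp + fn)


def closest_to_ideal_threshold(scores, labels):
    """Scan observed scores; return (threshold, distance) nearest to ROC point (0, 1)."""
    best = None
    for t in sorted(set(scores)):
        fpr, tpr = roc_point(scores, labels, t)
        distance = math.hypot(fpr, 1.0 - tpr)  # Euclidean distance to (0, 1)
        if best is None or distance < best[1]:
            best = (t, distance)
    return best


if __name__ == "__main__":
    # Illustrative scores (e.g. Mahalanobis distances) and labels, not data from the paper.
    scores = [0.5, 0.8, 3.0, 1.1, 2.9, 4.2]
    labels = [0, 0, 0, 1, 1, 1]
    print(closest_to_ideal_threshold(scores, labels))
```

Because the distance weighs false positive rate and missed positives equally, the selected threshold does not depend directly on how rare the positive class is, which is one plausible reason such a criterion suits imbalanced data.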
Tuppurainen, Kari; Viisas, Marja; Laatikainen, Reino; Peräkylä, Mikael
2002-01-01
A novel electronic eigenvalue (EEVA) descriptor of molecular structure for use in the derivation of predictive QSAR/QSPR models is described. Like other spectroscopic QSAR/QSPR descriptors, EEVA is also invariant as to the alignment of the structures concerned. Its performance was tested with respect to the CBG (corticosteroid binding globulin) affinity of 31 benchmark steroids. It appeared that the electronic structure of the steroids, i.e., the "spectra" derived from molecular orbital energies, is directly related to the CBG binding affinities. The predictive ability of EEVA is compared to other QSAR approaches, and its performance is discussed in the context of the Hammett equation. The good performance of EEVA is an indication of the essential quantum mechanical nature of QSAR. The EEVA method is a supplement to conventional 3D QSAR methods, which employ fields or surface properties derived from Coulombic and van der Waals interactions.
HMM-ModE: implementation, benchmarking and validation with HMMER3
2014-01-01
Background HMM-ModE is a computational method that generates family specific profile HMMs using negative training sequences. The method optimizes the discrimination threshold using 10-fold cross validation and modifies the emission probabilities of profiles to reduce common fold based signals shared with other sub-families. The protocol depends on the program HMMER for HMM profile building and sequence database searching. The recent release of HMMER3 has improved database search speed by several orders of magnitude, allowing for the large scale deployment of the method in sequence annotation projects. We have rewritten our existing scripts both at the level of parsing the HMM profiles and modifying emission probabilities to upgrade HMM-ModE using HMMER3 that takes advantage of its probabilistic inference with high computational speed. The method is benchmarked and tested on a GPCR dataset as an accurate and fast method for functional annotation. Results The implementation of this method, which now works with HMMER3, is benchmarked with the earlier version of HMMER, to show that the effect of local-local alignments is marked only in the case of profiles containing a large number of discontinuous match states. The method is tested on a gold standard set of families and we have reported a significant reduction in the number of false positive hits over the default HMM profiles. When implemented on GPCR sequences, the results showed an improvement in the accuracy of classification compared with other methods used to classify the family at different levels of their classification hierarchy. Conclusions The present findings show that the new version of HMM-ModE is a highly specific method used to differentiate between fold (superfamily) and function (family) specific signals, which helps in the functional annotation of protein sequences.
The use of modified profile HMMs of GPCR sequences provides a simple yet highly specific method for classification of the family, being able to predict the sub-family specific sequences with high accuracy even though sequences share common physicochemical characteristics between sub-families. PMID:25073805
Li, Yushuang; Yang, Jiasheng; Zhang, Yi
2016-01-01
In this paper, we have proposed a novel alignment-free method for comparing the similarity of protein sequences. We first encode a protein sequence into a 440-dimensional feature vector consisting of a 400-dimensional Pseudo-Markov transition probability vector among the 20 amino acids, a 20-dimensional content ratio vector, and a 20-dimensional position ratio vector of the amino acids in the sequence. By evaluating the Euclidean distances among the representing vectors, we compare the similarity of protein sequences. We then apply this method to the ND5 dataset consisting of the ND5 protein sequences of 9 species, and to the F10 and G11 datasets representing two of the xylanase-containing glycoside hydrolase families, i.e., families 10 and 11. As a result, our method achieves a correlation coefficient of 0.962 with the canonical protein sequence aligner ClustalW in the ND5 dataset, much higher than those of 5 other popular alignment-free methods. In addition, we successfully separate the xylanase sequences in the F10 family and the G11 family and illustrate that the F10 family is more heat stable than the G11 family, consistent with a few previous studies. Moreover, we prove mathematically an identity equation involving the Pseudo-Markov transition probability vector and the amino acid content ratio vector. PMID:27918587
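The 400 + 20 + 20 layout of the feature vector described above can be sketched as follows. The abstract does not give the exact formulas, so all three definitions used here are placeholder assumptions chosen only to produce vectors of the right shape: adjacent-pair frequencies stand in for the Pseudo-Markov transition probabilities, residue frequency for the content ratio, and the share of summed 1-based positions for the position ratio. The names `feature_vector` and `euclidean` are likewise hypothetical.

```python
import math
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def feature_vector(seq):
    """Encode a protein sequence as a 440-dimensional list of floats.

    Assumed (not from the paper) definitions:
      transitions[ab] = count of adjacent pair ab / (n - 1)
      content[a]      = count of a / n
      position[a]     = sum of 1-based positions of a / (n * (n + 1) / 2)
    """
    seq = seq.upper()
    n = len(seq)
    if n < 2 or any(ch not in AMINO_ACIDS for ch in seq):
        raise ValueError("need at least two standard amino acid letters")

    pair_counts = dict.fromkeys(PAIRS, 0)
    for a, b in zip(seq, seq[1:]):
        pair_counts[a + b] += 1
    transitions = [pair_counts[p] / (n - 1) for p in PAIRS]

    content = [seq.count(a) / n for a in AMINO_ACIDS]

    total_positions = n * (n + 1) / 2
    position = [
        sum(i for i, ch in enumerate(seq, start=1) if ch == a) / total_positions
        for a in AMINO_ACIDS
    ]
    return transitions + content + position


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.dist(u, v)


if __name__ == "__main__":
    v1 = feature_vector("ACDA")
    v2 = feature_vector("ACDC")
    print(len(v1), euclidean(v1, v2))
```

Every sequence maps to a vector of the same fixed length, so sequences of different lengths can be compared directly without an alignment step.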
DOE Office of Scientific and Technical Information (OSTI.GOV)
Valdes, Haydee; Pluhackova, Kristyna; Hobza, Pavel
The performance of a wide range of quantum chemical calculations for the ab initio study of realistic model systems of aromatic-aromatic side chain interactions in proteins (in particular those π-π interactions occurring between adjacent residues along the protein sequence) is here assessed on the phenylalanyl-glycyl-phenylalanine (FGF) tripeptide. Energies and geometries obtained at different levels of theory are compared with CCSD(T)/CBS benchmark energies and RI-MP2/cc-pVTZ benchmark geometries, respectively. Consequently, a protocol of calculation alternative to the very expensive CCSD(T)/CBS is proposed. In addition to this, the preferred orientation of the Phe aromatic side chains is discussed and compared with previous results on the topic.
Mote, Kaustubh R; Gopinath, T; Traaseth, Nathaniel J; Kitchen, Jason; Gor'kov, Peter L; Brey, William W; Veglia, Gianluigi
2011-11-01
Oriented solid-state NMR is the most direct methodology to obtain the orientation of membrane proteins with respect to the lipid bilayer. The method consists of measuring (1)H-(15)N dipolar couplings (DC) and (15)N anisotropic chemical shifts (CSA) for membrane proteins that are uniformly aligned with respect to the membrane bilayer. A significant advantage of this approach is that tilt and azimuthal (rotational) angles of the protein domains can be directly derived from analytical expression of DC and CSA values, or, alternatively, obtained by refining protein structures using these values as harmonic restraints in simulated annealing calculations. The Achilles' heel of this approach is the lack of suitable experiments for sequential assignment of the amide resonances. In this Article, we present a new pulse sequence that integrates proton driven spin diffusion (PDSD) with sensitivity-enhanced PISEMA in a 3D experiment ([(1)H,(15)N]-SE-PISEMA-PDSD). The incorporation of 2D (15)N/(15)N spin diffusion experiments into this new 3D experiment leads to the complete and unambiguous assignment of the (15)N resonances. The feasibility of this approach is demonstrated for the membrane protein sarcolipin reconstituted in magnetically aligned lipid bicelles. Taken with low electric field probe technology, this approach will propel the determination of sequential assignment as well as structure and topology of larger integral membrane proteins in aligned lipid bilayers. © Springer Science+Business Media B.V. 2011
An alternative view of protein fold space.
Shindyalov, I N; Bourne, P E
2000-02-15
Comparing and subsequently classifying protein structure information has received significant attention concurrent with the increase in the number of experimentally derived 3-dimensional structures. Classification schemes have focused on biological function found within protein domains and on structure classification based on topology. Here an alternative view is presented that groups substructures. Substructures are long (50-150 residue) highly repetitive near-contiguous pieces of polypeptide chain that occur frequently in a set of proteins from the PDB defined as structurally non-redundant over the complete polypeptide chain. The substructure classification is based on a previously reported Combinatorial Extension (CE) algorithm that provides a significantly different set of structure alignments than those previously described, having, for example, only a 40% overlap with FSSP. Qualitatively the algorithm provides longer contiguous aligned segments at the price of a slightly higher root-mean-square deviation (rmsd). Clustering these alignments gives a discrete and highly repetitive set of substructures not detectable by sequence similarity alone. In some cases different substructures represent all or different parts of well known folds indicative of the Russian doll effect--the continuity of protein fold space. In other cases they fall into different structure and functional classifications. It is too early to determine whether these newly classified substructures represent new insights into the evolution of a structural framework important to many proteins. What is apparent from on-going work is that these substructures have the potential to be useful probes in finding remote sequence homology and in structure prediction studies. The characteristics of the complete all-by-all comparison of the polypeptide chains present in the PDB and details of the filtering procedure by pair-wise structure alignment that led to the emergent substructure gallery are discussed.
Substructure classification, alignments, and tools to analyze them are available at http://cl.sdsc.edu/ce.html.
Predicting Protein-protein Association Rates using Coarse-grained Simulation and Machine Learning
NASA Astrophysics Data System (ADS)
Xie, Zhong-Ru; Chen, Jiawen; Wu, Yinghao
2017-04-01
Protein-protein interactions dominate all major biological processes in living cells. We have developed a new Monte Carlo-based simulation algorithm to study the kinetic process of protein association. We tested our method on a previously used large benchmark set of 49 protein complexes. The predicted rate was overestimated in the benchmark test compared to the experimental results for a group of protein complexes. We hypothesized that this resulted from molecular flexibility at the interface regions of the interacting proteins. After applying a machine learning algorithm with input variables that accounted for both the conformational flexibility and the energetic factor of binding, we successfully identified most of the protein complexes with overestimated association rates and improved our final prediction by using a cross-validation test. This method was then applied to a new independent test set and resulted in a similar prediction accuracy to that obtained using the training set. It has been thought that diffusion-limited protein association is dominated by long-range interactions. Our results provide strong evidence that the conformational flexibility also plays an important role in regulating protein association. Our studies provide new insights into the mechanism of protein association and offer a computationally efficient tool for predicting its rate.
Capital planning for clinical integration.
Grauman, Daniel M; Neff, Gerald; Johnson, Molly Martha
2011-04-01
When assessing the financial implications of a physician alignment and clinical integration initiative, a hospital should measure the initiative's potential ROI, perhaps best using a combination of net present value and payback period. The hospital should compare its own historical and projected performance with rating agency median benchmarks for key financial indicators of profitability, debt service, capital and cash flow, and liquidity. The hospital should also consider potential indirect benefits, such as retained outpatient/ancillary revenue, increased inpatient revenue, improved cost control, and improved quality and reporting transparency.
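The two measures named above, net present value and payback period, reduce to short calculations. The sketch below is a minimal illustration, assuming yearly cash flows with the initial outlay at year 0; the function names, the discount rate and the dollar figures are hypothetical and are not taken from the article.

```python
def net_present_value(rate, cash_flows):
    """Discount yearly cash flows (index 0 = today) at the given annual rate."""
    return sum(cf / (1.0 + rate) ** year for year, cf in enumerate(cash_flows))


def payback_period(cash_flows):
    """Years until cumulative undiscounted cash flow reaches zero.

    Interpolates linearly within the year in which break-even occurs and
    returns None if the outlay is never recovered.
    """
    cumulative = 0.0
    for year, cf in enumerate(cash_flows):
        previous = cumulative
        cumulative += cf
        if cumulative >= 0:
            if year == 0:
                return 0.0
            # previous < 0 here, so cf > 0 and the fraction lies in (0, 1].
            return (year - 1) + (-previous) / cf
    return None


if __name__ == "__main__":
    # Hypothetical initiative: 1,000 outlay, then 400 per year for three years.
    flows = [-1000.0, 400.0, 400.0, 400.0]
    print(net_present_value(0.10, flows))
    print(payback_period(flows))
```

In this made-up case the outlay is paid back within the three years, yet the net present value at a 10% discount rate is slightly negative, which shows why the two measures are more informative together than either is alone.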
Structural alignment of protein descriptors - a combinatorial model.
Antczak, Maciej; Kasprzak, Marta; Lukasiak, Piotr; Blazewicz, Jacek
2016-09-17
Structural alignment of proteins is one of the most challenging problems in molecular biology. The tertiary structure of a protein strictly correlates with its function and computationally predicted structures are nowadays a main premise for understanding the latter. However, computationally derived 3D models often exhibit deviations from the native structure. A way to confirm a model is a comparison with other structures. The structural alignment of a pair of proteins can be defined with the use of a concept of protein descriptors. The protein descriptors are local substructures of protein molecules, which allow us to divide the original problem into a set of subproblems and, consequently, to propose a more efficient algorithmic solution. In the literature, one can find many applications of the descriptors concept that prove its usefulness for insight into protein 3D structures, but the proposed approaches are presented rather from the biological perspective than from the computational or algorithmic point of view. Efficient algorithms for identification and structural comparison of descriptors can become crucial components of methods for structural quality assessment as well as tertiary structure prediction. In this paper, we propose a new combinatorial model and new polynomial-time algorithms for the structural alignment of descriptors. The model is based on the maximum-size assignment problem, which we define here and prove that it can be solved in polynomial time. We demonstrate suitability of this approach by comparison with an exact backtracking algorithm. Besides a simplification coming from the combinatorial modeling, both on the conceptual and complexity level, we gain with this approach high quality of obtained results, in terms of 3D alignment accuracy and processing efficiency. 
All the proposed algorithms were developed and integrated in a computationally efficient tool descs-standalone, which allows the user to identify and structurally compare descriptors of biological molecules, such as proteins and RNAs. Both PDB (Protein Data Bank) and mmCIF (macromolecular Crystallographic Information File) formats are supported. The proposed tool is available as an open source project stored on GitHub ( https://github.com/mantczak/descs-standalone ).
A PDB-wide, evolution-based assessment of protein-protein interfaces.
Baskaran, Kumaran; Duarte, Jose M; Biyani, Nikhil; Bliven, Spencer; Capitani, Guido
2014-10-18
Thanks to the growth in sequence and structure databases, more than 50 million sequences are now available in UniProt and 100,000 structures in the PDB. Rich information about protein-protein interfaces can be obtained by a comprehensive study of protein contacts in the PDB, their sequence conservation and geometric features. An automated computational pipeline was developed to run our Evolutionary Protein-Protein Interface Classifier (EPPIC) software on the entire PDB and store the results in a relational database, currently containing > 800,000 interfaces. This allows the analysis of interface data on a PDB-wide scale. Two large benchmark datasets of biological interfaces and crystal contacts, each containing about 3000 entries, were automatically generated based on criteria thought to be strong indicators of interface type. The BioMany set of biological interfaces includes NMR dimers solved as crystal structures and interfaces that are preserved across diverse crystal forms, as catalogued by the Protein Common Interface Database (ProtCID) from Xu and Dunbrack. The second dataset, XtalMany, is derived from interfaces that would lead to infinite assemblies and are therefore crystal contacts. BioMany and XtalMany were used to benchmark the EPPIC approach. The performance of EPPIC was also compared to classifications from the Protein Interfaces, Surfaces, and Assemblies (PISA) program on a PDB-wide scale, finding that the two approaches give the same call in about 88% of PDB interfaces. By comparing our safest predictions to the PDB author annotations, we provide a lower-bound estimate of the error rate of biological unit annotations in the PDB. Additionally, we developed a PyMOL plugin for direct download and easy visualization of EPPIC interfaces for any PDB entry. Both the datasets and the PyMOL plugin are available at http://www.eppic-web.org/ewui/#downloads.
Our computational pipeline allows us to analyze protein-protein contacts and their sequence conservation across the entire PDB. Two new benchmark datasets are provided, which are over an order of magnitude larger than existing manually curated ones. These tools enable the comprehensive study of several aspects of protein-protein contacts in the PDB and represent a basis for future, even larger scale studies of protein-protein interactions.
Tripathi, Pooja; Pandey, Paras N
2017-07-07
The present work employs pseudo amino acid composition (PseAAC) to encode protein sequences in numeric form. The encoded sequences are then arranged in a similarity matrix, which serves as input for a spectral graph clustering method. Spectral methods have previously been used for clustering protein sequences, but they use pairwise alignment scores of the protein sequences in the similarity matrix. The alignment score depends on the length of the sequences, so clustering short and long sequences together may not be a good idea. This motivated the idea of combining PseAAC with a spectral clustering algorithm. We extensively tested our method and compared its performance with other existing machine learning methods. It is consistently observed that the number of clusters obtained for a given set of proteins is close to the number of superfamilies in that set, and that PseAAC combined with spectral graph clustering shows the best classification results. Copyright © 2017 Elsevier Ltd. All rights reserved.
Template-based protein structure modeling using the RaptorX web server.
Källberg, Morten; Wang, Haipeng; Wang, Sheng; Peng, Jian; Wang, Zhiyong; Lu, Hui; Xu, Jinbo
2012-07-19
A key challenge of modern biology is to uncover the functional role of the protein entities that compose cellular proteomes. To this end, the availability of reliable three-dimensional atomic models of proteins is often crucial. This protocol presents a community-wide web-based method using RaptorX (http://raptorx.uchicago.edu/) for protein secondary structure prediction, template-based tertiary structure modeling, alignment quality assessment and sophisticated probabilistic alignment sampling. RaptorX distinguishes itself from other servers by the quality of the alignment between a target sequence and one or multiple distantly related template proteins (especially those with sparse sequence profiles) and by a novel nonlinear scoring function and a probabilistic-consistency algorithm. Consequently, RaptorX delivers high-quality structural models for many targets with only remote templates. At present, it takes RaptorX ~35 min to finish processing a sequence of 200 amino acids. Since its official release in August 2011, RaptorX has processed ~6,000 sequences submitted by ~1,600 users from around the world.
Next Generation Science Partnerships
NASA Astrophysics Data System (ADS)
Magnusson, J.
2016-02-01
I will provide an overview of the Next Generation Science Standards (NGSS) and demonstrate how scientists and educators can use these standards to strengthen and enhance their collaborations. The NGSS are rich in content and practice and provide all students with an internationally-benchmarked science education. Using these state-led standards to guide outreach efforts can help develop and sustain effective and mutually beneficial teacher-researcher partnerships. Aligning outreach with the three dimensions of the standards can help make research relevant for target audiences by intentionally addressing the science practices, cross-cutting concepts, and disciplinary core ideas of the K-12 science curriculum that drives instruction and assessment. Collaborations between researchers and educators that are based on this science framework are more sustainable because they address the needs of both scientists and educators. Educators are better able to utilize science content that aligns with their curriculum. Scientists who learn about the NGSS can better understand the frameworks under which educators work, which can lead to more extensive and focused outreach with teachers as partners. Based on this model, the International Ocean Discovery Program (IODP) develops its education materials in conjunction with scientists and educators to produce accurate, standards-aligned activities and curriculum-based interactions with researchers. I will highlight examples of IODP's current, successful teacher-researcher collaborations that are intentionally aligned with the NGSS.
NASA Astrophysics Data System (ADS)
Couture, Jean; Boily, Edouard; Simard, Marc-Alain
1996-05-01
The research and development group at Loral Canada is now in the second phase of developing a data fusion demonstration model (DFDM) for naval anti-air warfare, to be used as a workbench tool for exploratory research. This project has emphatically addressed how the concepts related to fusion could be implemented within the Canadian Patrol Frigate (CPF) software environment. The project has been designed to read data passively on the CPF bus without any modification to the CPF software. This has brought to light important time alignment issues, since the CPF sensors and the CPF command and control system were not originally designed to support a track management function which fuses information. The fusion of data from non-organic sensors with the tactical Link-11 data has produced stimulating spatial alignment problems, which have been overcome by the use of a geodetic referencing coordinate system. Some benchmark scenarios have been selected to quantitatively demonstrate the capabilities of this fusion implementation. This paper describes the implementation design of DFDM (version 2) and summarizes the results obtained so far when fusing the simulated scenario data.
Scherer, N M; Basso, D M
2008-09-16
DNATagger is a web-based tool for coloring and editing DNA, RNA and protein sequences and alignments. It is dedicated to the visualization of protein coding sequences and also protein sequence alignments to facilitate the comprehension of evolutionary processes in sequence analysis. The distinctive feature of DNATagger is the use of codons as informative units for coloring DNA and RNA sequences. The codons are colored according to their corresponding amino acids. It is the first program that colors codons in DNA sequences without being affected by "out-of-frame" gaps of alignments. It can handle single gaps and gaps inside the triplets. The program also provides the possibility to edit the alignments and change color patterns and translation tables. DNATagger is a JavaScript application, following the W3C guidelines, designed to work on standards-compliant web browsers. It therefore requires no installation and is platform independent. The web-based DNATagger is available as free and open source software at http://www.inf.ufrgs.br/~dmbasso/dnatagger/.
Analysis of Ribosome Inactivating Protein (RIP): A Bioinformatics Approach
NASA Astrophysics Data System (ADS)
Jothi, G. Edward Gnana; Majilla, G. Sahaya Jose; Subhashini, D.; Deivasigamani, B.
2012-10-01
In spite of the medical advances in recent years, the world is in need of different sources to address certain health issues. Ribosome Inactivating Proteins (RIPs) were found to be one among them. To provide easy access to information about RIPs, they need to be analysed with a view to constructing a RIP database. In addition, multiple sequence alignment was used to screen for homologues of significant RIPs from rare sources against RIPs from easily available sources in terms of similarity. Protein sequences were retrieved from SWISS-PROT and further analysed using pairwise and multiple sequence alignment. The analysis shows that 151 RIPs have been characterized to date. Amongst them, there are 87 type I, 37 type II, 1 type III and 25 unknown RIPs. For the various RIPs, sequence length information indicating whether a full or partial sequence is available was also collected. The multiple sequence alignment of 37 type I RIPs using the online server Multalin indicates the presence of 20 conserved residues. Pairwise alignment and multiple sequence alignment of certain selected RIPs in two groups, namely Group I and Group II, were carried out, and the consensus levels were found to be 98%, 98% and 90%, respectively.
Pugsley, Haley R.; Swearingen, Kristian E.; Dovichi, Norman J.
2009-01-01
A number of algorithms have been developed to correct for migration time drift in capillary electrophoresis. Those algorithms require identification of common components in each run. However, not all components may be present or resolved in separations of complex samples, which can confound attempts for alignment. This paper reports the use of fluorescein thiocarbamyl derivatives of amino acids as internal standards for alignment of 3-(2-furoyl)quinoline-2-carboxaldehyde (FQ)-labeled proteins in capillary sieving electrophoresis. The fluorescein thiocarbamyl derivative of aspartic acid migrates before FQ-labeled proteins and the fluorescein thiocarbamyl derivative of arginine migrates after the FQ-labeled proteins. These compounds were used as internal standards to correct for variations in migration time over a two-week period in the separation of a cellular homogenate. The experimental conditions were deliberately manipulated by varying electric field and sample preparation conditions. Three components of the homogenate were used to evaluate the alignment efficiency. Before alignment, the average relative standard deviation in migration time for these components was 13.3%. After alignment, the average relative standard deviation in migration time for these components was reduced to 0.5%. PMID:19249052
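The two-standard correction described above can be illustrated with a short sketch. The abstract does not give the correction formula, so the linear two-point mapping below is an assumption, and the function names and the numbers in the usage lines are hypothetical.

```python
def align_migration_time(t, early_obs, late_obs, early_ref, late_ref):
    """Map an observed migration time onto a reference time axis.

    early_obs / late_obs: observed times of the two internal standards in this run
    (one migrating before the analytes, one after).
    early_ref / late_ref: times of the same two standards in the reference run.
    Assumes drift is linear between the two standards (an assumption of this sketch).
    """
    if late_obs == early_obs:
        raise ValueError("internal standards must have distinct migration times")
    scale = (late_ref - early_ref) / (late_obs - early_obs)
    return early_ref + (t - early_obs) * scale


def relative_std_percent(values):
    """Sample relative standard deviation of a list of times, in percent."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return 100.0 * variance ** 0.5 / mean


# Hypothetical run in which every peak migrates 20% later than in the reference run.
aligned_peak = align_migration_time(240.0, 120.0, 360.0, 100.0, 300.0)
```

By construction the two standards map exactly onto their reference times, so only peaks between them are interpolated; `relative_std_percent` can then be applied to a component's times before and after alignment to compare the spread.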
An efficient algorithm for pairwise local alignment of protein interaction networks
Chen, Wenbin; Schmidt, Matthew; Tian, Wenhong; ...
2015-04-01
Recently, researchers seeking to understand, modify, and create beneficial traits in organisms have looked for evolutionarily conserved patterns of protein interactions. Their conservation likely means that the proteins of these conserved functional modules are important to the trait's expression. In this paper, we formulate the problem of identifying these conserved patterns as a graph optimization problem, and develop a fast heuristic algorithm for this problem. We compare the performance of our network alignment algorithm to that of the MaWISh algorithm [Koyuturk M, Kim Y, Topkara U, Subramaniam S, Szpankowski W, Grama A, Pairwise alignment of protein interaction networks, J Comput Biol 13(2): 182-199, 2006.], which bases its search algorithm on a related decision problem formulation. We find that our algorithm discovers conserved modules with a larger number of proteins in an order of magnitude less time. In conclusion, the protein sets found by our algorithm correspond to known conserved functional modules at comparable precision and recall rates as those produced by the MaWISh algorithm.
Coan, Heather B.; Youker, Robert T.
2017-01-01
Understanding how proteins mutate is critical to solving a host of biological problems. Mutations occur when an amino acid is substituted for another in a protein sequence. The set of likelihoods for amino acid substitutions is stored in a matrix and input to alignment algorithms. The quality of the resulting alignment is used to assess the similarity of two or more sequences and can vary according to assumptions modeled by the substitution matrix. Substitution strategies with minor parameter variations are often grouped together in families. For example, the BLOSUM and PAM matrix families are commonly used because they provide a standard, predefined way of modeling substitutions. However, researchers often do not know if a given matrix family or any individual matrix within a family is the most suitable. Furthermore, predefined matrix families may inaccurately reflect a particular hypothesis that a researcher wishes to model or otherwise result in unsatisfactory alignments. In these cases, the ability to compare the effects of one or more custom matrices may be needed. This laborious process is often performed manually because the ability to simultaneously load multiple matrices and then compare their effects on alignments is not readily available in current software tools. This paper presents SubVis, an interactive R package for loading and applying multiple substitution matrices to pairwise alignments. Users can simultaneously explore alignments resulting from multiple predefined and custom substitution matrices. SubVis utilizes several of the alignment functions found in R, a common language among protein scientists. Functions are tied together with the Shiny platform which allows the modification of input parameters. Information regarding alignment quality and individual amino acid substitutions is displayed with the JavaScript language which provides interactive visualizations for revealing both high-level and low-level alignment information. PMID:28674656
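To make the idea of comparing substitution matrices concrete, here is a minimal Python sketch (not SubVis, which is an R package, and not its API). It scores the same pair of sequences under two toy scoring functions using a standard Needleman-Wunsch global alignment with a linear gap penalty. The matrices, the gap penalty and the example sequences are all hypothetical and chosen only for illustration.

```python
def global_alignment_score(a, b, score, gap=-4):
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    score(x, y) plays the role of the substitution matrix lookup.
    Only the previous row of the dynamic programming table is kept.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * gap]
        for j, y in enumerate(b, start=1):
            curr.append(max(
                prev[j - 1] + score(x, y),  # align x with y
                prev[j] + gap,              # x aligned to a gap
                curr[j - 1] + gap,          # y aligned to a gap
            ))
        prev = curr
    return prev[-1]


def identity_matrix(x, y):
    # Toy matrix: reward identities only.
    return 2 if x == y else -1


def lenient_matrix(x, y):
    # Toy matrix: additionally treats I, L and V as similar residues.
    similar = {"I", "L", "V"}
    if x == y:
        return 2
    if x in similar and y in similar:
        return 1
    return -1


matrices = {"identity": identity_matrix, "hydrophobic-lenient": lenient_matrix}
scores = {name: global_alignment_score("MKVLA", "MKILA", m)
          for name, m in matrices.items()}
```

The two entries of `scores` differ only because of how the V/I substitution is scored, which is the kind of side-by-side comparison the abstract describes.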
Pitman, A; Jones, D N; Stuart, D; Lloydhope, K; Mallitt, K; O'Rourke, P
2009-10-01
The study reports on the evolution of the Australian radiologist relative value unit (RVU) model of measuring radiologist reporting workloads in teaching hospital departments, and aims to outline a way forward for the development of a broad national safety, quality and performance framework that enables value mapping, measurement and benchmarking. The Radiology International Benchmarking Project of Queensland Health provided a suitable high-level national forum where the existing Pitman-Jones RVU model was applied to contemporaneous data, and its shortcomings and potential avenues for future development were analysed. Application of the Pitman-Jones model to Queensland data and also a Victorian benchmark showed that the original recommendation of 40,000 crude RVU per full-time equivalent consultant radiologist (97-98 baseline level) has risen only moderately, to now lie around 45,000 crude RVU/full-time equivalent. Notwithstanding this, the model has a number of weaknesses and is becoming outdated, as it cannot capture newer time-consuming examinations particularly in CT. A significant re-evaluation of the value of medical imaging is required, and is now occurring. We must rethink how we measure, benchmark, display and continually improve medical imaging safety, quality and performance, throughout the imaging care cycle and beyond. It will be necessary to ensure alignment with patient needs, as well as clinical and organisational objectives. Clear recommendations for the development of an updated national reporting workload RVU system are available, and an opportunity now exists for developing a much broader national model. A more sophisticated and balanced multidimensional safety, quality and performance framework that enables measurement and benchmarking of all important elements of health-care service is needed.
Improving Protein Fold Recognition by Deep Learning Networks.
Jo, Taeho; Hou, Jie; Eickholt, Jesse; Cheng, Jianlin
2015-12-04
For accurate recognition of protein folds, a deep learning network method (DN-Fold) was developed to predict if a given query-template protein pair belongs to the same structural fold. The input used stemmed from the protein sequence and structural features extracted from the protein pair. We evaluated the performance of DN-Fold along with 18 different methods on Lindahl's benchmark dataset and on a large benchmark set extracted from SCOP 1.75 consisting of about one million protein pairs, at three different levels of fold recognition (i.e., protein family, superfamily, and fold) depending on the evolutionary distance between protein sequences. The correct recognition rate of ensembled DN-Fold for Top 1 predictions is 84.5%, 61.5%, and 33.6% and for Top 5 is 91.2%, 76.5%, and 60.7% at family, superfamily, and fold levels, respectively. We also evaluated the performance of single DN-Fold (DN-FoldS), which showed results comparable to ensemble DN-Fold at the family and superfamily levels. Finally, we extended the binary classification problem of fold recognition to a real-value regression task, which also showed promising performance. DN-Fold is freely available through a web server at http://iris.rnet.missouri.edu/dnfold.
Alignment hierarchies: engineering architecture from the nanometre to the micrometre scale.
Kureshi, Alvena; Cheema, Umber; Alekseeva, Tijna; Cambrey, Alison; Brown, Robert
2010-12-06
Natural tissues are built of metabolites, soluble proteins and solid extracellular matrix components (largely fibrils) together with cells. These are configured in highly organized hierarchies of structure across length scales from nanometre to millimetre, with alignments that are dominated by anisotropies in their fibrillar matrix. If we are to successfully engineer tissues, these hierarchies need to be mimicked with an understanding of the interaction between them. In particular, the movement of different elements of the tissue (e.g. molecules, cells and bulk fluids) is controlled by matrix structures at distinct scales. We present three novel systems to introduce alignment of collagen fibrils, cells and growth factor gradients within a three-dimensional collagen scaffold using fluid flow, embossing and layering of construct. Importantly, these can be seen as different parts of the same hierarchy of three-dimensional structure, as they are all formed into dense collagen gels. Fluid flow aligns collagen fibrils at the nanoscale, embossed topographical features provide alignment cues at the microscale and introducing layered configuration to three-dimensional collagen scaffolds provides microscale- and mesoscale-aligned pathways for protein factor delivery as well as barriers to confine protein diffusion to specific spatial directions. These seemingly separate methods can be employed to increase complexity of simple extracellular matrix scaffolds, providing insight into new approaches to directly fabricate complex physical and chemical cues at different hierarchical scales, similar to those in natural tissues.
TIM Barrel Protein Structure Classification Using Alignment Approach and Best Hit Strategy
NASA Astrophysics Data System (ADS)
Chu, Jia-Han; Lin, Chun Yuan; Chang, Cheng-Wen; Lee, Chihan; Yang, Yuh-Shyong; Tang, Chuan Yi
2007-11-01
The classification of protein structures is essential for their function determination in bioinformatics. It has been estimated from the Structural Classification of Proteins (SCOP) database that around 10% of all known enzymes have TIM barrel domains. With its high sequence variation and diverse functionalities, the TIM barrel protein has become an attractive target for protein engineering and for evolutionary studies. Hence, in this paper, an alignment approach with the best hit strategy is proposed to classify TIM barrel protein structures at the superfamily and family levels in SCOP. This work is also used to classify at the class level in the Enzyme nomenclature (ENZYME) database. Two testing data sets, TIM40D and TIM95D, are used to evaluate this approach. The resulting classification has an overall prediction accuracy rate of 90.3% for the superfamily level in SCOP, 89.5% for the family level in SCOP and 70.1% for the class level in ENZYME. These results demonstrate that the alignment approach with the best hit strategy is a simple and viable method for TIM barrel protein structure classification, even when only amino acid sequence information is available.
Multiple DNA and protein sequence alignment on a workstation and a supercomputer.
Tajima, K
1988-11-01
This paper describes a multiple alignment method using a workstation and supercomputer. The method is based on the alignment of a set of aligned sequences with the new sequence, and uses a recursive procedure of such alignment. The alignment is executed in a reasonable computation time on diverse levels from a workstation to a supercomputer, from the viewpoint of alignment results and computational speed by parallel processing. The application of the algorithm is illustrated by several examples of multiple alignment of 12 amino acid and DNA sequences of HIV (human immunodeficiency virus) env genes. Colour graphic programs on a workstation and parallel processing on a supercomputer are discussed.
Statistical discovery of site inter-dependencies in sub-molecular hierarchical protein structuring.
Durston, Kirk K; Chiu, David Ky; Wong, Andrew Kc; Li, Gary Cl
2012-07-13
Much progress has been made in understanding the 3D structure of proteins using methods such as NMR and X-ray crystallography. The resulting 3D structures are extremely informative, but do not always reveal which sites and residues within the structure are of special importance. Recently, there are indications that multiple-residue, sub-domain structural relationships within the larger 3D consensus structure of a protein can be inferred from the analysis of the multiple sequence alignment data of a protein family. These intra-dependent clusters of associated sites are used to indicate hierarchical inter-residue relationships within the 3D structure. To reveal the patterns of associations among individual amino acids or sub-domain components within the structure, we apply a k-modes attribute (aligned site) clustering algorithm to the ubiquitin and transthyretin families in order to discover associations among groups of sites within the multiple sequence alignment. We then observe what these associations imply within the 3D structure of these two protein families. The k-modes site clustering algorithm we developed maximizes the intra-group interdependencies based on a normalized mutual information measure. The clusters formed correspond to sub-structural components or binding and interface locations. Applying this data-directed method to the ubiquitin and transthyretin protein family multiple sequence alignments as a test bed, we located numerous interesting associations of interdependent sites. These clusters were then arranged into cluster tree diagrams which revealed four structural sub-domains within the single domain structure of ubiquitin and a single large sub-domain within transthyretin associated with the interface among transthyretin monomers. In addition, several clusters of mutually interdependent sites were discovered for each protein family, each of which appear to play an important role in the molecular structure and/or function. 
Our results demonstrate that the method we present here using a k-modes site clustering algorithm based on interdependency evaluation among sites obtained from a sequence alignment of homologous proteins can provide significant insights into the complex, hierarchical inter-residue structural relationships within the 3D structure of a protein family.
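The interdependency measure between aligned sites can be sketched in a few lines of Python. The abstract only says the measure is a normalized mutual information; dividing the mutual information by the joint entropy, as done below, is one common normalization and is an assumption here, as are the function names and the toy columns in the test values.

```python
from collections import Counter
from math import log2


def entropy(symbols):
    """Shannon entropy (in bits) of the empirical distribution of a sequence of symbols."""
    n = len(symbols)
    return -sum((c / n) * log2(c / n) for c in Counter(symbols).values())


def normalized_mutual_information(col_x, col_y):
    """I(X;Y) / H(X,Y) for two alignment columns of equal length.

    Each column holds the residue observed at one aligned site, one entry per sequence.
    Returns 0.0 when the joint entropy is zero (both columns fully conserved).
    """
    if len(col_x) != len(col_y):
        raise ValueError("columns must come from the same alignment")
    h_xy = entropy(list(zip(col_x, col_y)))
    if h_xy == 0:
        return 0.0
    mutual_information = entropy(col_x) + entropy(col_y) - h_xy
    return mutual_information / h_xy
```

Two sites that always change together score 1, and two sites that vary independently score 0; a site clustering procedure such as the k-modes approach in the abstract would group sites so that values like these are high within each cluster.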
Predicting Protein–protein Association Rates using Coarse-grained Simulation and Machine Learning
Xie, Zhong-Ru; Chen, Jiawen; Wu, Yinghao
2017-01-01
Protein–protein interactions dominate all major biological processes in living cells. We have developed a new Monte Carlo-based simulation algorithm to study the kinetic process of protein association. We tested our method on a previously used large benchmark set of 49 protein complexes. The predicted rate was overestimated in the benchmark test compared to the experimental results for a group of protein complexes. We hypothesized that this resulted from molecular flexibility at the interface regions of the interacting proteins. After applying a machine learning algorithm with input variables that accounted for both the conformational flexibility and the energetic factor of binding, we successfully identified most of the protein complexes with overestimated association rates and improved our final prediction by using a cross-validation test. This method was then applied to a new independent test set and resulted in a similar prediction accuracy to that obtained using the training set. It has been thought that diffusion-limited protein association is dominated by long-range interactions. Our results provide strong evidence that the conformational flexibility also plays an important role in regulating protein association. Our studies provide new insights into the mechanism of protein association and offer a computationally efficient tool for predicting its rate. PMID:28418043
Timme, Ruth E; Rand, Hugh; Shumway, Martin; Trees, Eija K; Simmons, Mustafa; Agarwala, Richa; Davis, Steven; Tillman, Glenn E; Defibaugh-Chavez, Stephanie; Carleton, Heather A; Klimke, William A; Katz, Lee S
2017-01-01
As next generation sequence technology has advanced, there have been parallel advances in genome-scale analysis programs for determining evolutionary relationships as proxies for epidemiological relationship in public health. Most new programs skip traditional steps of ortholog determination and multi-gene alignment, instead identifying variants across a set of genomes, then summarizing results in a matrix of single-nucleotide polymorphisms or alleles for standard phylogenetic analysis. However, public health authorities need to document the performance of these methods with appropriate and comprehensive datasets so they can be validated for specific purposes, e.g., outbreak surveillance. Here we propose a set of benchmark datasets to be used for comparison and validation of phylogenomic pipelines. We identified four well-documented foodborne pathogen events in which the epidemiology was concordant with routine phylogenomic analyses (reference-based SNP and wgMLST approaches). These are ideal benchmark datasets, as the trees, WGS data, and epidemiological data for each are all in agreement. We have placed these sequence data, sample metadata, and "known" phylogenetic trees in publicly-accessible databases and developed a standard descriptive spreadsheet format describing each dataset. To facilitate easy downloading of these benchmarks, we developed an automated script that uses the standard descriptive spreadsheet format. Our "outbreak" benchmark datasets represent the four major foodborne bacterial pathogens (Listeria monocytogenes, Salmonella enterica, Escherichia coli, and Campylobacter jejuni) and one simulated dataset where the "known tree" can be accurately called the "true tree". The downloading script and associated table files are available on GitHub: https://github.com/WGS-standards-and-analysis/datasets.
These five benchmark datasets will help standardize comparison of current and future phylogenomic pipelines, and facilitate important cross-institutional collaborations. Our work is part of a global effort to provide collaborative infrastructure for sequence data and analytic tools; we welcome additional benchmark datasets in our recommended format, and, if relevant, we will add these on our GitHub site. Together, these datasets, dataset format, and the underlying GitHub infrastructure present a recommended path for worldwide standardization of phylogenomic pipelines.
A new graph-based method for pairwise global network alignment
Klau, Gunnar W
2009-01-01
Background In addition to component-based comparative approaches, network alignments provide the means to study conserved network topology such as common pathways and more complex network motifs. Yet, unlike in classical sequence alignment, the comparison of networks becomes computationally more challenging, as most meaningful assumptions instantly lead to NP-hard problems. Most previous algorithmic work on network alignments is heuristic in nature. Results We introduce the graph-based maximum structural matching formulation for pairwise global network alignment. We relate the formulation to previous work and prove NP-hardness of the problem. Based on the new formulation we build upon recent results in computational structural biology and present a novel Lagrangian relaxation approach that, in combination with a branch-and-bound method, computes provably optimal network alignments. The Lagrangian algorithm alone is a powerful heuristic method, which produces solutions that are often near-optimal and – unlike those computed by pure heuristics – come with a quality guarantee. Conclusion Computational experiments on the alignment of protein-protein interaction networks and on the classification of metabolic subnetworks demonstrate that the new method is reasonably fast and has advantages over pure heuristics. Our software tool is freely available as part of the LISA library. PMID:19208162
Efficient alignment-free DNA barcode analytics
Kuksa, Pavel; Pavlovic, Vladimir
2009-01-01
Background In this work we consider barcode DNA analysis problems and address them using alternative, alignment-free methods and representations which model sequences as collections of short sequence fragments (features). The methods use fixed-length representations (spectrum) for barcode sequences to measure similarities or dissimilarities between sequences coming from the same or different species. The spectrum-based representation not only allows for accurate and computationally efficient species classification, but also opens the possibility of accurate clustering analysis of putative species barcodes and identification of critical within-barcode loci distinguishing barcodes of different sample groups. Results New alignment-free methods provide highly accurate and fast DNA barcode-based identification and classification of species with substantial improvements in accuracy and speed over state-of-the-art barcode analysis methods. We evaluate our methods on problems of species classification and identification using barcodes, important and relevant analytical tasks in many practical applications (adverse species movement monitoring, sampling surveys for unknown or pathogenic species identification, biodiversity assessment, etc.). On several benchmark barcode datasets, including ACG, Astraptes, Hesperiidae, Fish larvae, and Birds of North America, the proposed alignment-free methods considerably improve prediction accuracy compared to prior results. We also observe significant running time improvements over the state-of-the-art methods. Conclusion Our results show that newly developed alignment-free methods for DNA barcoding can efficiently and with high accuracy identify specimens by examining only a few barcode features, resulting in increased scalability and interpretability of current computational approaches to barcoding. PMID:19900305
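The spectrum representation described above can be illustrated with a minimal sketch. It assumes a plain k-mer count vector and cosine similarity; the function names (kmer_spectrum, cosine_similarity) and the choice of cosine as the similarity measure are illustrative assumptions, not the authors' actual kernels.

```python
from collections import Counter
from math import sqrt


def kmer_spectrum(seq, k):
    """Count every overlapping k-mer in seq (hypothetical helper name)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two k-mer count vectors.

    Cosine is an assumed, generic choice of similarity here.
    """
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty spectrum shares nothing with any sequence
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    s1 = kmer_spectrum("ACGTACGTAC", 3)
    s2 = kmer_spectrum("ACGTTCGTAC", 3)
    print(cosine_similarity(s1, s2))
```

No alignment is computed at any point: two barcodes are compared only through the short fragments they share, which is what makes this family of methods fast.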
Miao, Zhichao; Westhof, Eric
2016-07-08
RBscore&NBench combines a web server, RBscore and a database, NBench. RBscore predicts RNA-/DNA-binding residues in proteins and visualizes the prediction scores and features on protein structures. The scoring scheme of RBscore directly links feature values to nucleic acid binding probabilities and illustrates the nucleic acid binding energy funnel on the protein surface. To avoid dataset, binding site definition and assessment metric biases, we compared RBscore with 18 web servers and 3 stand-alone programs on 41 datasets, which demonstrated the high and stable accuracy of RBscore. A comprehensive comparison led us to develop a benchmark database named NBench. The web server is available on: http://ahsoka.u-strasbg.fr/rbscorenbench/. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
Moghram, Basem Ameen; Nabil, Emad; Badr, Amr
2018-01-01
T-cell epitope structure identification is a significant and challenging immunoinformatic problem within epitope-based vaccine design. Epitopes or antigenic peptides are a set of amino acids that bind with the Major Histocompatibility Complex (MHC) molecules. The resulting complexes are presented by Antigen Presenting Cells to be inspected by T-cells. MHC-molecule-binding epitopes are responsible for triggering the immune response to antigens. The epitope's three-dimensional (3D) molecular structure (i.e., tertiary structure) reflects its proper function. Therefore, the identification of the structure of MHC class-II epitopes is a significant step towards epitope-based vaccine design and understanding of the immune system. In this paper, we propose a new technique using a Genetic Algorithm for Predicting the Epitope Structure (GAPES), to predict the structure of MHC class-II epitopes based on their sequence. The proposed Elitist-based genetic algorithm for predicting the epitope's tertiary structure is based on the Ab-Initio Empirical Conformational Energy Program for Peptides (ECEPP) Force Field Model. The developed secondary structure prediction technique relies on the Ramachandran Plot. We used two alignment algorithms: the ROSS alignment and TM-Score alignment. We applied four different alignment approaches to calculate the similarity scores of the dataset under test. We utilized the support vector machine (SVM) classifier to evaluate the prediction performance. The prediction accuracy and the Area Under Receiver Operating Characteristic (ROC) Curve (AUC) were calculated as measures of performance. The calculations are performed on twelve similarity-reduced datasets of the Immune Epitope Data Base (IEDB) and a large dataset of peptide-binding affinities to HLA-DRB1*0101. The results showed that GAPES was reliable and very accurate. We achieved an average prediction accuracy of 93.50% and an average AUC of 0.974 on the IEDB dataset.
Also, we achieved an accuracy of 95.125% and an AUC of 0.987 on the HLA-DRB1*0101 allele of the Wang benchmark dataset. The results indicate that the proposed prediction technique "GAPES" is a promising technique that will help researchers and scientists to predict the protein structure and it will assist them in the intelligent design of new epitope-based vaccines. Copyright © 2017 Elsevier B.V. All rights reserved.
Probabilistic biological network alignment.
Todor, Andrei; Dobra, Alin; Kahveci, Tamer
2013-01-01
Interactions between molecules are probabilistic events. An interaction may or may not happen with some probability, depending on a variety of factors such as the size, abundance, or proximity of the interacting molecules. In this paper, we consider the problem of aligning two biological networks. Unlike existing methods, we allow one of the two networks to contain probabilistic interactions. Allowing interaction probabilities makes the alignment more biologically relevant at the expense of explosive growth in the number of alternative topologies that may arise from different subsets of interactions that take place. We develop a novel method that efficiently and precisely characterizes this massive search space. We represent the topological similarity between pairs of aligned molecules (i.e., proteins) with the help of random variables and compute their expected values. We validate our method showing that, without sacrificing the running time performance, it can produce novel alignments. Our results also demonstrate that our method identifies biologically meaningful mappings under a comprehensive set of criteria used in the literature as well as the statistical coherence measure that we developed to analyze the statistical significance of the similarity of the functions of the aligned protein pairs.
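The idea of scoring an alignment by an expected value can be sketched in a few lines. This is a simplified illustration, not the authors' method: it assumes one deterministic network, one network whose interactions carry probabilities, and a fixed node mapping, and it computes the expected number of conserved interactions by linearity of expectation. All names below are hypothetical.

```python
def expected_conserved_edges(det_edges, prob_edges, mapping):
    """Expected number of deterministic edges preserved under `mapping`.

    det_edges:  iterable of (u, v) pairs from the deterministic network
    prob_edges: dict mapping frozenset({x, y}) to the probability that
                the interaction x-y exists in the probabilistic network
    mapping:    dict sending nodes of the first network to the second

    By linearity of expectation, the expected count is the sum of the
    probabilities of the image edges; no independence is required.
    """
    total = 0.0
    for u, v in det_edges:
        if u in mapping and v in mapping:
            image = frozenset((mapping[u], mapping[v]))
            total += prob_edges.get(image, 0.0)
    return total


if __name__ == "__main__":
    det = [("a", "b"), ("b", "c")]
    prob = {frozenset(("x", "y")): 0.9, frozenset(("y", "z")): 0.4}
    print(expected_conserved_edges(det, prob, {"a": "x", "b": "y", "c": "z"}))
```

Working with expectations avoids enumerating the exponentially many network topologies that different subsets of realized interactions would produce.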
Yu, Jinchao; Guerois, Raphaël
2016-12-15
Protein-protein docking methods are of great importance for understanding interactomes at the structural level. It has become increasingly appealing to use not only experimental structures but also homology models of unbound subunits as input for docking simulations. So far we are missing a large scale assessment of the success of rigid-body free docking methods on homology models. We explored how we could benefit from comparative modelling of unbound subunits to expand docking benchmark datasets. Starting from a collection of 3157 non-redundant, high X-ray resolution heterodimers, we developed the PPI4DOCK benchmark containing 1417 docking targets based on unbound homology models. Rigid-body docking by Zdock showed that for 1208 cases (85.2%), at least one correct decoy was generated, emphasizing the efficiency of rigid-body docking in generating correct assemblies. Overall, the PPI4DOCK benchmark contains a large set of realistic cases and provides new ground for assessing docking and scoring methodologies. Benchmark sets can be downloaded from http://biodev.cea.fr/interevol/ppi4dock/ Contact: guerois@cea.fr Supplementary information: Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved.
GoWeb: a semantic search engine for the life science web.
Dietze, Heiko; Schroeder, Michael
2009-10-01
Current search engines are keyword-based. Semantic technologies promise a next generation of semantic search engines, which will be able to answer questions. Current approaches either apply natural language processing to unstructured text or they assume the existence of structured statements over which they can reason. Here, we introduce a third approach, GoWeb, which combines classical keyword-based Web search with text-mining and ontologies to navigate large results sets and facilitate question answering. We evaluate GoWeb on three benchmarks of questions on genes and functions, on symptoms and diseases, and on proteins and diseases. The first benchmark is based on the BioCreAtivE 1 Task 2 and links 457 gene names with 1352 functions. GoWeb finds 58% of the functional GeneOntology annotations. The second benchmark is based on 26 case reports and links symptoms with diseases. GoWeb achieves 77% success rate improving an existing approach by nearly 20%. The third benchmark is based on 28 questions in the TREC genomics challenge and links proteins to diseases. GoWeb achieves a success rate of 79%. GoWeb's combination of classical Web search with text-mining and ontologies is a first step towards answering questions in the biomedical domain. GoWeb is online at: http://www.gopubmed.org/goweb.
Bidirectional composition on lie groups for gradient-based image alignment.
Mégret, Rémi; Authesserre, Jean-Baptiste; Berthoumieu, Yannick
2010-09-01
In this paper, a new formulation based on bidirectional composition on Lie groups (BCL) for parametric gradient-based image alignment is presented. Contrary to the conventional approaches, the BCL method takes advantage of the gradients of both template and current image without combining them a priori. Based on this bidirectional formulation, two methods are proposed and their relationship with state-of-the-art gradient-based approaches is fully discussed. The first one, i.e., the BCL method, relies on the compositional framework to provide the minimization of the compensated error with respect to an augmented parameter vector. The second one, the projected BCL (PBCL), corresponds to a close approximation of the BCL approach. A comparative study is carried out dealing with computational complexity, convergence rate and frequency of convergence. Numerical experiments using a conventional benchmark show the performance improvement especially for asymmetric levels of noise, which is also discussed from a theoretical point of view.
Ndhlovu, Andrew; Durand, Pierre M.; Hazelhurst, Scott
2015-01-01
The evolutionary rate at codon sites across protein-coding nucleotide sequences represents a valuable tier of information for aligning sequences, inferring homology and constructing phylogenetic profiles. However, a comprehensive resource for cataloguing the evolutionary rate at codon sites and their corresponding nucleotide and protein domain sequence alignments has not been developed. To address this gap in knowledge, EvoDB (an Evolutionary rates DataBase) was compiled. Nucleotide sequences and their corresponding protein domain data including the associated seed alignments from the PFAM-A (protein family) database were used to estimate evolutionary rate (ω = dN/dS) profiles at codon sites for each entry. EvoDB contains 98.83% of the gapped nucleotide sequence alignments and 97.1% of the evolutionary rate profiles for the corresponding information in PFAM-A. As the identification of codon sites under positive selection and their position in a sequence profile is usually the most sought after information for molecular evolutionary biologists, evolutionary rate profiles were determined under the M2a model using the CODEML algorithm in the PAML (Phylogenetic Analysis by Maximum Likelihood) suite of software. Validation of nucleotide sequences against amino acid data was implemented to ensure high data quality. EvoDB is a catalogue of the evolutionary rate profiles and provides the corresponding phylogenetic trees, PFAM-A alignments and annotated accession identifier data. In addition, the database can be explored and queried using known evolutionary rate profiles to identify domains under similar evolutionary constraints and pressures. EvoDB is a resource for evolutionary, phylogenetic studies and presents a tier of information untapped by current databases. Database URL: http://www.bioinf.wits.ac.za/software/fire/evodb PMID:26140928
Benchmarking pathway interaction network for colorectal cancer to identify dysregulated pathways.
Wang, Q; Shi, C-J; Lv, S-H
2017-03-30
Different pathways act synergistically to participate in many biological processes. Thus, the purpose of our study was to extract dysregulated pathways to investigate the pathogenesis of colorectal cancer (CRC) based on the functional dependency among pathways. Protein-protein interaction (PPI) information and pathway data were retrieved from STRING and Reactome databases, respectively. After genes were aligned to the pathways, each pathway activity was calculated using the principal component analysis (PCA) method, and the seed pathway was discovered. Subsequently, we constructed the pathway interaction network (PIN), where each node represented a biological pathway based on gene expression profile, PPI data, as well as pathways. Dysregulated pathways were then selected from the PIN according to classification performance and seed pathway. A PIN including 11,960 interactions was constructed to identify dysregulated pathways. Interestingly, the interaction of mRNA splicing and mRNA splicing-major pathway had the highest score of 719.8167. Maximum change of the activity score between CRC and normal samples appeared in the pathway of DNA replication, which was selected as the seed pathway. Starting with this seed pathway, a pathway set containing 30 dysregulated pathways was obtained with an area under the curve score of 0.8598. The pathway of mRNA splicing, mRNA splicing-major pathway, and RNA polymerase I had the maximum genes of 107. Moreover, we found that these 30 pathways had crosstalks with each other. The results suggest that these dysregulated pathways might be used as biomarkers to diagnose CRC.
Benchmarking infrastructure for mutation text mining
2014-01-01
Background Experimental research on the automatic extraction of information about mutations from texts is greatly hindered by the lack of consensus evaluation infrastructure for the testing and benchmarking of mutation text mining systems. Results We propose a community-oriented annotation and benchmarking infrastructure to support development, testing, benchmarking, and comparison of mutation text mining systems. The design is based on semantic standards, where RDF is used to represent annotations, an OWL ontology provides an extensible schema for the data and SPARQL is used to compute various performance metrics, so that in many cases no programming is needed to analyze results from a text mining system. While large benchmark corpora for biological entity and relation extraction are focused mostly on genes, proteins, diseases, and species, our benchmarking infrastructure fills the gap for mutation information. The core infrastructure comprises (1) an ontology for modelling annotations, (2) SPARQL queries for computing performance metrics, and (3) a sizeable collection of manually curated documents, that can support mutation grounding and mutation impact extraction experiments. Conclusion We have developed the principal infrastructure for the benchmarking of mutation text mining tasks. The use of RDF and OWL as the representation for corpora ensures extensibility. The infrastructure is suitable for out-of-the-box use in several important scenarios and is ready, in its current state, for initial community adoption. PMID:24568600
Benchmarking infrastructure for mutation text mining.
Klein, Artjom; Riazanov, Alexandre; Hindle, Matthew M; Baker, Christopher Jo
2014-02-25
Experimental research on the automatic extraction of information about mutations from texts is greatly hindered by the lack of consensus evaluation infrastructure for the testing and benchmarking of mutation text mining systems. We propose a community-oriented annotation and benchmarking infrastructure to support development, testing, benchmarking, and comparison of mutation text mining systems. The design is based on semantic standards, where RDF is used to represent annotations, an OWL ontology provides an extensible schema for the data and SPARQL is used to compute various performance metrics, so that in many cases no programming is needed to analyze results from a text mining system. While large benchmark corpora for biological entity and relation extraction are focused mostly on genes, proteins, diseases, and species, our benchmarking infrastructure fills the gap for mutation information. The core infrastructure comprises (1) an ontology for modelling annotations, (2) SPARQL queries for computing performance metrics, and (3) a sizeable collection of manually curated documents, that can support mutation grounding and mutation impact extraction experiments. We have developed the principal infrastructure for the benchmarking of mutation text mining tasks. The use of RDF and OWL as the representation for corpora ensures extensibility. The infrastructure is suitable for out-of-the-box use in several important scenarios and is ready, in its current state, for initial community adoption.
Comparative Protein Structure Modeling Using MODELLER
Webb, Benjamin; Sali, Andrej
2016-01-01
Comparative protein structure modeling predicts the three-dimensional structure of a given protein sequence (target) based primarily on its alignment to one or more proteins of known structure (templates). The prediction process consists of fold assignment, target-template alignment, model building, and model evaluation. This unit describes how to calculate comparative models using the program MODELLER and how to use the ModBase database of such models, and discusses all four steps of comparative modeling, frequently observed errors, and some applications. Modeling lactate dehydrogenase from Trichomonas vaginalis (TvLDH) is described as an example. The download and installation of the MODELLER software is also described. PMID:27322406
NASA Astrophysics Data System (ADS)
Hentschke, Reinhard; Herzfeld, Judith
1991-06-01
The reversible association of globular protein molecules in concentrated solution leads to highly polydisperse fibers, e.g., actin filaments, microtubules, and sickle-cell hemoglobin fibers. At high concentrations, excluded-volume interactions between the fibers lead to spontaneous alignment analogous to that in simple lyotropic liquid crystals. However, the phase behavior of reversibly associating proteins is complicated by the threefold coupling between the growth, alignment, and hydration of the fibers. In protein systems aggregates contain substantial solvent, which may cause them to swell or shrink, depending on osmotic stress. Extending previous work, we present a model for the equilibrium phase behavior of the above-noted protein systems in terms of simple intra- and interaggregate interactions, combined with equilibration of fiber-incorporated solvent with the bulk solvent. Specifically, we compare our model results to recent osmotic pressure data for sickle-cell hemoglobin and find excellent agreement. This comparison shows that particle interactions sufficient to cause alignment are also sufficient to squeeze significant amounts of solvent out of protein fibers. In addition, the model is in accord with findings from independent sedimentation and birefringence studies on sickle-cell hemoglobin.
PhyreStorm: A Web Server for Fast Structural Searches Against the PDB.
Mezulis, Stefans; Sternberg, Michael J E; Kelley, Lawrence A
2016-02-22
The identification of structurally similar proteins can provide a range of biological insights, and accordingly, the alignment of a query protein to a database of experimentally determined protein structures is a technique commonly used in the fields of structural and evolutionary biology. The PhyreStorm Web server has been designed to provide comprehensive, up-to-date and rapid structural comparisons against the Protein Data Bank (PDB) combined with a rich and intuitive user interface. It is intended that this facility will enable biologists inexpert in bioinformatics access to a powerful tool for exploring protein structure relationships beyond what can be achieved by sequence analysis alone. By partitioning the PDB into similar structures, PhyreStorm is able to quickly discard the majority of structures that cannot possibly align well to a query protein, reducing the number of alignments required by an order of magnitude. PhyreStorm is capable of finding 93±2% of all highly similar (TM-score>0.7) structures in the PDB for each query structure, usually in less than 60s. PhyreStorm is available at http://www.sbg.bio.ic.ac.uk/phyrestorm/. Copyright © 2015 The Authors. Published by Elsevier Ltd.. All rights reserved.
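The TM-score threshold quoted above can be made concrete with a small sketch. It evaluates the standard TM-score sum for one fixed superposition, given the distances between aligned residue pairs; the real score is the maximum over superpositions, which this sketch does not search for. The function name and the error handling for very short targets are assumptions made for illustration.

```python
def tm_score(distances, target_length):
    """TM-score contribution of one fixed superposition.

    distances:     distances (in angstroms) between aligned residue pairs
    target_length: number of residues in the target (normalizing) protein

    Uses the standard length-dependent scale
    d0 = 1.24 * (L - 15) ** (1/3) - 1.8.
    Short targets where d0 would not be positive are rejected here;
    that guard is an assumption of this sketch.
    """
    if target_length <= 15:
        raise ValueError("target too short for the d0 formula")
    d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    if d0 <= 0:
        raise ValueError("d0 is not positive for this target length")
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length


if __name__ == "__main__":
    # 120 aligned residue pairs out of a 150-residue target, each 1.5 A apart
    print(tm_score([1.5] * 120, 150))
```

Because the sum is divided by the target length rather than the number of aligned pairs, unaligned residues lower the score, so a value above 0.7 requires both good coverage and small distances.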
Prediction of β-turns in proteins from multiple alignment using neural network
Kaur, Harpreet; Raghava, Gajendra Pal Singh
2003-01-01
A neural network-based method has been developed for the prediction of β-turns in proteins by using multiple sequence alignment. Two feed-forward back-propagation networks with a single hidden layer are used where the first-sequence structure network is trained with the multiple sequence alignment in the form of PSI-BLAST–generated position-specific scoring matrices. The initial predictions from the first network and PSIPRED-predicted secondary structure are used as input to the second structure-structure network to refine the predictions obtained from the first net. A significant improvement in prediction accuracy has been achieved by using evolutionary information contained in the multiple sequence alignment. The final network yields an overall prediction accuracy of 75.5% when tested by sevenfold cross-validation on a set of 426 nonhomologous protein chains. The corresponding Qpred, Qobs, and Matthews correlation coefficient values are 49.8%, 72.3%, and 0.43, respectively, and are the best among all the previously published β-turn prediction methods. The Web server BetaTPred2 (http://www.imtech.res.in/raghava/betatpred2/) has been developed based on this approach. PMID:12592033
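The four measures reported above can be computed from a per-residue confusion matrix. The sketch below assumes the usual definitions: overall accuracy as the percentage of correctly classified residues, Qpred as the percentage of predicted turn residues that are correct, Qobs as the percentage of observed turn residues that are recovered, and the standard Matthews correlation coefficient. The function name and dictionary keys are illustrative.

```python
from math import sqrt


def turn_prediction_metrics(tp, tn, fp, fn):
    """Accuracy, Qpred, Qobs (all in percent) and MCC from residue counts.

    tp: turn residues predicted as turn      tn: non-turn predicted non-turn
    fp: non-turn residues predicted as turn  fn: turn residues missed
    """
    total = tp + tn + fp + fn
    accuracy = 100.0 * (tp + tn) / total if total else 0.0
    q_pred = 100.0 * tp / (tp + fp) if (tp + fp) else 0.0
    q_obs = 100.0 * tp / (tp + fn) if (tp + fn) else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "q_pred": q_pred, "q_obs": q_obs, "mcc": mcc}


if __name__ == "__main__":
    print(turn_prediction_metrics(tp=50, tn=30, fp=10, fn=10))
```

Reporting Qpred and Qobs alongside accuracy matters because turn residues are a minority class: a predictor that labels every residue as non-turn can reach high accuracy while its Qobs and MCC are both zero.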
Yu, Yi-Kuo; Capra, John A.; Stojmirović, Aleksandar; Landsman, David; Altschul, Stephen F.
2015-01-01
Motivation: DNA and protein patterns are usefully represented by sequence logos. However, the methods for logo generation in common use lack a proper statistical basis, and are non-optimal for recognizing functionally relevant alignment columns. Results: We redefine the information at a logo position as a per-observation multiple alignment log-odds score. Such scores are positive or negative, depending on whether a column’s observations are better explained as arising from relatedness or chance. Within this framework, we propose distinct normalized maximum likelihood and Bayesian measures of column information. We illustrate these measures on High Mobility Group B (HMGB) box proteins and a dataset of enzyme alignments. Particularly in the context of protein alignments, our measures improve the discrimination of biologically relevant positions. Availability and implementation: Our new measures are implemented in an open-source Web-based logo generation program, which is available at http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/logoddslogo/index.html. A stand-alone version of the program is also available from this site. Contact: altschul@ncbi.nlm.nih.gov Supplementary information: Supplementary data are available at Bioinformatics online. PMID:25294922
"Master-Slave" Biological Network Alignment
NASA Astrophysics Data System (ADS)
Ferraro, Nicola; Palopoli, Luigi; Panni, Simona; Rombo, Simona E.
Performing global alignment between protein-protein interaction (PPI) networks of different organisms is important to infer knowledge about conservation across species. Known methods that perform this task operate symmetrically, that is to say, they do not assign a distinct role to the input PPI networks. However, in most cases, the input networks are indeed distinguishable on the basis of how well the corresponding organism is biologically characterized. For well-characterized organisms, the associated PPI network supposedly encodes in a sound manner all the information about their proteins and associated interactions, which is far from being the case for poorly characterized ones. Here we develop a method for global alignment of PPI networks that exploits differences in how well the organisms at hand are characterized. We assume that the PPI network of the better characterized organism (called Master) is used as a fingerprint to guide the alignment process to the second input network (called Slave), so that generated results preferably retain the structural characteristics of the Master network (while using the Slave network). We tested our method, showing that the results it returns are biologically relevant.
A Fast Alignment-Free Approach for De Novo Detection of Protein Conserved Regions
Abnousi, Armen; Broschat, Shira L.; Kalyanaraman, Ananth
2016-01-01
Background Identifying conserved regions in protein sequences is a fundamental operation, occurring in numerous sequence-driven analysis pipelines. It is used as a way to decode domain-rich regions within proteins, to compute protein clusters, to annotate sequence function, and to compute evolutionary relationships among protein sequences. A number of approaches exist for identifying and characterizing protein families based on their domains, and because domains represent conserved portions of a protein sequence, the primary computation involved in protein family characterization is identification of such conserved regions. However, identifying conserved regions from large collections (millions) of protein sequences presents significant challenges. Methods In this paper we present a new, alignment-free method for detecting conserved regions in protein sequences called NADDA (No-Alignment Domain Detection Algorithm). Our method exploits the abundance of exact matching short subsequences (k-mers) to quickly detect conserved regions, and the power of machine learning is used to improve the prediction accuracy of detection. We present a parallel implementation of NADDA using the MapReduce framework and show that our method is highly scalable. Results We have compared NADDA with Pfam and InterPro databases. For known domains annotated by Pfam, accuracy is 83%, sensitivity 96%, and specificity 44%. For sequences with new domains not present in the training set an average accuracy of 63% is achieved when compared to Pfam. A boost in results in comparison with InterPro demonstrates the ability of NADDA to capture conserved regions beyond those present in Pfam. We have also compared NADDA with ADDA and MKDOM2, assuming Pfam as ground-truth. On average NADDA shows comparable accuracy, more balanced sensitivity and specificity, and being alignment-free, is significantly faster. 
Excluding the one-time cost of training, runtimes on a single processor were 49s, 10,566s, and 456s for NADDA, ADDA, and MKDOM2, respectively, for a data set comprised of approximately 2500 sequences. PMID:27552220
Zhang, Gaihua; Su, Zhen
2012-01-01
Work on protein structure prediction is very useful in biological research. To evaluate their accuracy, experimental protein structures or their derived data are used as the 'gold standard'. However, as proteins are dynamic molecular machines with structural flexibility such a standard may be unreliable. To investigate the influence of the structure flexibility, we analysed 3,652 protein structures of 137 unique sequences from 24 protein families. The results showed that (1) the three-dimensional (3D) protein structures were not rigid: the root-mean-square deviation (RMSD) of the backbone Cα of structures with identical sequences was relatively large, with the average of the maximum RMSD from each of the 137 sequences being 1.06 Å; (2) the derived data of the 3D structure was not constant, e.g. the highest ratio of the secondary structure wobble site was 60.69%, with the sequence alignments from structural comparisons of two proteins in the same family sometimes being completely different. Proteins may have several stable conformations and the data derived from resolved structures as a 'gold standard' should be optimized before being utilized as criteria to evaluate the prediction methods, e.g. sequence alignment from structural comparison. Helix/β-sheet transition exists in normal free proteins. The coil ratio of the 3D structure could affect its resolution as determined by X-ray crystallography.
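The RMSD figure quoted above can be reproduced for any pair of structures with a short calculation. The sketch below assumes the two Cα coordinate lists have the same length, are in the same residue order, and have already been superposed; finding the optimal superposition (for example with the Kabsch algorithm) is deliberately left out. The function name is illustrative.

```python
from math import sqrt


def ca_rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) C-alpha coordinates.

    Assumes the structures are already superposed; no fitting is done here.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    if not coords_a:
        raise ValueError("need at least one atom")
    squared = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return sqrt(squared / len(coords_a))


if __name__ == "__main__":
    model_1 = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
    model_2 = [(0.0, 0.5, 0.0), (3.8, 0.0, 0.4), (7.6, -0.3, 0.0)]
    print(ca_rmsd(model_1, model_2))
```

Taking the maximum of this value over all pairs of structures that share one sequence gives a per-sequence figure of the kind averaged in the abstract.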
Multicolor microcontact printing of proteins on nanoporous surface for patterned immunoassay
NASA Astrophysics Data System (ADS)
Ng, Elaine; Gopal, Ashwini; Hoshino, Kazunori; Zhang, Xiaojing
2011-07-01
The large-scale patterning of therapeutic proteins is key to the efficient design, characterization, and production of biologics for cost-effective, high-throughput, point-of-care detection and analysis systems. We demonstrate an efficient method for protein deposition and adsorption on nanoporous silica substrates in specific patterns using micro-contact printing. Multiple color-tagged proteins can be printed through sequential application of this micro-patterning technique. Two groups of experiments were performed. In the first group, the protein stamp was aligned precisely with the printing sites, and the stamp was applied multiple times. Optimal conditions for protein transfer and adsorption were identified using a porous silica thin film with a pore size of 4 nm and a thickness of 30 nm. In the second group, we demonstrate the patterning of two-color rabbit immunoglobulin labeled with fluorescein isothiocyanate and tetramethyl rhodamine isothiocyanate on porous silica substrates with a pore size of 4 nm, a porosity of 57%, and a porous-layer thickness of 30 nm. A pair of protein stamps, with corresponding alignment markings and coupled patterns, were aligned and used to produce a two-colored stamp pattern of proteins on porous silica. Different colored proteins can be applied to exemplify the diverse protein composition within a sample. This method of multicolor microcontact printing can be used to perform a fluorescence-based patterned enzyme-linked immunosorbent assay to detect the presence of various proteins within a sample.
NASA Astrophysics Data System (ADS)
Gong, K.; Fritsch, D.
2018-05-01
Multiple-view stereo satellite imagery has become a valuable data source for digital surface model (DSM) generation and 3D reconstruction. In 2016, a well-organized public multiple-view stereo benchmark for commercial satellite imagery was released by the Johns Hopkins University Applied Physics Laboratory, USA. This benchmark motivates us to explore methods that can generate accurate digital surface models from a large number of high-resolution satellite images. In this paper, we propose a pipeline for processing the benchmark data into digital surface models. As a preprocessing step, we filter all possible image pairs according to incidence angle and capture date. With the selected image pairs, the relative bias-compensated model is applied for relative orientation. After epipolar image pair generation, dense image matching, and triangulation, the 3D point clouds and DSMs are acquired. The DSMs are aligned to a quasi-ground plane by the relative bias-compensated model. We apply a median filter to generate the fused point cloud and DSM. By comparison with the reference LiDAR DSM, the accuracy, completeness, and robustness are evaluated. The results show that the point cloud reconstructs the surface with small structures and that the fused DSM generated by our pipeline is accurate and robust.
Benchmark data for identifying multi-functional types of membrane proteins.
Wan, Shibiao; Mak, Man-Wai; Kung, Sun-Yuan
2016-09-01
Identifying membrane proteins and their multi-functional types is an indispensable yet challenging topic in proteomics and bioinformatics. In this article, we provide data that are used for training and testing Mem-ADSVM (Wan et al., 2016. "Mem-ADSVM: a two-layer multi-label predictor for identifying multi-functional types of membrane proteins" [1]), a two-layer multi-label predictor for predicting multi-functional types of membrane proteins.
Improving Protein Fold Recognition by Deep Learning Networks
NASA Astrophysics Data System (ADS)
Jo, Taeho; Hou, Jie; Eickholt, Jesse; Cheng, Jianlin
2015-12-01
For accurate recognition of protein folds, a deep learning network method (DN-Fold) was developed to predict whether a given query-template protein pair belongs to the same structural fold. The input consisted of protein sequence and structural features extracted from the protein pair. We evaluated the performance of DN-Fold along with 18 different methods on Lindahl’s benchmark dataset and on a large benchmark set extracted from SCOP 1.75 consisting of about one million protein pairs, at three different levels of fold recognition (i.e., protein family, superfamily, and fold) depending on the evolutionary distance between protein sequences. The correct recognition rate of ensembled DN-Fold for Top 1 predictions is 84.5%, 61.5%, and 33.6% and for Top 5 is 91.2%, 76.5%, and 60.7% at family, superfamily, and fold levels, respectively. We also evaluated the performance of single DN-Fold (DN-FoldS), which showed results comparable to ensemble DN-Fold at the family and superfamily levels. Finally, we extended the binary classification problem of fold recognition to a real-value regression task, which also showed promising performance. DN-Fold is freely available through a web server at http://iris.rnet.missouri.edu/dnfold.
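The Top 1 and Top 5 recognition rates quoted above are instances of a top-k hit rate. The sketch below shows how such a rate could be computed from ranked template labels; the data layout (a dict of ranked fold labels per query and a dict of true labels) and the function name are assumptions for illustration, not the DN-Fold evaluation code.

```python
def top_k_rate(ranked_labels, true_labels, k):
    """Fraction of queries whose true label appears among the top k ranked labels.

    ranked_labels: dict mapping query id -> list of labels, best first.
    true_labels:   dict mapping query id -> correct label.
    """
    if not ranked_labels:
        raise ValueError("no queries to evaluate")
    hits = 0
    for query, ranked in ranked_labels.items():
        if true_labels[query] in ranked[:k]:
            hits += 1
    return hits / len(ranked_labels)

if __name__ == "__main__":
    ranked = {
        "q1": ["a", "b", "c"],
        "q2": ["b", "a", "c"],
        "q3": ["c", "b", "a"],
        "q4": ["c", "b", "a"],
    }
    truth = {"q1": "a", "q2": "a", "q3": "a", "q4": "c"}
    for k in (1, 2, 3):
        print(k, top_k_rate(ranked, truth, k))
```

The rate can only grow as k increases, which matches the pattern of the Top 5 figures exceeding the Top 1 figures at every level.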
FASMA: a service to format and analyze sequences in multiple alignments.
Costantini, Susan; Colonna, Giovanni; Facchiano, Angelo M
2007-12-01
Multiple sequence alignments are successfully applied in many studies for understanding the structural and functional relations among single nucleic acids and protein sequences as well as whole families. Because of the rapid growth of sequence databases, multiple sequence alignments can often be very large and difficult to visualize and analyze. We offer a new service aimed to visualize and analyze the multiple alignments obtained with different external algorithms, with new features useful for the comparison of the aligned sequences as well as for the creation of a final image of the alignment. The service is named FASMA and is available at http://bioinformatica.isa.cnr.it/FASMA/.
Jefferson, Emily R.; Walsh, Thomas P.; Roberts, Timothy J.; Barton, Geoffrey J.
2007-01-01
SNAPPI-DB, a high performance database of Structures, iNterfaces and Alignments of Protein–Protein Interactions, and its associated Java Application Programming Interface (API) is described. SNAPPI-DB contains structural data, down to the level of atom co-ordinates, for each structure in the Protein Data Bank (PDB) together with associated data including SCOP, CATH, Pfam, SWISSPROT, InterPro, GO terms, Protein Quaternary Structures (PQS) and secondary structure information. Domain–domain interactions are stored for multiple domain definitions and are classified by their Superfamily/Family pair and interaction interface. Each set of classified domain–domain interactions has an associated multiple structure alignment for each partner. The API facilitates data access via PDB entries, domains and domain–domain interactions. Rapid development, fast database access and the ability to perform advanced queries without the requirement for complex SQL statements are provided via an object oriented database and the Java Data Objects (JDO) API. SNAPPI-DB contains many features which are not available in other databases of structural protein–protein interactions. It has been applied in three studies on the properties of protein–protein interactions and is currently being employed to train a protein–protein interaction predictor and a functional residue predictor. The database, API and manual are available for download at: . PMID:17202171
2006-09-01
exploited and that we get best value for money from our investment. We announced in the Strategy that we had set in place an evidence-based peer review of...currently meets the Department’s needs. The study was also to set a benchmark for future regular reviews of the programme to ensure quality, value for...The level of resources devoted to such research should be seen in the context of the overall value of expenditure flowing from such decisions. The
Creating a meaningful infection control program: one home healthcare agency's lessons.
Poff, Renee McCoy; Browning, Sarah Via
2014-03-01
Creating a meaningful infection control program in the home care setting proved to be challenging for agency leaders of one hospital-based home healthcare agency. Challenges arose when agency leaders provided infection control (IC) data to the hospital's IC Committee. The IC Section Chief asked for national benchmark comparisons to align home healthcare reporting to that of the hospital level. At that point, it was evident that the home healthcare IC program lacked definition and structure. The purpose of this article is to share how one agency built a meaningful IC program.
Overkamp, Wout; Beilharz, Katrin; Detert Oude Weme, Ruud; Solopova, Ana; Karsens, Harma; Kovács, Ákos T.; Kok, Jan
2013-01-01
Green fluorescent protein (GFP) offers efficient ways of visualizing promoter activity and protein localization in vivo, and many different variants are currently available to study bacterial cell biology. Which of these variants is best suited for a certain bacterial strain, goal, or experimental condition is not clear. Here, we have designed and constructed two “superfolder” GFPs with codon adaptation specifically for Bacillus subtilis and Streptococcus pneumoniae and have benchmarked them against five other previously available variants of GFP in B. subtilis, S. pneumoniae, and Lactococcus lactis, using promoter-gfp fusions. Surprisingly, the best-performing GFP under our experimental conditions in B. subtilis was the one codon optimized for S. pneumoniae and vice versa. The data and tools described in this study will be useful for cell biology studies in low-GC-rich Gram-positive bacteria. PMID:23956387
Goonesekere, Nalin Cw
2009-01-01
The large numbers of protein sequences generated by whole genome sequencing projects require rapid and accurate methods of annotation. The detection of homology through computational sequence analysis is a powerful tool in determining the complex evolutionary and functional relationships that exist between proteins. Homology search algorithms employ amino acid substitution matrices to detect similarity between protein sequences. The substitution matrices in common use today are constructed using sequences aligned without reference to protein structure. Here we present amino acid substitution matrices constructed from the alignment of a large number of protein domain structures from the structural classification of proteins (SCOP) database. We show that when incorporated into the homology search algorithms BLAST and PSI-BLAST, the structure-based substitution matrices enhance the efficacy of detecting remote homologs.
Query-seeded iterative sequence similarity searching improves selectivity 5–20-fold
Li, Weizhong; Lopez, Rodrigo
2017-01-01
Iterative similarity search programs, like psiblast, jackhmmer, and psisearch, are much more sensitive than pairwise similarity search methods like blast and ssearch because they build a position specific scoring model (a PSSM or HMM) that captures the pattern of sequence conservation characteristic of a protein family. But models are subject to contamination; once an unrelated sequence has been added to the model, homologs of the unrelated sequence will also produce high scores, and the model can diverge from the original protein family. Examination of alignment errors during psiblast PSSM contamination suggested a simple strategy for dramatically reducing PSSM contamination. psiblast PSSMs are built from the query-based multiple sequence alignment (MSA) implied by the pairwise alignments between the query model (PSSM, HMM) and the subject sequences in the library. When the original query sequence residues are inserted into gapped positions in the aligned subject sequence, the resulting PSSM rarely produces alignment over-extensions or alignments to unrelated sequences. This simple step, which tends to anchor the PSSM to the original query sequence and slightly increase target percent identity, can reduce the frequency of false-positive alignments more than 20-fold compared with psiblast and jackhmmer, with little loss in search sensitivity. PMID:27923999
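The query-seeding step, inserting query residues into gapped positions of the aligned subject, can be illustrated on a single pairwise alignment. This is a minimal sketch under stated assumptions: the alignment is given as two equal-length gapped strings, `-` is the gap character, and columns where the query itself has a gap are dropped because a query-based MSA has one column per query residue. The function name is hypothetical and this is not the psisearch code.

```python
def seed_subject_with_query(aligned_query, aligned_subject, gap="-"):
    """Return the subject row of a query-based MSA with gaps filled from the query.

    Columns where the query has a gap (subject insertions) are dropped,
    so the result has exactly one character per query residue.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned strings must have equal length")
    seeded = []
    for q, s in zip(aligned_query, aligned_subject):
        if q == gap:
            continue  # subject insertion: no column in a query-based MSA
        seeded.append(q if s == gap else s)
    return "".join(seeded)

if __name__ == "__main__":
    query = "MKTA-YIA"
    subject = "MR-AGY-A"
    print(seed_subject_with_query(query, subject))
```

Rows built this way contribute the query's own residue wherever the subject was gapped, which is one way to picture how the resulting PSSM stays anchored to the original query.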
Ramachandran analysis of conserved glycyl residues in homologous proteins of known structure.
Lakshmi, Balasubramanian; Sinduja, Chandrasekaran; Archunan, Govind; Srinivasan, Narayanaswamy
2014-06-01
High conservation of glycyl residues in homologous proteins is fairly frequent. It is commonly understood that glycine tends to be highly conserved either because of its unique Ramachandran angles or to avoid steric clash that would arise with a larger side chain. Using a database of aligned 3D structures of homologous proteins we identified conserved Gly in 288 alignment positions from 85 families. Ninety-six of these alignment positions correspond to conserved Gly residue with (φ, ψ) values allowed for non-glycyl residues. Reasons for this observation were investigated by in-silico mutation of these glycyl residues to Ala. We found in 94% of the cases a short contact exists between the C(β) atom of the introduced Ala with the atoms which are often distant in the primary structure. This suggests the lack of space even for a short side chain thereby explaining high conservation of glycyl residues even when they adopt (φ, ψ) values allowed for Ala. In 189 alignment positions, the conserved glycyl residues adopt (φ, ψ) values which are disallowed for Ala. In-silico mutation of these Gly residues to Ala almost always results in steric hindrance involving C(β) atom of Ala as one would expect by comparing Ramachandran maps for Ala and Gly. Rare occurrence of the disallowed glycyl conformations even in ultrahigh resolution protein structures are accompanied by short contacts in the crystal structures and such disallowed conformations are not conserved in the homologues. These observations raise the doubt on the accuracy of such glycyl conformations in proteins. © 2014 The Protein Society.
Genetic algorithms for protein threading.
Yadgari, J; Amir, A; Unger, R
1998-01-01
Despite many years of effort, a direct prediction of protein structure from sequence is still not possible. As a result, in the last few years researchers have started to address the "inverse folding problem": identifying and aligning a sequence to the fold with which it is most compatible, a process known as "threading". In two meetings in which protein folding predictions were objectively evaluated, it became clear that threading as a concept promises a real breakthrough, but that much improvement is still needed in the technique itself. Threading is an NP-hard problem, and thus no general polynomial solution can be expected. Still, a practical approach with demonstrated ability to find optimal solutions in many cases, and acceptable solutions in other cases, is needed. We applied the technique of Genetic Algorithms in order to significantly improve the ability of threading algorithms to find the optimal alignment of a sequence to a structure, i.e. the alignment with the minimum free energy. A major advance reported here is the design of a representation of the threading alignment as a string of fixed length. With this representation, validation of alignments and genetic operators are effectively implemented. Appropriate data structures and parameters have been selected. It is shown that Genetic Algorithm threading is effective and is able to find the optimal alignment in a few test cases. Furthermore, the described algorithm is shown to perform well even without pre-definition of core elements. Existing threading methods are dependent on such constraints to make their calculations feasible. But the concept of core elements is inherently arbitrary and should be avoided if possible. While a rigorous proof is hard to provide as yet, we present indications that Genetic Algorithm threading is indeed capable of consistently finding good solutions of full alignments in search spaces of size up to 10^70.
Yan, Yumeng; Tao, Huanyu; Huang, Sheng-You
2018-05-26
A major subclass of protein-protein interactions is formed by homo-oligomers with certain symmetry. Therefore, computational modeling of the symmetric protein complexes is important for understanding the molecular mechanism of related biological processes. Although several symmetric docking algorithms have been developed for Cn symmetry, few docking servers have been proposed for Dn symmetry. Here, we present HSYMDOCK, a web server of our hierarchical symmetric docking algorithm that supports both Cn and Dn symmetry. The HSYMDOCK server was extensively evaluated on three benchmarks of symmetric protein complexes, including the 20 CASP11-CAPRI30 homo-oligomer targets, the symmetric docking benchmark of 213 Cn targets and 35 Dn targets, and a nonredundant test set of 55 transmembrane proteins. It was shown that HSYMDOCK obtained a significantly better performance than other similar docking algorithms. The server supports both sequence and structure inputs for the monomer/subunit. Users have an option to provide the symmetry type of the complex, or the server can predict the symmetry type automatically. The docking process is fast and on average consumes 10∼20 min for a docking job. The HSYMDOCK web server is available at http://huanglab.phys.hust.edu.cn/hsymdock/.
Multi-Label Learning via Random Label Selection for Protein Subcellular Multi-Locations Prediction.
Wang, Xiao; Li, Guo-Zheng
2013-03-12
Prediction of protein subcellular localization is an important but challenging problem, particularly when proteins may simultaneously exist at, or move between, two or more different subcellular location sites. Most of the existing protein subcellular localization methods are only used to deal with the single-location proteins. In the past few years, only a few methods have been proposed to tackle proteins with multiple locations. However, they only adopt a simple strategy, that is, transforming the multi-location proteins to multiple proteins with single location, which doesn't take correlations among different subcellular locations into account. In this paper, a novel method named RALS (multi-label learning via RAndom Label Selection), is proposed to learn from multi-location proteins in an effective and efficient way. Through five-fold cross validation test on a benchmark dataset, we demonstrate our proposed method with consideration of label correlations obviously outperforms the baseline BR method without consideration of label correlations, indicating correlations among different subcellular locations really exist and contribute to improvement of prediction performance. Experimental results on two benchmark datasets also show that our proposed methods achieve significantly higher performance than some other state-of-the-art methods in predicting subcellular multi-locations of proteins. The prediction web server is available at http://levis.tongji.edu.cn:8080/bioinfo/MLPred-Euk/ for the public usage.
Fu, Kun; Jin, Junqi; Cui, Runpeng; Sha, Fei; Zhang, Changshui
2017-12-01
Recent progress on automatic generation of image captions has shown that it is possible to describe the most salient information conveyed by images with accurate and meaningful sentences. In this paper, we propose an image captioning system that exploits the parallel structures between images and sentences. In our model, the process of generating the next word, given the previously generated ones, is aligned with the visual perception experience where the attention shifts among the visual regions-such transitions impose a thread of ordering in visual perception. This alignment characterizes the flow of latent meaning, which encodes what is semantically shared by both the visual scene and the text description. Our system also makes another novel modeling contribution by introducing scene-specific contexts that capture higher-level semantic information encoded in an image. The contexts adapt language models for word generation to specific scene types. We benchmark our system and contrast to published results on several popular datasets, using both automatic evaluation metrics and human evaluation. We show that either region-based attention or scene-specific contexts improves systems without those components. Furthermore, combining these two modeling ingredients attains the state-of-the-art performance.
Paulovich, Amanda G.; Billheimer, Dean; Ham, Amy-Joan L.; Vega-Montoto, Lorenzo; Rudnick, Paul A.; Tabb, David L.; Wang, Pei; Blackman, Ronald K.; Bunk, David M.; Cardasis, Helene L.; Clauser, Karl R.; Kinsinger, Christopher R.; Schilling, Birgit; Tegeler, Tony J.; Variyath, Asokan Mulayath; Wang, Mu; Whiteaker, Jeffrey R.; Zimmerman, Lisa J.; Fenyo, David; Carr, Steven A.; Fisher, Susan J.; Gibson, Bradford W.; Mesri, Mehdi; Neubert, Thomas A.; Regnier, Fred E.; Rodriguez, Henry; Spiegelman, Cliff; Stein, Stephen E.; Tempst, Paul; Liebler, Daniel C.
2010-01-01
Optimal performance of LC-MS/MS platforms is critical to generating high quality proteomics data. Although individual laboratories have developed quality control samples, there is no widely available performance standard of biological complexity (and associated reference data sets) for benchmarking of platform performance for analysis of complex biological proteomes across different laboratories in the community. Individual preparations of the yeast Saccharomyces cerevisiae proteome have been used extensively by laboratories in the proteomics community to characterize LC-MS platform performance. The yeast proteome is uniquely attractive as a performance standard because it is the most extensively characterized complex biological proteome and the only one associated with several large scale studies estimating the abundance of all detectable proteins. In this study, we describe a standard operating protocol for large scale production of the yeast performance standard and offer aliquots to the community through the National Institute of Standards and Technology where the yeast proteome is under development as a certified reference material to meet the long term needs of the community. Using a series of metrics that characterize LC-MS performance, we provide a reference data set demonstrating typical performance of commonly used ion trap instrument platforms in expert laboratories; the results provide a basis for laboratories to benchmark their own performance, to improve upon current methods, and to evaluate new technologies. Additionally, we demonstrate how the yeast reference, spiked with human proteins, can be used to benchmark the power of proteomics platforms for detection of differentially expressed proteins at different levels of concentration in a complex matrix, thereby providing a metric to evaluate and minimize preanalytical and analytical variation in comparative proteomics experiments. PMID:19858499
The effect of patterning options on embedded memory cells in logic technologies at iN10 and iN7
NASA Astrophysics Data System (ADS)
Appeltans, Raf; Weckx, Pieter; Raghavan, Praveen; Kim, Ryoung-Han; Kar, Gouri Sankar; Furnémont, Arnaud; Van der Perre, Liesbet; Dehaene, Wim
2017-03-01
Static Random Access Memory (SRAM) cells are used together with logic standard cells as the benchmark to develop the process flow for new logic technologies. In order to achieve successful integration of Spin-Transfer Torque Magnetic Random Access Memory (STT-MRAM) as an area-efficient higher-level embedded cache, it also needs to be included as a benchmark. The simple cell structure of STT-MRAM brings extra patterning challenges to achieve high density. The two memory types are compared in terms of minimum area and critical design rules in both the iN10 and iN7 nodes, with an extra focus on patterning options in iN7. The use of Self-Aligned Quadruple Patterning (SAQP) mandrel and spacer engineering, as well as multi-level vias, is explored. These patterning options result in large area gains for the STT-MRAM cell and moreover determine which cell variant is the smallest.
Protein docking prediction using predicted protein-protein interface.
Li, Bin; Kihara, Daisuke
2012-01-10
Many important cellular processes are carried out by protein complexes. To provide physical pictures of interacting proteins, many computational protein-protein prediction methods have been developed in the past. However, it is still difficult to identify the correct docking complex structure within top ranks among alternative conformations. We present a novel protein docking algorithm that utilizes imperfect protein-protein binding interface prediction for guiding protein docking. Since the accuracy of protein binding site prediction varies depending on cases, the challenge is to develop a method which does not deteriorate but improves docking results by using a binding site prediction which may not be 100% accurate. The algorithm, named PI-LZerD (using Predicted Interface with Local 3D Zernike descriptor-based Docking algorithm), is based on a pair wise protein docking prediction algorithm, LZerD, which we have developed earlier. PI-LZerD starts from performing docking prediction using the provided protein-protein binding interface prediction as constraints, which is followed by the second round of docking with updated docking interface information to further improve docking conformation. Benchmark results on bound and unbound cases show that PI-LZerD consistently improves the docking prediction accuracy as compared with docking without using binding site prediction or using the binding site prediction as post-filtering. We have developed PI-LZerD, a pairwise docking algorithm, which uses imperfect protein-protein binding interface prediction to improve docking accuracy. PI-LZerD consistently showed better prediction accuracy over alternative methods in the series of benchmark experiments including docking using actual docking interface site predictions as well as unbound docking cases.
A molecular-field-based similarity study of non-nucleoside HIV-1 reverse transcriptase inhibitors
NASA Astrophysics Data System (ADS)
Mestres, Jordi; Rohrer, Douglas C.; Maggiora, Gerald M.
1999-01-01
This article describes a molecular-field-based similarity method for aligning molecules by matching their steric and electrostatic fields and an application of the method to the alignment of three structurally diverse non-nucleoside HIV-1 reverse transcriptase inhibitors. A brief description of the method, as implemented in the program MIMIC, is presented, including a discussion of pairwise and multi-molecule similarity-based matching. The application provides an example that illustrates how relative binding orientations of molecules can be determined in the absence of detailed structural information on their target protein. In the particular system studied here, availability of the X-ray crystal structures of the respective ligand-protein complexes provides a means for constructing an 'experimental model' of the relative binding orientations of the three inhibitors. The experimental model is derived by using MIMIC to align the steric fields of the three protein P66 subunit main chains, producing an overlay with a 1.41 Å average rms distance between the corresponding Cα's in the three chains. The inter-chain residue similarities for the backbone structures show that the main-chain conformations are conserved in the region of the inhibitor-binding site, with the major deviations located primarily in the 'finger' and RNase H regions. The resulting inhibitor structure overlay provides an experimental-based model that can be used to evaluate the quality of the direct a priori inhibitor alignment obtained using MIMIC. It is found that the 'best' pairwise alignments do not always correspond to the experimental model alignments. Therefore, simply combining the best pairwise alignments will not necessarily produce the optimal multi-molecule alignment. However, the best simultaneous three-molecule alignment was found to reproduce the experimental inhibitor alignment model. 
A pairwise consistency index has been derived which gauges the quality of combining the pairwise alignments and aids in efficiently forming the optimal multi-molecule alignment analysis. Two post-alignment procedures are described that provide information on feature-based and field-based pharmacophoric patterns. The former corresponds to traditional pharmacophore models and is derived from the contribution of individual atoms to the total similarity. The latter is based on molecular regions rather than atoms and is constructed by computing the percent contribution to the similarity of individual points in a regular lattice surrounding the molecules, which when contoured and colored visually depict regions of highly conserved similarity. A discussion of how the information provided by each of the procedures is useful in drug design is also presented.
Vertical decomposition with Genetic Algorithm for Multiple Sequence Alignment
2011-01-01
Background Many Bioinformatics studies begin with a multiple sequence alignment as the foundation for their research. This is because multiple sequence alignment can be a useful technique for studying molecular evolution and analyzing sequence structure relationships. Results In this paper, we have proposed a Vertical Decomposition with Genetic Algorithm (VDGA) for Multiple Sequence Alignment (MSA). In VDGA, we divide the sequences vertically into two or more subsequences, and then solve them individually using a guide tree approach. Finally, we combine all the subsequences to generate a new multiple sequence alignment. This technique is applied on the solutions of the initial generation and of each child generation within VDGA. We have used two mechanisms to generate an initial population in this research: the first mechanism is to generate guide trees with randomly selected sequences and the second is shuffling the sequences inside such trees. Two different genetic operators have been implemented with VDGA. To test the performance of our algorithm, we have compared it with existing well-known methods, namely PRRP, CLUSTALX, DIALIGN, HMMT, SB_PIMA, ML_PIMA, MULTALIGN, and PILEUP8, and also other methods, based on Genetic Algorithms (GA), such as SAGA, MSA-GA and RBT-GA, by solving a number of benchmark datasets from BAliBase 2.0. Conclusions The experimental results showed that the VDGA with three vertical divisions was the most successful variant for most of the test cases in comparison to other divisions considered with VDGA. The experimental results also confirmed that VDGA outperformed the other methods considered in this research. PMID:21867510
Sadygov, Rovshan G; Maroto, Fernando Martin; Hühmer, Andreas F R
2006-12-15
We present an algorithmic approach to align three-dimensional chromatographic surfaces of LC-MS data of complex mixture samples. The approach consists of two steps. In the first step, we prealign chromatographic profiles: two-dimensional projections of chromatographic surfaces. This is accomplished by correlation analysis using fast Fourier transforms. In this step, a temporal offset that maximizes the overlap and dot product between two chromatographic profiles is determined. In the second step, the algorithm generates correlation matrix elements between full mass scans of the reference and sample chromatographic surfaces. The temporal offset from the first step indicates a range of the mass scans that are possibly correlated, then the correlation matrix is calculated only for these mass scans. The correlation matrix carries information on highly correlated scans, but it does not itself determine the scan or time alignment. Alignment is determined as a path in the correlation matrix that maximizes the sum of the correlation matrix elements. The computational complexity of the optimal path generation problem is reduced by the use of dynamic programming. The program produces time-aligned surfaces. The use of the temporal offset from the first step in the second step reduces the computation time for generating the correlation matrix and speeds up the process. The algorithm has been implemented in a program, ChromAlign, developed in C++ language for the .NET2 environment in WINDOWS XP. In this work, we demonstrate the applications of ChromAlign to alignment of LC-MS surfaces of several datasets: a mixture of known proteins, samples from digests of surface proteins of T-cells, and samples prepared from digests of cerebrospinal fluid. ChromAlign accurately aligns the LC-MS surfaces we studied. In these examples, we discuss various aspects of the alignment by ChromAlign, such as constant time axis shifts and warping of chromatographic surfaces.
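The first step described above, finding the temporal offset that maximizes the dot product between two chromatographic profiles, can be sketched as follows. The abstract states that ChromAlign uses fast Fourier transforms for this; the example below swaps in a direct search over shifts, which gives the same maximizing offset for small inputs but is slower. The function name and the `max_shift` parameter are assumptions for illustration.

```python
def best_offset(reference, sample, max_shift):
    """Return the shift s that maximizes sum(reference[i] * sample[i + s]).

    Direct search over shifts in [-max_shift, max_shift]; an FFT-based
    cross-correlation would compute the same scores more efficiently.
    """
    best_shift, best_score = 0, float("-inf")
    for shift in range(-max_shift, max_shift + 1):
        score = 0.0
        for i, r in enumerate(reference):
            j = i + shift
            if 0 <= j < len(sample):
                score += r * sample[j]
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift

if __name__ == "__main__":
    ref = [0, 0, 1, 5, 1, 0, 0, 0]
    smp = [0, 0, 0, 0, 1, 5, 1, 0]  # same peak, two scans later
    print(best_offset(ref, smp, max_shift=3))
```

In the second step, an offset like this would restrict which scan pairs need a correlation value, so the correlation matrix is filled only in a band around the shifted diagonal.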
Sequence harmony: detecting functional specificity from alignments
Feenstra, K. Anton; Pirovano, Walter; Krab, Klaas; Heringa, Jaap
2007-01-01
Multiple sequence alignments are often used for the identification of key specificity-determining residues within protein families. We present a web server implementation of the Sequence Harmony (SH) method previously introduced. SH accurately detects subfamily specific positions from a multiple alignment by scoring compositional differences between subfamilies, without imposing conservation. The SH web server allows a quick selection of subtype specific sites from a multiple alignment given a subfamily grouping. In addition, it allows the predicted sites to be directly mapped onto a protein structure and displayed. We demonstrate the use of the SH server using the family of plant mitochondrial alternative oxidases (AOX). In addition, we illustrate the usefulness of combining sequence and structural information by showing that the predicted sites are clustered into a few distinct regions in an AOX homology model. The SH web server can be accessed at www.ibi.vu.nl/programs/seqharmwww. PMID:17584793
GATA: A graphic alignment tool for comparative sequence analysis
DOE Office of Scientific and Technical Information (OSTI.GOV)
Nix, David A.; Eisen, Michael B.
2005-01-01
Several problems exist with current methods used to align DNA sequences for comparative sequence analysis. Most dynamic programming algorithms assume that conserved sequence elements are collinear. This assumption appears valid when comparing orthologous protein coding sequences. Functional constraints on proteins provide strong selective pressure against sequence inversions, and minimize sequence duplications and feature shuffling. For non-coding sequences this collinearity assumption is often invalid. For example, enhancers contain clusters of transcription factor binding sites that change in number, orientation, and spacing during evolution yet the enhancer retains its activity. Dotplot analysis is often used to estimate non-coding sequence relatedness. Yet dot plots do not actually align sequences and thus cannot account well for base insertions or deletions. Moreover, they lack an adequate statistical framework for comparing sequence relatedness and are limited to pairwise comparisons. Lastly, dot plots and dynamic programming text outputs fail to provide an intuitive means for visualizing DNA alignments.
REPPER—repeats and their periodicities in fibrous proteins
Gruber, Markus; Söding, Johannes; Lupas, Andrei N.
2005-01-01
REPPER (REPeats and their PERiodicities) is an integrated server that detects and analyzes regions with short gapless repeats in protein sequences or alignments. It finds periodicities by Fourier Transform (FTwin) and internal similarity analysis (REPwin). FTwin assigns numerical values to amino acids that reflect certain properties, for instance hydrophobicity, and gives information on corresponding periodicities. REPwin uses self-alignments and displays repeats that reveal significant internal similarities. Both programs use a sliding window to ensure that different periodic regions within the same protein are detected independently. FTwin and REPwin are complemented by secondary structure prediction (PSIPRED) and coiled coil prediction (COILS), making the server a versatile analysis tool for sequences of fibrous proteins. REPPER is available at . PMID:15980460
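The idea behind the Fourier-based analysis described above (assign a number to each residue, then look for periodicities in the resulting signal) can be sketched as follows. This is not the FTwin implementation: the binary hydrophobicity scale `HYDROPHOBIC`, the function names, and the toy sequence are all assumptions for illustration, and the sliding window that the abstract mentions is omitted.

```python
import cmath

# Assumption: a crude binary hydrophobicity scale, not the scale used by FTwin.
HYDROPHOBIC = set("AILMFVW")


def periodicity_power(sequence, period):
    """Squared magnitude of the Fourier component of the hydrophobicity signal at `period`."""
    values = [1.0 if aa in HYDROPHOBIC else 0.0 for aa in sequence]
    mean = sum(values) / len(values)
    # Subtract the mean so that overall composition does not dominate the spectrum.
    total = sum((v - mean) * cmath.exp(-2j * cmath.pi * k / period)
                for k, v in enumerate(values))
    return abs(total) ** 2


def dominant_period(sequence, periods):
    """Return the candidate period with the largest spectral power."""
    return max(periods, key=lambda p: periodicity_power(sequence, p))


# Toy sequence (made up): one hydrophobic residue every seven positions.
toy = "LKEKSEK" * 4
period = dominant_period(toy, range(2, 11))
```

Applying the same functions to successive windows of a long sequence would let different periodic regions be detected independently, which is the purpose of the sliding window described in the abstract.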
The limits of protein sequence comparison?
Pearson, William R; Sierk, Michael L
2010-01-01
Modern sequence alignment algorithms are used routinely to identify homologous proteins, proteins that share a common ancestor. Homologous proteins always share similar structures and often have similar functions. Over the past 20 years, sequence comparison has become both more sensitive, largely because of profile-based methods, and more reliable, because of more accurate statistical estimates. As sequence and structure databases become larger, and comparison methods become more powerful, reliable statistical estimates will become even more important for distinguishing similarities that are due to homology from those that are due to analogy (convergence). The newest sequence alignment methods are more sensitive than older methods, but more accurate statistical estimates are needed for their full power to be realized. PMID:15919194
Method for protein structure alignment
Blankenbecler, Richard; Ohlsson, Mattias; Peterson, Carsten; Ringner, Markus
2005-02-22
This invention provides a method for protein structure alignment. More particularly, the present invention provides a method for identification, classification and prediction of protein structures. The present invention involves two key ingredients. First, an energy or cost function formulation of the problem simultaneously in terms of binary (Potts) assignment variables and real-valued atomic coordinates. Second, a minimization of the energy or cost function by an iterative method, where in each iteration (1) a mean field method is employed for the assignment variables and (2) exact rotation and/or translation of atomic coordinates is performed, weighted with the corresponding assignment variables.
Establishing homologies in protein sequences
NASA Technical Reports Server (NTRS)
Dayhoff, M. O.; Barker, W. C.; Hunt, L. T.
1983-01-01
Computer-based statistical techniques used to determine homologies between proteins occurring in different species are reviewed. The technique is based on comparison of two protein sequences, either by relating all segments of a given length in one sequence to all segments of the second or by finding the best alignment of the two sequences. Approaches discussed include selection using printed tabulations, identification of very similar sequences, and computer searches of a database. The use of the SEARCH, RELATE, and ALIGN programs (Dayhoff, 1979) is explained; sample data are presented in graphs, diagrams, and tables and the construction of scoring matrices is considered.
Evolutionary profiles from the QR factorization of multiple sequence alignments
Sethi, Anurag; O'Donoghue, Patrick; Luthey-Schulten, Zaida
2005-01-01
We present an algorithm to generate complete evolutionary profiles that represent the topology of the molecular phylogenetic tree of the homologous group. The method, based on the multidimensional QR factorization of numerically encoded multiple sequence alignments, removes redundancy from the alignments and orders the protein sequences by increasing linear dependence, resulting in the identification of a minimal basis set of sequences that spans the evolutionary space of the homologous group of proteins. We observe a general trend that these smaller, more evolutionarily balanced profiles have comparable and, in many cases, better performance in database searches than conventional profiles containing hundreds of sequences, constructed in an iterative and computationally intensive procedure. For more diverse families or superfamilies, with sequence identity <30%, structural alignments, based purely on the geometry of the protein structures, provide better alignments than pure sequence-based methods. Merging the structure and sequence information allows the construction of accurate profiles for distantly related groups. These structure-based profiles outperformed other sequence-based methods for finding distant homologs and were used to identify a putative class II cysteinyl-tRNA synthetase (CysRS) in several archaea that eluded previous annotation studies. Phylogenetic analysis showed the putative class II CysRSs to be a monophyletic group and homology modeling revealed a constellation of active site residues similar to that in the known class I CysRS. PMID:15741270
Comparative modeling without implicit sequence alignments.
Kolinski, Andrzej; Gront, Dominik
2007-10-01
The number of known protein sequences is about a thousand times larger than the number of experimentally solved 3D structures. For more than half of the protein sequences a close or distant structural analog could be identified. The key starting point in classical comparative modeling is to generate the best possible sequence alignment with a template or templates. With decreasing sequence similarity, the number of errors in the alignments increases, and these errors are the main cause of the decreasing accuracy of the molecular models generated. Here we propose a new approach to comparative modeling, which does not require the implicit alignment: the model building phase explores geometric, evolutionary and physical properties of a template (or templates). The proposed method requires prior identification of a template, although the initial sequence alignment is ignored. The model is built using a very efficient reduced representation search engine, CABS, to find the best possible superposition of the query protein onto the template represented as a 3D multi-featured scaffold. The criteria used include: sequence similarity, predicted secondary structure consistency, local geometric features and hydrophobicity profile. For more difficult cases, the new method qualitatively outperforms existing schemes of comparative modeling. The algorithm unifies de novo modeling, 3D threading and sequence-based methods. The main idea is general and could be easily combined with other efficient modeling tools such as Rosetta, UNRES and others.
Finding Protein and Nucleotide Similarities with FASTA
Pearson, William R.
2016-01-01
The FASTA programs provide a comprehensive set of rapid similarity searching tools ( fasta36, fastx36, tfastx36, fasty36, tfasty36), similar to those provided by the BLAST package, as well as programs for slower, optimal, local and global similarity searches ( ssearch36, ggsearch36) and for searching with short peptides and oligonucleotides ( fasts36, fastm36). The FASTA programs use an empirical strategy for estimating statistical significance that accommodates a range of similarity scoring matrices and gap penalties, improving alignment boundary accuracy and search sensitivity (Unit 3.5). The FASTA programs can produce “BLAST-like” alignment and tabular output, for ease of integration into existing analysis pipelines, and can search small, representative databases, and then report results for a larger set of sequences, using links from the smaller dataset. The FASTA programs work with a wide variety of database formats, including mySQL and postgreSQL databases (Unit 9.4). The programs also provide a strategy for integrating domain and active site annotations into alignments and highlighting the mutational state of functionally critical residues. These protocols describe how to use the FASTA programs to characterize protein and DNA sequences, using protein:protein, protein:DNA, and DNA:DNA comparisons. PMID:27010337
Calabrò, Emanuele; Magazù, Salvatore
2017-01-01
The aim of this article was to study the effects of mobile phone electromagnetic waves at 1750 MHz on the Amide I and Amide II vibration bands of some proteins in bidistilled water solution by means of Fourier transform infrared (FTIR) spectroscopy and Fourier self-deconvolution (FSD) analysis. The proteins used for the experiment were hemoglobin, myoglobin, bovine serum albumin and lysozyme. The exposure system consisted of microwaves (MWs) emitted by an operational mobile phone at a frequency of 1750 MHz and an average power density of 1 W/m². Exposed and control samples were analyzed using FTIR spectroscopy and FSD analysis. The main result was that the Amide I band of the proteins increased significantly (p < 0.05) after 4 h of exposure to MWs, whereas the Amide II band did not change significantly. This result can be explained by assuming that the α-helix structure of the proteins aligned itself with the direction of the electromagnetic field, due to the alignment of the C=O stretching and N-H bending ligands that are oriented along the α-helix axis and give rise to the Amide I mode.
Finding Protein and Nucleotide Similarities with FASTA.
Pearson, William R
2016-03-24
The FASTA programs provide a comprehensive set of rapid similarity searching tools (fasta36, fastx36, tfastx36, fasty36, tfasty36), similar to those provided by the BLAST package, as well as programs for slower, optimal, local, and global similarity searches (ssearch36, ggsearch36), and for searching with short peptides and oligonucleotides (fasts36, fastm36). The FASTA programs use an empirical strategy for estimating statistical significance that accommodates a range of similarity scoring matrices and gap penalties, improving alignment boundary accuracy and search sensitivity. The FASTA programs can produce "BLAST-like" alignment and tabular output, for ease of integration into existing analysis pipelines, and can search small, representative databases, and then report results for a larger set of sequences, using links from the smaller dataset. The FASTA programs work with a wide variety of database formats, including mySQL and postgreSQL databases. The programs also provide a strategy for integrating domain and active site annotations into alignments and highlighting the mutational state of functionally critical residues. These protocols describe how to use the FASTA programs to characterize protein and DNA sequences, using protein:protein, protein:DNA, and DNA:DNA comparisons. Copyright © 2016 John Wiley & Sons, Inc.
Automatic classification of protein structures relying on similarities between alignments
2012-01-01
Background: Identification of protein structural cores requires isolation of sets of proteins that all share the same subset of structural motifs. In the context of an ever growing number of available 3D protein structures, standard and automatic clustering algorithms require adaptations so as to allow for efficient identification of such sets of proteins. Results: When considering a pair of 3D structures, they are stated as similar or not according to the local similarities of their matching substructures in a structural alignment. This binary relation can be represented in a graph of similarities where a node represents a 3D protein structure and an edge states that two 3D protein structures are similar. Therefore, classifying proteins into structural families can be viewed as a graph clustering task. Unfortunately, because such a graph encodes only pairwise similarity information, clustering algorithms may include in the same cluster a subset of 3D structures that do not share a common substructure. In order to overcome this drawback we first define a ternary similarity on a triple of 3D structures as a constraint to be satisfied by the graph of similarities. Such a ternary constraint takes into account similarities between pairwise alignments, so as to ensure that the three involved protein structures do have some common substructure. We propose a modification algorithm that eliminates edges from the original graph of similarities and gives a reduced graph in which no ternary constraints are violated. Our approach is first to build a graph of similarities, then to reduce the graph according to the modification algorithm, and finally to apply a standard graph clustering algorithm to the reduced graph. This method was used for classifying ASTRAL-40 non-redundant protein domains, identifying significant pairwise similarities with Yakusa, a program devised for rapid 3D structure alignments.
Conclusions: We show that filtering similarities prior to a standard graph-based clustering process by applying ternary similarity constraints i) improves the separation of proteins of different classes and consequently ii) improves the classification quality of standard graph-based clustering algorithms according to the reference classification SCOP. PMID:22974051
COACH: profile-profile alignment of protein families using hidden Markov models.
Edgar, Robert C; Sjölander, Kimmen
2004-05-22
Alignments of two multiple-sequence alignments, or statistical models of such alignments (profiles), have important applications in computational biology. The increased amount of information in a profile versus a single sequence can lead to more accurate alignments and more sensitive homolog detection in database searches. Several profile-profile alignment methods have been proposed and have been shown to improve sensitivity and alignment quality compared with sequence-sequence methods (such as BLAST) and profile-sequence methods (e.g. PSI-BLAST). Here we present a new approach to profile-profile alignment we call Comparison of Alignments by Constructing Hidden Markov Models (HMMs) (COACH). COACH aligns two multiple sequence alignments by constructing a profile HMM from one alignment and aligning the other to that HMM. We compare the alignment accuracy of COACH with two recently published methods: Yona and Levitt's prof_sim and Sadreyev and Grishin's COMPASS. On two sets of reference alignments selected from the FSSP database, we find that COACH is able, on average, to produce alignments giving the best coverage or the fewest errors, depending on the chosen parameter settings. COACH is freely available from www.drive5.com/lobster
DEMO: Sequence Alignment to Predict Across Species Susceptibility
The US Environmental Protection Agency Sequence Alignment to Predict Across Species Susceptibility tool (SeqAPASS; https://seqapass.epa.gov/seqapass/) was developed to comparatively evaluate protein sequence and structural similarity across species as a means to extrapolate toxic...
Cui, Xuefeng; Lu, Zhiwu; Wang, Sheng; Jing-Yan Wang, Jim; Gao, Xin
2016-06-15
Protein homology detection, a fundamental problem in computational biology, is an indispensable step toward predicting protein structures and understanding protein functions. Despite the advances in recent decades on sequence alignment, threading and alignment-free methods, protein homology detection remains a challenging open problem. Recently, network methods that try to find transitive paths in the protein structure space demonstrate the importance of incorporating network information of the structure space. Yet, current methods merge the sequence space and the structure space into a single space, and thus introduce inconsistency in combining different sources of information. We present a novel network-based protein homology detection method, CMsearch, based on cross-modal learning. Instead of exploring a single network built from the mixture of sequence and structure space information, CMsearch builds two separate networks to represent the sequence space and the structure space. It then learns sequence-structure correlation by simultaneously taking sequence information, structure information, sequence space information and structure space information into consideration. We tested CMsearch on two challenging tasks, protein homology detection and protein structure prediction, by querying all 8332 PDB40 proteins. Our results demonstrate that CMsearch is insensitive to the similarity metrics used to define the sequence and the structure spaces. By using HMM-HMM alignment as the sequence similarity metric, CMsearch clearly outperforms state-of-the-art homology detection methods and the CASP-winning template-based protein structure prediction methods. Our program is freely available for download from http://sfb.kaust.edu.sa/Pages/Software.aspx. Contact: xin.gao@kaust.edu.sa. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zemla, A; Lang, D; Kostova, T
2010-11-29
Most of the currently used methods for protein function prediction rely on sequence-based comparisons between a query protein and those for which a functional annotation is provided. A serious limitation of sequence similarity-based approaches for identifying residue conservation among proteins is the low confidence in assigning residue-residue correspondences among proteins when the level of sequence identity between the compared proteins is poor. Multiple sequence alignment methods are more satisfactory; still, they cannot provide reliable results at low levels of sequence identity. Our goal in the current work was to develop an algorithm that could overcome these difficulties and facilitate the identification of structurally (and possibly functionally) relevant residue-residue correspondences between compared protein structures. Here we present StralSV, a new algorithm for detecting closely related structure fragments and quantifying residue frequency from tight local structure alignments. We apply StralSV in a study of the RNA-dependent RNA polymerase of poliovirus and demonstrate that the algorithm can be used to determine regions of the protein that are relatively unique or that share structural similarity with structures that are distantly related. By quantifying residue frequencies among many residue-residue pairs extracted from local alignments, one can infer potential structural or functional importance of specific residues that are determined to be highly conserved or that deviate from a consensus. We further demonstrate that considerable detailed structural and phylogenetic information can be derived from StralSV analyses. StralSV is a new structure-based algorithm for identifying and aligning structure fragments that have similarity to a reference protein.
StralSV analysis can be used to quantify residue-residue correspondences and identify residues that may be of particular structural or functional importance, as well as unusual or unexpected residues at a given sequence position.
Sequence-similar, structure-dissimilar protein pairs in the PDB.
Kosloff, Mickey; Kolodny, Rachel
2008-05-01
It is often assumed that in the Protein Data Bank (PDB), two proteins with similar sequences will also have similar structures. Accordingly, it has proved useful to develop subsets of the PDB from which "redundant" structures have been removed, based on a sequence-based criterion for similarity. Similarly, when predicting protein structure using homology modeling, if a template structure for modeling a target sequence is selected by sequence alone, this implicitly assumes that all sequence-similar templates are equivalent. Here, we show that this assumption is often not correct and that standard approaches to create subsets of the PDB can lead to the loss of structurally and functionally important information. We have carried out sequence-based structural superpositions and geometry-based structural alignments of a large number of protein pairs to determine the extent to which sequence similarity ensures structural similarity. We find many examples where two proteins that are similar in sequence have structures that differ significantly from one another. The source of the structural differences usually has a functional basis. The number of such proteins pairs that are identified and the magnitude of the dissimilarity depend on the approach that is used to calculate the differences; in particular sequence-based structure superpositioning will identify a larger number of structurally dissimilar pairs than geometry-based structural alignments. When two sequences can be aligned in a statistically meaningful way, sequence-based structural superpositioning provides a meaningful measure of structural differences. This approach and geometry-based structure alignments reveal somewhat different information and one or the other might be preferable in a given application. Our results suggest that in some cases, notably homology modeling, the common use of nonredundant datasets, culled from the PDB based on sequence, may mask important structural and functional information. 
We have established a data base of sequence-similar, structurally dissimilar protein pairs that will help address this problem (http://luna.bioc.columbia.edu/rachel/seqsimstrdiff.htm).
Chakraborty, Chiranjib; Bandyopadhyay, Sanghamitra; Doss, C George Priya; Agoramoorthy, Govindasamy
2015-04-01
Maturity onset diabetes of the young (MODY) is a metabolic and genetic disorder. It is different from type 1 and type 2 diabetes, with a low occurrence level (1-2%) among all diabetes. This disorder is a consequence of β-cell dysfunction. To date, 11 subtypes of MODY have been identified, and all of them can cause gene mutations. However, very little is known about the gene mapping, molecular phylogenetics, and co-expression among MODY genes and networking between cascades. This study used recent servers and software such as VarioWatch, ClustalW, MUSCLE, G Blocks, Phylogeny.fr, iTOL, WebLogo, STRING, and KEGG PATHWAY to perform comprehensive analyses of gene mapping, multiple sequence alignment, molecular phylogenetics, protein-protein network design, co-expression analysis of MODY genes, and pathway development. The MODY genes are located on chromosomes 2, 7, 8, 9, 11, 12, 13, 17, and 20. Highly aligned block shows Pro, Gly, Leu, Arg, and Pro residues are highly aligned in the positions of 296, 386, 437, 455, 456 and 598, respectively. Alignment scores indicate that the HNF1A and HNF1B proteins show high sequence similarity among MODY proteins. Protein-protein network design shows that HNF1A, HNF1B, HNF4A, NEUROD1, PDX1, PAX4, INS, and GCK are strongly connected, and the co-expression analyses between MODY genes also show a distinct association between the HNF1A and HNF4A genes. This study used recent bioinformatics tools to develop a rapid method to assess the evolutionary relationship, the network development, and the associations among eleven MODY genes and cascades. The prediction of sequence conservation, molecular phylogenetics, protein-protein network and the association between the MODY cascades enhances opportunities to gain more insight into the less-known MODY disease.
Tsigelny, Igor; Sharikov, Yuriy; Ten Eyck, Lynn F
2002-05-01
HMMSPECTR is a tool for finding putative structural homologs for proteins with known primary sequences. HMMSPECTR contains four major components: a data warehouse with the hidden Markov models (HMM) and alignment libraries; a search program which compares the initial protein sequences with the libraries of HMMs; a secondary structure prediction and comparison program; and a dominant protein selection program that prepares the set of 10-15 "best" proteins from the chosen HMMs. The data warehouse contains four libraries of HMMs. The first two libraries were constructed using different HMM preparation options of the HAMMER program. The third library contains parts ("partial HMM") of initial alignments. The fourth library contains trained HMMs. We tested our program against all of the protein targets proposed in the CASP4 competition. The data warehouse included libraries of structural alignments and HMMs constructed on the basis of proteins publicly available in the Protein Data Bank before the CASP4 meeting. The newest fully automated versions of HMMSPECTR 1.02 and 1.02ss produced better results than the best result reported at CASP4 either by r.m.s.d. or by length (or both) in 64% (HMMSPECTR 1.02) and 79% (HMMSPECTR 1.02ss) of the cases. The improvement is most notable for the targets with complexity 4 (difficult fold recognition cases).
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chakraborty, Sandeep; Rao, Basuthkar J.; Baker, Nathan A.
2013-04-01
Phylogenetic analysis of proteins using multiple sequence alignment (MSA) assumes an underlying evolutionary relationship in these proteins which occasionally remains undetected due to considerable sequence divergence. Structural alignment programs have been developed to unravel such fuzzy relationships. However, none of these structure based methods have used electrostatic properties to discriminate between spatially equivalent residues. We present a methodology for MSA of a set of related proteins with known structures using electrostatic properties as an additional discriminator (STEEP). STEEP first extracts a profile, then generates a multiple structural superimposition providing a consolidated spatial framework for comparing residues and finally emits the MSA. Residues that are aligned differently by including or excluding electrostatic properties can be targeted by directed evolution experiments to transform the enzymatic properties of one protein into another. We have compared STEEP results to those obtained from a MSA program (ClustalW) and a structural alignment method (MUSTANG) for chymotrypsin serine proteases. Subsequently, we used PhyML to generate phylogenetic trees for the serine and metallo-β-lactamase superfamilies from the STEEP generated MSA, and corroborated the accepted relationships in these superfamilies. We have observed that STEEP acts as a functional classifier when electrostatic congruence is used as a discriminator, and thus identifies potential targets for directed evolution experiments. In summary, STEEP is unique among phylogenetic methods for its ability to use electrostatic congruence to specify mutations that might be the source of the functional divergence in a protein family. Based on our results, we also hypothesize that the active site and its close vicinity contains enough information to infer the correct phylogeny for related proteins.
Stronghill, P; Pathan, N; Ha, H; Supijono, E; Hasenkampf, C
2010-08-01
A cytological comparative analysis of male meiocytes was performed for Arabidopsis wild type and the ahp2 (hop2) mutant with emphasis on ahp2's largely uncharacterized prophase I. Leptotene progression appeared normal in ahp2 meiocytes; chromosomes exhibited regular axis formation and assumed a typical polarized nuclear organization. In contrast, 4',6'-diamidino-2-phenylindole-stained ahp2 pachytene chromosome spreads demonstrated a severe reduction in stabilized pairing. However, transmission electron microscopy (TEM) analysis of sections from meiocytes revealed that ahp2 chromosome axes underwent significant amounts of close alignment (44% of total axis). This apparent paradox strongly suggests that the Ahp2 protein is involved in the stabilization of homologous chromosome close alignment. Fluorescent in situ hybridization in combination with Zyp1 immunostaining revealed that ahp2 mutants undergo homologous synapsis of the nucleolus-organizer-region-bearing short arms of chromosomes 2 and 4, despite the otherwise "nucleus-wide" lack of stabilized pairing. The duration of ahp2 zygotene was significantly prolonged and is most likely due to difficulties in chromosome alignment stabilization and subsequent synaptonemal complex formation. Ahp2 and Mnd1 proteins have previously been shown, "in vitro," to form a heterodimer. Here we show, "in situ," that the Ahp2 and Mnd1 proteins are synchronous in their appearance and disappearance from meiotic chromosomes. Both the Ahp2 and Mnd1 proteins localize along the chromosomal axis. However, localization of the Ahp2 protein was entirely foci-based whereas Mnd1 protein exhibited an immunostaining pattern with some foci along the axis and a diffuse staining for the rest of the chromosome.
Functional Alignment of Metabolic Networks.
Mazza, Arnon; Wagner, Allon; Ruppin, Eytan; Sharan, Roded
2016-05-01
Network alignment has become a standard tool in comparative biology, allowing the inference of protein function, interaction, and orthology. However, current alignment techniques are based on topological properties of networks and do not take into account their functional implications. Here we propose, for the first time, an algorithm to align two metabolic networks by taking advantage of their coupled metabolic models. These models allow us to assess the functional implications of genes or reactions, captured by the metabolic fluxes that are altered following their deletion from the network. Such implications may spread far beyond the region of the network where the gene or reaction lies. We apply our algorithm to align metabolic networks from various organisms, ranging from bacteria to humans, showing that our alignment can reveal functional orthology relations that are missed by conventional topological alignments.
Protein structure-structure alignment with discrete Fréchet distance.
Jiang, Minghui; Xu, Ying; Zhu, Binhai
2008-02-01
Matching two geometric objects in two-dimensional (2D) and three-dimensional (3D) spaces is a central problem in computer vision, pattern recognition, and protein structure prediction. In particular, the problem of aligning two polygonal chains under translation and rotation to minimize their distance has been studied using various distance measures. It is well known that the Hausdorff distance is useful for matching two point sets, and that the Fréchet distance is a superior measure for matching two polygonal chains. The discrete Fréchet distance closely approximates the (continuous) Fréchet distance, and is a natural measure for the geometric similarity of the folded 3D structures of biomolecules such as proteins. In this paper, we present new algorithms for matching two polygonal chains in two dimensions to minimize their discrete Fréchet distance under translation and rotation, and an effective heuristic for matching two polygonal chains in three dimensions. We also describe our empirical results on the application of the discrete Fréchet distance to protein structure-structure alignment.
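The discrete Fréchet distance named in this abstract can be illustrated with the standard dynamic-programming recurrence over two fixed point sequences. This is only a minimal sketch: it evaluates the distance for one given placement of the chains and does not perform the minimization over translation and rotation that the paper addresses. The function name `discrete_frechet` and the example coordinates are illustrative assumptions, not taken from the paper.

```python
import math


def discrete_frechet(p, q):
    """Discrete Frechet distance between two point sequences in a fixed placement.

    p and q are non-empty lists of coordinate tuples (e.g. C-alpha positions).
    """
    n, m = len(p), len(q)
    # ca[i][j] holds the coupling distance for the prefixes p[:i+1] and q[:j+1].
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                # Either chain (or both) advances by one vertex.
                ca[i][j] = max(min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1]), d)
    return ca[-1][-1]


# Two parallel three-point chains one unit apart (hypothetical data).
chain_a = [(0, 0, 0), (1, 0, 0), (2, 0, 0)]
chain_b = [(0, 1, 0), (1, 1, 0), (2, 1, 0)]
print(discrete_frechet(chain_a, chain_b))
```

The table costs time and space proportional to the product of the chain lengths; an alignment method would call such a routine repeatedly while searching over rigid motions.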
Weininger, Arthur; Weininger, Susan
2015-01-01
The ability to identify the functional correlates of structural and sequence variation in proteins is a critical capability. We related structures of influenza A N10 and N11 proteins that have no established function to structures of proteins with known function by identifying spatially conserved atoms. We identified atoms with common distributed spatial occupancy in PDB structures of N10 protein, N11 protein, an influenza A neuraminidase, an influenza B neuraminidase, and a bacterial neuraminidase. By superposing these spatially conserved atoms, we aligned the structures and associated molecules. We report spatially and sequence invariant residues in the aligned structures. Spatially invariant residues in the N6 and influenza B neuraminidase active sites were found in previously unidentified spatially equivalent sites in the N10 and N11 proteins. We found the corresponding secondary and tertiary structures of the aligned proteins to be largely identical despite significant sequence divergence. We found structural precedent in known non-neuraminidase structures for residues exhibiting structural and sequence divergence in the aligned structures. In N10 protein, we identified staphylococcal enterotoxin I-like domains. In N11 protein, we identified hepatitis E E2S-like domains, SARS spike protein-like domains, and toxin components shared by alpha-bungarotoxin, staphylococcal enterotoxin I, anthrax lethal factor, clostridium botulinum neurotoxin, and clostridium tetanus toxin. The presence of active site components common to the N6, influenza B, and S. pneumoniae neuraminidases in the N10 and N11 proteins, combined with the absence of apparent neuraminidase function, suggests that the role of neuraminidases in H17N10 and H18N11 emerging influenza A viruses may have changed. 
The presentation of E2S-like, SARS spike protein-like, or toxin-like domains by the N10 and N11 proteins in these emerging viruses may indicate that H17N10 and H18N11 sialidase-facilitated cell entry has been supplemented or replaced by sialidase-independent receptor binding to an expanded cell population that may include neurons and T-cells. PMID:25706124
Omori, Satoshi; Kitao, Akio
2013-06-01
We propose a fast clustering and reranking method, CyClus, for protein-protein docking decoys. This method enables comprehensive clustering of whole decoys generated by rigid-body docking using cylindrical approximation of the protein-protein interface and hierarchical clustering procedures. We demonstrate the clustering and reranking of 54,000 decoy structures generated by ZDOCK for each complex within a few minutes. After parameter tuning for the test set in ZDOCK benchmark 2.0 with the ZDOCK and ZRANK scoring functions, blind tests for the incremental data in ZDOCK benchmark 3.0 and 4.0 were conducted. CyClus successfully generated smaller subsets of decoys containing near-native decoys. For example, the number of decoys required to create subsets containing near-native decoys with 80% probability was reduced to between 22% and 50% of the number required in the original ZDOCK. Although specific ZDOCK and ZRANK results were demonstrated, the CyClus algorithm was designed to be more general and can be applied to a wide range of decoys and scoring functions by adjusting just two parameters, p and T. CyClus results were also compared to those from ClusPro. Copyright © 2013 Wiley Periodicals, Inc.
Application of the docking program SOL for CSAR benchmark.
Sulimov, Alexey V; Kutov, Danil C; Oferkin, Igor V; Katkova, Ekaterina V; Sulimov, Vladimir B
2013-08-26
This paper is devoted to results obtained by the docking program SOL and the post-processing program DISCORE at the CSAR benchmark. SOL and DISCORE programs are described. SOL is the original docking program developed on the basis of the genetic algorithm, MMFF94 force field, rigid protein, precalculated energy grid including desolvation in the frame of simplified GB model, vdW, and electrostatic interactions and taking into account the ligand internal strain energy. An important SOL feature is the single- or multi-processor performance for up to hundreds of CPUs. DISCORE improves the binding energy scoring by the local energy optimization of the ligand docked pose and a simple linear regression on the base of available experimental data. The docking program SOL has demonstrated a good ability for correct ligand positioning in the active sites of the tested proteins in most cases of CSAR exercises. SOL and DISCORE have not demonstrated very exciting results on the protein-ligand binding free energy estimation. Nevertheless, for some target proteins, SOL and DISCORE were among the first in prediction of inhibition activity. Ways to improve SOL and DISCORE are discussed.
nGASP - the nematode genome annotation assessment project
DOE Office of Scientific and Technical Information (OSTI.GOV)
Coghlan, A; Fiedler, T J; McKay, S J
2008-12-19
While the C. elegans genome is extensively annotated, relatively little information is available for other Caenorhabditis species. The nematode genome annotation assessment project (nGASP) was launched to objectively assess the accuracy of protein-coding gene prediction software in C. elegans, and to apply this knowledge to the annotation of the genomes of four additional Caenorhabditis species and other nematodes. Seventeen groups worldwide participated in nGASP, and submitted 47 prediction sets for 10 Mb of the C. elegans genome. Predictions were compared to reference gene sets consisting of confirmed or manually curated gene models from WormBase. The most accurate gene-finders were 'combiner' algorithms, which made use of transcript- and protein-alignments and multi-genome alignments, as well as gene predictions from other gene-finders. Gene-finders that used alignments of ESTs, mRNAs and proteins came in second place. There was a tie for third place between gene-finders that used multi-genome alignments and ab initio gene-finders. The median gene level sensitivity of combiners was 78% and their specificity was 42%, which is nearly the same accuracy as reported for combiners in the human genome. C. elegans genes with exons of unusual hexamer content, as well as those with many exons, short exons, long introns, a weak translation start signal, weak splice sites, or poorly conserved orthologs were the most challenging for gene-finders.
Drosophila CTCF tandemly aligns with other insulator proteins at the borders of H3K27me3 domains.
Van Bortle, Kevin; Ramos, Edward; Takenaka, Naomi; Yang, Jingping; Wahi, Jessica E; Corces, Victor G
2012-11-01
Several multiprotein DNA complexes capable of insulator activity have been identified in Drosophila melanogaster, yet only CTCF, a highly conserved zinc finger protein, and the transcription factor TFIIIC have been shown to function in mammals. CTCF is involved in diverse nuclear activities, and recent studies suggest that the proteins with which it associates and the DNA sequences that it targets may underlie these various roles. Here we show that the Drosophila homolog of CTCF (dCTCF) aligns in the genome with other Drosophila insulator proteins such as Suppressor of Hairy wing [SU(HW)] and Boundary Element Associated Factor of 32 kDa (BEAF-32) at the borders of H3K27me3 domains, which are also enriched for associated insulator proteins and additional cofactors. RNAi depletion of dCTCF and combinatorial knockdown of gene expression for other Drosophila insulator proteins leads to a reduction in H3K27me3 levels within repressed domains, suggesting that insulators are important for the maintenance of appropriate repressive chromatin structure in Polycomb (Pc) domains. These results shed new insights into the roles of insulators in chromatin domain organization and support recent models suggesting that insulators underlie interactions important for Pc-mediated repression. We reveal an important relationship between dCTCF and other Drosophila insulator proteins and speculate that vertebrate CTCF may also align with other nuclear proteins to accomplish similar functions.
Drosophila CTCF tandemly aligns with other insulator proteins at the borders of H3K27me3 domains
Van Bortle, Kevin; Ramos, Edward; Takenaka, Naomi; Yang, Jingping; Wahi, Jessica E.; Corces, Victor G.
2012-01-01
Several multiprotein DNA complexes capable of insulator activity have been identified in Drosophila melanogaster, yet only CTCF, a highly conserved zinc finger protein, and the transcription factor TFIIIC have been shown to function in mammals. CTCF is involved in diverse nuclear activities, and recent studies suggest that the proteins with which it associates and the DNA sequences that it targets may underlie these various roles. Here we show that the Drosophila homolog of CTCF (dCTCF) aligns in the genome with other Drosophila insulator proteins such as Suppressor of Hairy wing [SU(HW)] and Boundary Element Associated Factor of 32 kDa (BEAF-32) at the borders of H3K27me3 domains, which are also enriched for associated insulator proteins and additional cofactors. RNAi depletion of dCTCF and combinatorial knockdown of gene expression for other Drosophila insulator proteins leads to a reduction in H3K27me3 levels within repressed domains, suggesting that insulators are important for the maintenance of appropriate repressive chromatin structure in Polycomb (Pc) domains. These results shed new insights into the roles of insulators in chromatin domain organization and support recent models suggesting that insulators underlie interactions important for Pc-mediated repression. We reveal an important relationship between dCTCF and other Drosophila insulator proteins and speculate that vertebrate CTCF may also align with other nuclear proteins to accomplish similar functions. PMID:22722341
Bastien, Olivier; Ortet, Philippe; Roy, Sylvaine; Maréchal, Eric
2005-03-10
Popular methods to reconstruct molecular phylogenies are based on multiple sequence alignments, in which addition or removal of data may change the resulting tree topology. We have sought a representation of homologous proteins that would conserve the information of pair-wise sequence alignments, respect probabilistic properties of Z-scores (Monte Carlo methods applied to pair-wise comparisons) and be the basis for a novel method of consistent and stable phylogenetic reconstruction. We have built up a spatial representation of protein sequences using concepts from particle physics (configuration space) and respecting a frame of constraints deduced from pair-wise alignment score properties in information theory. The obtained configuration space of homologous proteins (CSHP) allows the representation of real and shuffled sequences, and thereupon an expression of the TULIP theorem for Z-score probabilities. Based on the CSHP, we propose a phylogeny reconstruction using Z-scores. Deduced trees, called TULIP trees, are consistent with multiple-alignment based trees. Furthermore, the TULIP tree reconstruction method provides a solution for some previously reported incongruent results, such as the apicomplexan enolase phylogeny. The CSHP is a unified model that conserves mutual information between proteins in the way physical models conserve energy. Applications include the reconstruction of evolutionary consistent and robust trees, the topology of which is based on a spatial representation that is not reordered after addition or removal of sequences. The CSHP and its assigned phylogenetic topology, provide a powerful and easily updated representation for massive pair-wise genome comparisons based on Z-score computations.
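The Z-scores mentioned in this abstract come from Monte Carlo comparison of an observed pairwise score against scores obtained with shuffled sequences. A minimal sketch of that idea follows. The abstract does not state the scoring function, shuffling scheme, or sample size, so everything concrete here is an assumption: the longest-common-subsequence length stands in as a toy similarity score, and the names `lcs_length` and `z_score`, the 200 shuffles, and the fixed seed are all hypothetical.

```python
import random
import statistics


def lcs_length(a, b):
    """Length of the longest common subsequence (toy stand-in for an alignment score)."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, bj in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ch == bj else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def z_score(a, b, score=lcs_length, n_shuffles=200, seed=0):
    """Z-score of score(a, b) against scores of a versus shuffled copies of b."""
    rng = random.Random(seed)
    observed = score(a, b)
    letters = list(b)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # shuffling preserves the composition of b
        null_scores.append(score(a, "".join(letters)))
    mu = statistics.mean(null_scores)
    sigma = statistics.pstdev(null_scores)
    if sigma == 0:
        return 0.0  # degenerate null distribution; no meaningful Z-score
    return (observed - mu) / sigma


seq = "ACDEFGHIKLMNPQRSTVWY"
print(z_score(seq, seq))
```

Because shuffling keeps the amino-acid composition fixed, a large positive value indicates similarity beyond what composition alone would explain.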
Delineating slowly and rapidly evolving fractions of the Drosophila genome.
Keith, Jonathan M; Adams, Peter; Stephen, Stuart; Mattick, John S
2008-05-01
Evolutionary conservation is an important indicator of function and a major component of bioinformatic methods to identify non-protein-coding genes. We present a new Bayesian method for segmenting pairwise alignments of eukaryotic genomes while simultaneously classifying segments into slowly and rapidly evolving fractions. We also describe an information criterion similar to the Akaike Information Criterion (AIC) for determining the number of classes. Working with pairwise alignments enables detection of differences in conservation patterns among closely related species. We analyzed three whole-genome and three partial-genome pairwise alignments among eight Drosophila species. Three distinct classes of conservation level were detected. Sequences comprising the most slowly evolving component were consistent across a range of species pairs, and constituted approximately 62-66% of the D. melanogaster genome. Almost all (>90%) of the aligned protein-coding sequence is in this fraction, suggesting much of it (comprising the majority of the Drosophila genome, including approximately 56% of non-protein-coding sequences) is functional. The size and content of the most rapidly evolving component was species dependent, and varied from 1.6% to 4.8%. This fraction is also enriched for protein-coding sequence (while containing significant amounts of non-protein-coding sequence), suggesting it is under positive selection. We also classified segments according to conservation and GC content simultaneously. This analysis identified numerous sub-classes of those identified on the basis of conservation alone, but was nevertheless consistent with that classification. Software, data, and results available at www.maths.qut.edu.au/-keithj/. Genomic segments comprising the conservation classes available in BED format.
Protein Sectors: Statistical Coupling Analysis versus Conservation
Teşileanu, Tiberiu; Colwell, Lucy J.; Leibler, Stanislas
2015-01-01
Statistical coupling analysis (SCA) is a method for analyzing multiple sequence alignments that was used to identify groups of coevolving residues termed “sectors”. The method applies spectral analysis to a matrix obtained by combining correlation information with sequence conservation. It has been asserted that the protein sectors identified by SCA are functionally significant, with different sectors controlling different biochemical properties of the protein. Here we reconsider the available experimental data and note that it involves almost exclusively proteins with a single sector. We show that in this case sequence conservation is the dominating factor in SCA, and can alone be used to make statistically equivalent functional predictions. Therefore, we suggest shifting the experimental focus to proteins for which SCA identifies several sectors. Correlations in protein alignments, which have been shown to be informative in a number of independent studies, would then be less dominated by sequence conservation. PMID:25723535
Jones, David T; Kandathil, Shaun M
2018-04-26
In addition to substitution frequency data from protein sequence alignments, many state-of-the-art methods for contact prediction rely on additional sources of information, or features, of protein sequences in order to predict residue-residue contacts, such as solvent accessibility, predicted secondary structure, and scores from other contact prediction methods. It is unclear how much of this information is needed to achieve state-of-the-art results. Here, we show that using deep neural network models, simple alignment statistics contain sufficient information to achieve state-of-the-art precision. Our prediction method, DeepCov, uses fully convolutional neural networks operating on amino-acid pair frequency or covariance data derived directly from sequence alignments, without using global statistical methods such as sparse inverse covariance or pseudolikelihood estimation. Comparisons against CCMpred and MetaPSICOV2 show that using pairwise covariance data calculated from raw alignments as input allows us to match or exceed the performance of both of these methods. Almost all of the achieved precision is obtained when considering relatively local windows (around 15 residues) around any member of a given residue pairing; larger window sizes have comparable performance. Assessment on a set of shallow sequence alignments (fewer than 160 effective sequences) indicates that the new method is substantially more precise than CCMpred and MetaPSICOV2 in this regime, suggesting that improved precision is attainable on smaller sequence families. Overall, the performance of DeepCov is competitive with the state of the art, and our results demonstrate that global models, which employ features from all parts of the input alignment when predicting individual contacts, are not strictly needed in order to attain precise contact predictions. DeepCov is freely available at https://github.com/psipred/DeepCov.
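The "amino-acid pair frequency or covariance data" described here can be illustrated for a single pair of alignment columns: the covariance for residue types a and b is the observed pair frequency minus the product of the two single-column frequencies. The sketch below is a simplified reading of that statistic only. It assumes unweighted sequences, no pseudocounts, and a caller-supplied alphabet, none of which is specified in the abstract; the name `pair_covariance` is hypothetical and is not DeepCov's API.

```python
from collections import Counter
from itertools import product


def pair_covariance(alignment, i, j, alphabet):
    """Covariance f_ij(a, b) - f_i(a) * f_j(b) for alignment columns i and j.

    alignment is a list of equal-length strings; returns a dict keyed by (a, b).
    """
    n = len(alignment)
    f_i = Counter(seq[i] for seq in alignment)
    f_j = Counter(seq[j] for seq in alignment)
    f_ij = Counter((seq[i], seq[j]) for seq in alignment)
    return {
        (a, b): f_ij[(a, b)] / n - (f_i[a] / n) * (f_j[b] / n)
        for a, b in product(alphabet, repeat=2)
    }


# Toy alignment in which columns 0 and 1 co-vary perfectly (hypothetical data).
toy = ["AC", "AC", "GT", "GT"]
cov = pair_covariance(toy, 0, 1, "ACGT")
print(cov[("A", "C")], cov[("A", "T")])
```

Stacking such tables for every column pair gives an input of shape L x L x (alphabet size squared), which is the kind of purely local statistic a convolutional network can consume.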
FLASHFLOOD: A 3D Field-based similarity search and alignment method for flexible molecules
NASA Astrophysics Data System (ADS)
Pitman, Michael C.; Huber, Wolfgang K.; Horn, Hans; Krämer, Andreas; Rice, Julia E.; Swope, William C.
2001-07-01
A three-dimensional field-based similarity search and alignment method for flexible molecules is introduced. The conformational space of a flexible molecule is represented in terms of fragments and torsional angles of allowed conformations. A user-definable property field is used to compute features of fragment pairs. Features are generalizations of CoMMA descriptors (Silverman, B.D. and Platt, D.E., J. Med. Chem., 39 (1996) 2129.) that characterize local regions of the property field by its local moments. The features are invariant under coordinate system transformations. Features taken from a query molecule are used to form alignments with fragment pairs in the database. An assembly algorithm is then used to merge the fragment pairs into full structures, aligned to the query. Key to the method is the use of a context adaptive descriptor scaling procedure as the basis for similarity. This allows the user to tune the weights of the various feature components based on examples relevant to the particular context under investigation. The property fields may range from simple, phenomenological fields, to fields derived from quantum mechanical calculations. We apply the method to the dihydrofolate/methotrexate benchmark system, and show that when one injects relevant contextual information into the descriptor scaling procedure, better results are obtained more efficiently. We also show how the method works and include computer times for a query from a database that represents approximately 23 million conformers of seventeen flexible molecules.
NASA Astrophysics Data System (ADS)
Prates, G.; Berrocoso, M.; Fernández-Ros, A.; García, A.; Ortiz, R.
2012-04-01
El Hierro Island (Canary Islands, Spain) has undergone a submarine eruption a few kilometers to its southeast, detected October 10, on the rift alignment that cuts across the island. However, the seismicity level suddenly increased around July 17 and ground deformation was detected by the only continuously observed GNSS-GPS (Global Navigation Satellite Systems - Global Positioning System) benchmark FRON in the El Golfo area. Based on that information several other GNSS-GPS benchmarks were installed, some of which were continuously observed as well. A normal vector analysis was applied to these collected data. The normal vector magnitude variation identified local extension-compression regimes, while the normal vector inclination showed the relative height variation between the three benchmarks that define the plane whose normal vector is analyzed. To accomplish this analysis the data was previously processed to achieve positioning solutions every 30 minutes using the Bernese GPS Software 5.0, further enhanced by a Discrete Kalman Filter, giving an overall millimeter level precision. These solutions were reached using the IGS (International GNSS Service) ultra-rapid orbits and the double-differenced ionosphere-free combination. With this strategy the positioning solutions were attained in near real-time. Later with the IGS rapid orbits the data was reprocessed to provide added confidence to the solutions. Two triangles were then considered, a smaller one located in the El Golfo area within the historically collapsed caldera, and a larger one defined by benchmarks placed in Valverde, El Golfo and La Restinga, the town closest to the eruption's location, covering almost the entire Island's surface above sea level. With these two triangles the pre-eruption and post-eruption deformation suffered by El Hierro's surface will be further analyzed.
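One plausible reading of the normal vector analysis is sketched below: the normal of the triangle spanned by three benchmarks is the cross product of two edge vectors, its magnitude equals twice the triangle area (so it grows under extension and shrinks under compression), and its inclination from the vertical changes when the benchmarks move in height relative to one another. The abstract gives no formulas, so this interpretation, the local east-north-up coordinates in metres, and the name `triangle_normal` are assumptions.

```python
import math


def triangle_normal(a, b, c):
    """Normal vector, its magnitude, and its tilt from vertical (degrees).

    a, b, c are (east, north, up) positions of three benchmarks.
    """
    u = [b[k] - a[k] for k in range(3)]
    v = [c[k] - a[k] for k in range(3)]
    # Cross product u x v; its length is twice the area of the triangle.
    n = (
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    )
    magnitude = math.sqrt(sum(x * x for x in n))
    if magnitude == 0:
        raise ValueError("benchmarks are collinear; the plane is undefined")
    inclination = math.degrees(math.acos(abs(n[2]) / magnitude))
    return n, magnitude, inclination


# Hypothetical 1 km triangle, before and after one benchmark rises by 10 m.
before = triangle_normal((0, 0, 0), (1000, 0, 0), (0, 1000, 0))
after = triangle_normal((0, 0, 0), (1000, 0, 0), (0, 1000, 10))
print(before[2], after[2])
```

Tracking the magnitude and inclination over successive 30-minute positioning solutions would give the two time series the abstract refers to.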
DOE Office of Scientific and Technical Information (OSTI.GOV)
Brunner, D.; LaBombard, B.; Ochoukov, R.
2013-03-15
A new Retarding Field Analyzer (RFA) head has been created for the outer-midplane scanning probe system on the Alcator C-Mod tokamak. The new probe head contains back-to-back retarding field analyzers aligned with the local magnetic field. One faces 'upstream' into the field-aligned plasma flow and the other faces 'downstream' away from the flow. The RFA was created primarily to benchmark ion temperature measurements of an ion sensitive probe; it may also be used to interrogate electrons. However, its construction is robust enough to be used to measure ion and electron temperatures up to the last-closed flux surface in C-Mod. An RFA probe of identical design has been attached to the side of a limiter to explore direct changes to the boundary plasma due to lower hybrid heating and current drive. Design of the high heat flux (>100 MW/m²) handling probe and initial results are presented.
Worley, K C; Wiese, B A; Smith, R F
1995-09-01
BEAUTY (BLAST enhanced alignment utility) is an enhanced version of the NCBI's BLAST data base search tool that facilitates identification of the functions of matched sequences. We have created new data bases of conserved regions and functional domains for protein sequences in NCBI's Entrez data base, and BEAUTY allows this information to be incorporated directly into BLAST search results. A Conserved Regions Data Base, containing the locations of conserved regions within Entrez protein sequences, was constructed by (1) clustering the entire data base into families, (2) aligning each family using our PIMA multiple sequence alignment program, and (3) scanning the multiple alignments to locate the conserved regions within each aligned sequence. A separate Annotated Domains Data Base was constructed by extracting the locations of all annotated domains and sites from sequences represented in the Entrez, PROSITE, BLOCKS, and PRINTS data bases. BEAUTY performs a BLAST search of those Entrez sequences with conserved regions and/or annotated domains. BEAUTY then uses the information from the Conserved Regions and Annotated Domains data bases to generate, for each matched sequence, a schematic display that allows one to directly compare the relative locations of (1) the conserved regions, (2) annotated domains and sites, and (3) the locally aligned regions matched in the BLAST search. In addition, BEAUTY search results include World-Wide Web hypertext links to a number of external data bases that provide a variety of additional types of information on the function of matched sequences. This convenient integration of protein families, conserved regions, annotated domains, alignment displays, and World-Wide Web resources greatly enhances the biological informativeness of sequence similarity searches. 
BEAUTY searches can be performed remotely on our system using the "BCM Search Launcher" World-Wide Web pages (URL is <http://gc.bcm.tmc.edu:8088/search-launcher/launcher.html>).
Four RNA families with functional transient structures
Zhu, Jing Yun A; Meyer, Irmtraud M
2015-01-01
Protein-coding and non-coding RNA transcripts perform a wide variety of cellular functions in diverse organisms. Several of their functional roles are expressed and modulated via RNA structure. A given transcript, however, can have more than a single functional RNA structure throughout its life, a fact which has been previously overlooked. Transient RNA structures, for example, are only present during specific time intervals and cellular conditions. We here introduce four RNA families with transient RNA structures that play distinct and diverse functional roles. Moreover, we show that these transient RNA structures are structurally well-defined and evolutionarily conserved. Since Rfam annotates one structure for each family, there is either no annotation for these transient structures or no such family. Thus, our alignments either significantly update and extend the existing Rfam families or introduce a new RNA family to Rfam. For each of the four RNA families, we compile a multiple-sequence alignment based on experimentally verified transient and dominant (dominant in terms of either the thermodynamic stability and/or attention received so far) RNA secondary structures using a combination of automated search via covariance model and manual curation. The first alignment is the Trp operon leader which regulates the operon transcription in response to tryptophan abundance through alternative structures. The second alignment is the HDV ribozyme which we extend to the 5′ flanking sequence. This flanking sequence is involved in the regulation of the transcript's self-cleavage activity. The third alignment is the 5′ UTR of the maturation protein from Levivirus which contains a transient structure that temporarily postpones the formation of the final inhibitory structure to allow translation of maturation protein. The fourth and last alignment is the SAM riboswitch which regulates the downstream gene expression by assuming alternative structures upon binding of SAM. 
All transient and dominant structures are mapped to our new alignments introduced here. PMID:25751035
Four RNA families with functional transient structures.
Zhu, Jing Yun A; Meyer, Irmtraud M
2015-01-01
Protein-coding and non-coding RNA transcripts perform a wide variety of cellular functions in diverse organisms. Several of their functional roles are expressed and modulated via RNA structure. A given transcript, however, can have more than a single functional RNA structure throughout its life, a fact which has been previously overlooked. Transient RNA structures, for example, are only present during specific time intervals and cellular conditions. We here introduce four RNA families with transient RNA structures that play distinct and diverse functional roles. Moreover, we show that these transient RNA structures are structurally well-defined and evolutionarily conserved. Since Rfam annotates one structure for each family, there is either no annotation for these transient structures or no such family. Thus, our alignments either significantly update and extend the existing Rfam families or introduce a new RNA family to Rfam. For each of the four RNA families, we compile a multiple-sequence alignment based on experimentally verified transient and dominant (dominant in terms of either the thermodynamic stability and/or attention received so far) RNA secondary structures using a combination of automated search via covariance model and manual curation. The first alignment is the Trp operon leader which regulates the operon transcription in response to tryptophan abundance through alternative structures. The second alignment is the HDV ribozyme which we extend to the 5' flanking sequence. This flanking sequence is involved in the regulation of the transcript's self-cleavage activity. The third alignment is the 5' UTR of the maturation protein from Levivirus which contains a transient structure that temporarily postpones the formation of the final inhibitory structure to allow translation of maturation protein. The fourth and last alignment is the SAM riboswitch which regulates the downstream gene expression by assuming alternative structures upon binding of SAM. 
All transient and dominant structures are mapped to our new alignments introduced here.
Liu, Yu; Hong, Yang; Lin, Chun-Yuan; Hung, Che-Lun
2015-01-01
The Smith-Waterman (SW) algorithm has been widely utilized for searching biological sequence databases in bioinformatics. Recently, several works have adopted the graphic card with Graphic Processing Units (GPUs) and their associated CUDA model to enhance the performance of SW computations. However, these works mainly focused on the protein database search by using the intertask parallelization technique, and only using the GPU capability to do the SW computations one by one. Hence, in this paper, we will propose an efficient SW alignment method, called CUDA-SWfr, for the protein database search by using the intratask parallelization technique based on a CPU-GPU collaborative system. Before doing the SW computations on GPU, a procedure is applied on CPU by using the frequency distance filtration scheme (FDFS) to eliminate the unnecessary alignments. The experimental results indicate that CUDA-SWfr runs 9.6 times and 96 times faster than the CPU-based SW method without and with FDFS, respectively.
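The filter-then-align idea in this abstract can be sketched on the CPU alone: a cheap composition-based distance discards database sequences before the quadratic Smith-Waterman step runs. The abstract does not define the frequency distance, so the definition below (the larger of the summed positive and summed negative letter-count differences, as commonly used in string filtering) is an assumption, as are the linear gap penalty, the scoring values, the `max_fd` threshold, and all function names. No GPU code is shown.

```python
from collections import Counter


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score with a linear gap penalty, using two rows."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def frequency_distance(a, b):
    """Assumed definition: max of total surplus and total deficit in letter counts."""
    ca, cb = Counter(a), Counter(b)
    pos = sum(max(ca[ch] - cb[ch], 0) for ch in set(ca) | set(cb))
    neg = sum(max(cb[ch] - ca[ch], 0) for ch in set(ca) | set(cb))
    return max(pos, neg)


def filtered_search(query, database, max_fd):
    """Run Smith-Waterman only on sequences that pass the frequency filter."""
    return [
        (seq, smith_waterman_score(query, seq))
        for seq in database
        if frequency_distance(query, seq) <= max_fd
    ]


print(filtered_search("ACGT", ["ACGT", "TTTTTTTT"], max_fd=2))
```

Counting letters is linear in sequence length, so every rejected sequence saves a full dynamic-programming table; how aggressive a threshold remains safe for local alignment is a tuning question this sketch does not address.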
ExoLocator--an online view into genetic makeup of vertebrate proteins.
Khoo, Aik Aun; Ogrizek-Tomas, Mario; Bulovic, Ana; Korpar, Matija; Gürler, Ece; Slijepcevic, Ivan; Šikic, Mile; Mihalek, Ivana
2014-01-01
ExoLocator (http://exolocator.eopsf.org) collects in a single place information needed for comparative analysis of protein-coding exons from vertebrate species. The main source of data--the genomic sequences, and the existing exon and homology annotation--is the ENSEMBL database of completed vertebrate genomes. To these, ExoLocator adds the search for ostensibly missing exons in orthologous protein pairs across species, using an extensive computational pipeline to narrow down the search region for the candidate exons and find a suitable template in the other species, as well as state-of-the-art implementations of pairwise alignment algorithms. The resulting complements of exons are organized in a way currently unique to ExoLocator: multiple sequence alignments, both on the nucleotide and on the peptide levels, clearly indicating the exon boundaries. The alignments can be inspected in the web-embedded viewer, downloaded or used on the spot to produce an estimate of conservation within orthologous sets, or functional divergence across paralogues.
A generalized global alignment algorithm.
Huang, Xiaoqiu; Chao, Kun-Mao
2003-01-22
Homologous sequences are sometimes similar over some regions but different over other regions. Homologous sequences have a much lower global similarity if the different regions are much longer than the similar regions. We present a generalized global alignment algorithm for comparing sequences with intermittent similarities, an ordered list of similar regions separated by different regions. A generalized global alignment model is defined to handle sequences with intermittent similarities. A dynamic programming algorithm is designed to compute an optimal general alignment in time proportional to the product of sequence lengths and in space proportional to the sum of sequence lengths. The algorithm is implemented as a computer program named GAP3 (Global Alignment Program Version 3). The generalized global alignment model is validated by experimental results produced with GAP3 on both DNA and protein sequences. The GAP3 program extends the ability of standard global alignment programs to recognize homologous sequences of lower similarity. The GAP3 program is freely available for academic use at http://bioinformatics.iastate.edu/aat/align/align.html.
Optimizing high performance computing workflow for protein functional annotation.
Stanberry, Larissa; Rekepalli, Bhanu; Liu, Yuan; Giblock, Paul; Higdon, Roger; Montague, Elizabeth; Broomall, William; Kolker, Natali; Kolker, Eugene
2014-09-10
Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas a high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optimized automated workflow to enable large-scale protein annotation. The workflow uses high performance computing architectures and a low complexity classification algorithm to assign proteins into existing clusters of orthologous groups of proteins. On the basis of the Position-Specific Iterative Basic Local Alignment Search Tool, the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data.
Optimizing high performance computing workflow for protein functional annotation
Stanberry, Larissa; Rekepalli, Bhanu; Liu, Yuan; Giblock, Paul; Higdon, Roger; Montague, Elizabeth; Broomall, William; Kolker, Natali; Kolker, Eugene
2014-01-01
Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas a high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optimized automated workflow to enable large-scale protein annotation. The workflow uses high performance computing architectures and a low complexity classification algorithm to assign proteins into existing clusters of orthologous groups of proteins. On the basis of the Position-Specific Iterative Basic Local Alignment Search Tool, the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data. PMID:25313296
NASA Astrophysics Data System (ADS)
Jones, Carol L.
The number of computer-assisted education programs on the market is overwhelming science teachers all over Michigan. Though the need is great, many teachers are reluctant to procure computer-assisted science education programs because they are unsure of the effectiveness of such programs. The Curriculum Alignment Toolbox (CAT) is a computer-based program, aligned to the Michigan Curriculum Framework's Benchmarks for Science Education and designed to supplement science instruction in Michigan middle schools. The purpose of this study was to evaluate the effectiveness of CAT in raising the standardized test scores of Michigan students. This study involved 419 students from one urban, one suburban and one rural middle school. Data on these students was collected from 4 sources: (1) the 8th grade Michigan Education Assessment Program (MEAP) test, (2) a 9 question, 5-point Likert-type scale student survey, (3) 4 open-response student survey questions and (4) classroom observations. Results of this study showed that the experimental group of 226 students who utilized the CAT program in addition to traditional instruction did significantly better on the Science MEAP test than the control group of 193 students who received only traditional instruction. The study also showed that the urban students from a "high needs" school seemed to benefit most from the program. Additionally, though both genders and all identified ethnic groups benefited from the program, males benefited more than females and whites, blacks and Asian/Pacific Islander students benefited more than Hispanic and multi-racial students. The CAT program's success helping raise the middle school MEAP scores may well be due to some of its components. CAT provided students with game-like experiences all based on the benchmarks required for science education and upon which the MEAP test is based.
The program also provided visual and auditory stimulation as well as numerous references which students indicated they enjoyed. Additionally, as best-practice, the questioning in all the gaming within CAT did not allow a student to continue until he/she had given the correct answer, thus reinforcing the correct response.
Self-aligned blocking integration demonstration for critical sub-40nm pitch Mx level patterning
NASA Astrophysics Data System (ADS)
Raley, Angélique; Mohanty, Nihar; Sun, Xinghua; Farrell, Richard A.; Smith, Jeffrey T.; Ko, Akiteru; Metz, Andrew W.; Biolsi, Peter; Devilliers, Anton
2017-04-01
Multipatterning has enabled continued scaling of chip technology at the 28nm node and beyond. Self-aligned double patterning (SADP) and self-aligned quadruple patterning (SAQP) as well as Litho-Etch/Litho-Etch (LELE) iterations are widely used in the semiconductor industry to enable patterning at sub-193 immersion lithography resolutions for layers such as FIN, Gate and critical Metal lines. Multipatterning requires the use of multiple masks, which is costly and increases process complexity as well as edge placement error variation driven mostly by overlay. To mitigate the strict overlay requirements for advanced technology nodes (7nm and below), a self-aligned blocking integration is desirable. This integration trades off the overlay requirement for an etch selectivity requirement and enables the cut mask overlay tolerance to be relaxed from half pitch to three times half pitch. Self-alignment has become the latest trend to enable scaling, and self-aligned integrations are being pursued and investigated for various critical layers such as contact, via and metal patterning. In this paper we propose and demonstrate a low-cost, flexible self-aligned blocking strategy for critical metal layer patterning for 7nm and beyond, from mask assembly to low-k dielectric etch. The integration is based on a 40nm pitch SADP flow with 2 cut masks compatible with either cut or block integration and employs dielectric films widely used in the back end of the line. As a consequence, this approach is compatible with traditional etch, deposition and cleans tools that are optimized for dielectric etches. We will review the critical steps and selectivities required to enable this integration along with benchmarking of each integration option (cut vs. block).
Chaput, Ludovic; Martinez-Sanz, Juan; Saettel, Nicolas; Mouawad, Liliane
2016-01-01
In a structure-based virtual screening, the choice of the docking program is essential for the success of a hit identification. Benchmarks are meant to help in guiding this choice, especially when undertaken on a large variety of protein targets. Here, the performance of four popular virtual screening programs, Gold, Glide, Surflex and FlexX, is compared using the Directory of Useful Decoys-Enhanced database (DUD-E), which includes 102 targets with an average of 224 ligands per target and 50 decoys per ligand, generated to avoid biases in the benchmarking. Then, a relationship between these program performances and the properties of the targets or the small molecules was investigated. The comparison was based on two metrics, with three different parameters each. The BEDROC scores with α = 80.5 indicated that, on the overall database, Glide succeeded (score > 0.5) for 30 targets, Gold for 27, FlexX for 14 and Surflex for 11. The performance did not depend on the hydrophobicity or the openness of the protein cavities, nor on the families to which the proteins belong. However, despite the care in the construction of the DUD-E database, the small differences that remain between the actives and the decoys likely explain the successes of Gold, Surflex and FlexX. Moreover, the similarity between the actives of a target and its crystal structure ligand seems to be at the basis of the good performance of Glide. When all targets with significant biases are removed from the benchmarking, a subset of 47 targets remains, for which Glide succeeded for only 5 targets, Gold for 4 and FlexX and Surflex for 2. The dramatic drop in performance of all four programs when the biases are removed shows that we should beware of virtual screening benchmarks, because good performance may be due to the wrong reasons.
Therefore, benchmarking would hardly provide guidelines for virtual screening experiments, despite the tendency that is maintained, i.e., Glide and Gold display better performance than FlexX and Surflex. We recommend always using several programs and combining their results. Graphical Abstract: Summary of the results obtained by virtual screening with the four programs, Glide, Gold, Surflex and FlexX, on the 102 targets of the DUD-E database. The percentage of targets with successful results, i.e., with BEDROC(α = 80.5) > 0.5, is shown in blue when the entire database is considered and in red when targets with biased chemical libraries are removed.
Comparative Protein Structure Modeling Using MODELLER.
Webb, Benjamin; Sali, Andrej
2014-09-08
Functional characterization of a protein sequence is one of the most frequent problems in biology. This task is usually facilitated by accurate three-dimensional (3-D) structure of the studied protein. In the absence of an experimentally determined structure, comparative or homology modeling can sometimes provide a useful 3-D model for a protein that is related to at least one known protein structure. Comparative modeling predicts the 3-D structure of a given protein sequence (target) based primarily on its alignment to one or more proteins of known structure (templates). The prediction process consists of fold assignment, target-template alignment, model building, and model evaluation. This unit describes how to calculate comparative models using the program MODELLER and discusses all four steps of comparative modeling, frequently observed errors, and some applications. Modeling lactate dehydrogenase from Trichomonas vaginalis (TvLDH) is described as an example. The download and installation of the MODELLER software is also described. Copyright © 2014 John Wiley & Sons, Inc.
Holm, Liisa; Laakso, Laura M
2016-07-08
The Dali server (http://ekhidna2.biocenter.helsinki.fi/dali) is a network service for comparing protein structures in 3D. In favourable cases, comparing 3D structures may reveal biologically interesting similarities that are not detectable by comparing sequences. The Dali server has been running in various places for over 20 years and is used routinely by crystallographers on newly solved structures. The latest update of the server provides enhanced analytics for the study of sequence and structure conservation. The server performs three types of structure comparisons: (i) Protein Data Bank (PDB) search compares one query structure against those in the PDB and returns a list of similar structures; (ii) pairwise comparison compares one query structure against a list of structures specified by the user; and (iii) all against all structure comparison returns a structural similarity matrix, a dendrogram and a multidimensional scaling projection of a set of structures specified by the user. Structural superimpositions are visualized using the Java-free WebGL viewer PV. The structural alignment view is enhanced by sequence similarity searches against Uniprot. The combined structure-sequence alignment information is compressed to a stack of aligned sequence logos. In the stack, each structure is structurally aligned to the query protein and represented by a sequence logo. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
NASA Technical Reports Server (NTRS)
Mills, I.; Cohen, C. R.; Kamal, K.; Li, G.; Shin, T.; Du, W.; Sumpio, B. E.
1997-01-01
Smooth muscle cell (SMC) phenotype can be altered by physical forces as demonstrated by cyclic strain-induced changes in proliferation, orientation, and secretion of macromolecules. However, the magnitude of strain required and the intracellular coupling pathways remain ill defined. To examine the strain requirements for SMC proliferation, we selectively seeded bovine aortic SMC either on the center or periphery of silastic membranes which were deformed with 150 mm Hg vacuum (0-7% center; 7-24% periphery). SMC located in either the center or peripheral regions showed enhanced proliferation compared to cells grown under the absence of cyclic strain. Moreover, SMC located in the center region demonstrated significantly (P < 0.005) greater proliferation as compared to those in the periphery. In contrast, SMC exposed to high strain (7-24%) demonstrated alignment perpendicular to the strain gradient, whereas SMC in the center (0-7%) remained aligned randomly. To determine the mechanisms of these phenomena, we examined the effect of cyclic strain on bovine aortic SMC signaling pathways. We observed strain-induced stimulation of the cyclic AMP pathway including adenylate cyclase activity and cyclic AMP accumulation. In addition, exposure of SMC to cyclic strain caused a significant increase in protein kinase C (PKC) activity and enzyme translocation from the cytosol to a particulate fraction. Further study was conducted to examine the effect of strain magnitude on signaling, particularly protein kinase A (PKA) activity as well as cAMP response element (CRE) binding protein levels. We observed significantly (P < 0.05) greater PKA activity and CRE binding protein levels in SMC located in the center as compared to the peripheral region. However, inhibition of PKA (with 10 microM Rp-cAMP) or PKC (with 5-20 ng/ml staurosporine) failed to alter either the strain-induced increase in SMC proliferation or alignment. 
These data characterize the strain determinants for activation of SMC proliferation and alignment. Although strain activated both the AC/cAMP/PKA and the PKC pathways in SMC, singular inhibition of PKA and PKC failed to prevent strain-induced alignment and proliferation, suggesting either their lack of involvement or the multifactorial nature of these responses.
O'Donoghue, Patrick; Luthey-Schulten, Zaida
2005-02-25
We present a new algorithm, based on the multidimensional QR factorization, to remove redundancy from a multiple structural alignment by choosing representative protein structures that best preserve the phylogenetic tree topology of the homologous group. The classical QR factorization with pivoting, developed as a fast numerical solution to eigenvalue and linear least-squares problems of the form Ax=b, was designed to re-order the columns of A by increasing linear dependence. Removing the most linear dependent columns from A leads to the formation of a minimal basis set which well spans the phase space of the problem at hand. By recasting the problem of redundancy in multiple structural alignments into this framework, in which the matrix A now describes the multiple alignment, we adapted the QR factorization to produce a minimal basis set of protein structures which best spans the evolutionary (phase) space. The non-redundant and representative profiles obtained from this procedure, termed evolutionary profiles, are shown in initial results to outperform well-tested profiles in homology detection searches over a large sequence database. A measure of structural similarity between homologous proteins, Q(H), is presented. By properly accounting for the effect and presence of gaps, a phylogenetic tree computed using this metric is shown to be congruent with the maximum-likelihood sequence-based phylogeny. The results indicate that evolutionary information is indeed recoverable from the comparative analysis of protein structure alone. Applications of the QR ordering and this structural similarity metric to analyze the evolution of structure among key, universally distributed proteins involved in translation, and to the selection of representatives from an ensemble of NMR structures are also discussed.
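The column re-ordering performed by classical QR factorization with pivoting can be sketched as follows. This is a plain two-dimensional illustration using modified Gram-Schmidt, not the authors' multidimensional algorithm; the function name, the tolerance and the list-of-columns input format are assumptions made for the example.

```python
import math

def pivoted_column_order(columns, tol=1e-10):
    """Order columns as QR with column pivoting would.

    columns: list of equal-length numeric vectors (the columns of A).
    Returns (order, redundant): indices picked from most to least
    linearly independent, and indices whose residual fell below tol.
    """
    residual = [[float(x) for x in col] for col in columns]
    remaining = list(range(len(columns)))
    order = []
    while remaining:
        norms = {j: math.sqrt(sum(x * x for x in residual[j])) for j in remaining}
        pivot = max(remaining, key=lambda j: norms[j])
        if norms[pivot] <= tol:
            break  # everything left is spanned by the chosen columns
        order.append(pivot)
        remaining.remove(pivot)
        q = [x / norms[pivot] for x in residual[pivot]]
        # Remove the component along q from every remaining column.
        for j in remaining:
            proj = sum(a * b for a, b in zip(q, residual[j]))
            residual[j] = [a - proj * b for a, b in zip(residual[j], q)]
    return order, remaining
```

Columns left in the second return value are (numerically) linear combinations of those already chosen, which is the sense in which dropping them removes redundancy while keeping a basis that still spans the original set.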
Pinto, Joshua G. A.; Jones, David G.; Williams, C. Kate; Murphy, Kathryn M.
2015-01-01
Although many potential neuroplasticity-based therapies have been developed in the lab, few have translated into established clinical treatments for human neurologic or neuropsychiatric diseases. Animal models, especially of the visual system, have shaped our understanding of neuroplasticity by characterizing the mechanisms that promote neural changes and defining timing of the sensitive period. The lack of knowledge about development of synaptic plasticity mechanisms in human cortex, and about alignment of synaptic age between animals and humans, has limited translation of neuroplasticity therapies. In this study, we quantified expression of a set of highly conserved pre- and post-synaptic proteins (Synapsin, Synaptophysin, PSD-95, Gephyrin) and found that synaptic development in human primary visual cortex (V1) continues into late childhood. Indeed, this is many years longer than suggested by neuroanatomical studies and points to a prolonged sensitive period for plasticity in human sensory cortex. In addition, during childhood we found waves of inter-individual variability that are different for the four proteins and include a stage during early development (<1 year) when only Gephyrin has high inter-individual variability. We also found that pre- and post-synaptic protein balances develop quickly, suggesting that maturation of certain synaptic functions happens within the first year or two of life. A multidimensional analysis (principal component analysis) showed that most of the variance was captured by the sum of the four synaptic proteins. We used that sum to compare development of human and rat visual cortex and identified a simple linear equation that provides robust alignment of synaptic age between humans and rats. Alignment of synaptic ages is important for age-appropriate targeting and effective translation of neuroplasticity therapies from the lab to the clinic. PMID:25729353
The evolution of function within the Nudix homology clan
Srouji, John R.; Xu, Anting; Park, Annsea; Kirsch, Jack F.
2017-01-01
The Nudix homology clan encompasses over 80,000 protein domains from all three domains of life, defined by homology to each other. Proteins with a domain from this clan fall into four general functional classes: pyrophosphohydrolases, isopentenyl diphosphate isomerases (IDIs), adenine/guanine mismatch‐specific adenine glycosylases (A/G‐specific adenine glycosylases), and nonenzymatic activities such as protein/protein interaction and transcriptional regulation. The largest group, pyrophosphohydrolases, encompasses more than 100 distinct hydrolase specificities. To understand the evolution of this vast number of activities, we assembled and analyzed experimental and structural data for 205 Nudix proteins collected from the literature. We corrected erroneous functions or provided more appropriate descriptions for 53 annotations described in the Gene Ontology Annotation database in this family, and propose 275 new experimentally‐based annotations. We manually constructed a structure‐guided sequence alignment of 78 Nudix proteins. Using the structural alignment as a seed, we then made an alignment of 347 “select” Nudix homology domains, curated from structurally determined, functionally characterized, or phylogenetically important Nudix domains. Based on our review of Nudix pyrophosphohydrolase structures and specificities, we further analyzed a loop region downstream of the Nudix hydrolase motif previously shown to contact the substrate molecule and possess known functional motifs. This loop region provides a potential structural basis for the functional radiation and evolution of substrate specificity within the hydrolase family. Finally, phylogenetic analyses of the 347 select protein domains and of the complete Nudix homology clan revealed general monophyly with regard to function and a few instances of probable homoplasy. Proteins 2017; 85:775–811. © 2016 Wiley Periodicals, Inc. PMID:27936487
GraphCrunch 2: Software tool for network modeling, alignment and clustering.
Kuchaiev, Oleksii; Stevanović, Aleksandar; Hayes, Wayne; Pržulj, Nataša
2011-01-19
Recent advancements in experimental biotechnology have produced large amounts of protein-protein interaction (PPI) data. The topology of PPI networks is believed to have a strong link to their function. Hence, the abundance of PPI data for many organisms stimulates the development of computational techniques for the modeling, comparison, alignment, and clustering of networks. In addition, finding representative models for PPI networks will improve our understanding of the cell just as a model of gravity has helped us understand planetary motion. To decide if a model is representative, we need quantitative comparisons of model networks to real ones. However, exact network comparison is computationally intractable and therefore several heuristics have been used instead. Some of these heuristics are easily computable "network properties," such as the degree distribution, or the clustering coefficient. An important special case of network comparison is the network alignment problem. Analogous to sequence alignment, this problem asks to find the "best" mapping between regions in two networks. It is expected that network alignment might have as strong an impact on our understanding of biology as sequence alignment has had. Topology-based clustering of nodes in PPI networks is another example of an important network analysis problem that can uncover relationships between interaction patterns and phenotype. We introduce the GraphCrunch 2 software tool, which addresses these problems. It is a significant extension of GraphCrunch which implements the most popular random network models and compares them with the data networks with respect to many network properties. Also, GraphCrunch 2 implements the GRAph ALigner algorithm ("GRAAL") for purely topological network alignment. GRAAL can align any pair of networks and exposes large, dense, contiguous regions of topological and functional similarities far larger than any other existing tool. 
Finally, GraphCrunch 2 implements an algorithm for clustering nodes within a network based solely on their topological similarities. Using GraphCrunch 2, we demonstrate that eukaryotic and viral PPI networks may belong to different graph model families and show that topology-based clustering can reveal important functional similarities between proteins within yeast and human PPI networks. GraphCrunch 2 is a software tool that implements the latest research on biological network analysis. It parallelizes computationally intensive tasks to fully utilize the potential of modern multi-core CPUs. It is open-source and freely available for research use. It runs under the Windows and Linux platforms.
Ferrada, Evandro; Vergara, Ismael A; Melo, Francisco
2007-01-01
The correct discrimination between native and near-native protein conformations is essential for achieving accurate computer-based protein structure prediction. However, this has proven to be a difficult task, since currently available physical energy functions, empirical potentials and statistical scoring functions are still limited in achieving this goal consistently. In this work, we assess and compare the ability of different full atom knowledge-based potentials to discriminate between native protein structures and near-native protein conformations generated by comparative modeling. Using a benchmark of 152 near-native protein models and their corresponding native structures that encompass several different folds, we demonstrate that the incorporation of close non-bonded pairwise atom terms improves the discriminating power of the empirical potentials. Since the direct and unbiased derivation of close non-bonded terms from current experimental data is not possible, we obtained and used those terms from the corresponding pseudo-energy functions of a non-local knowledge-based potential. It is shown that this methodology significantly improves the discrimination between native and near-native protein conformations, suggesting that a proper description of close non-bonded terms is important to achieve a more complete and accurate description of native protein conformations. Some external knowledge-based energy functions that are widely used in model assessment performed poorly, indicating that the benchmark of models and the specific discrimination task tested in this work constitutes a difficult challenge.
Multilabel learning via random label selection for protein subcellular multilocations prediction.
Wang, Xiao; Li, Guo-Zheng
2013-01-01
Prediction of protein subcellular localization is an important but challenging problem, particularly when proteins may simultaneously exist at, or move between, two or more different subcellular location sites. Most of the existing protein subcellular localization methods deal only with single-location proteins. In the past few years, only a few methods have been proposed to tackle proteins with multiple locations. However, they adopt only a simple strategy, that is, transforming the multilocation proteins into multiple proteins with a single location, which does not take correlations among different subcellular locations into account. In this paper, a novel method named random label selection (RALS) (multilabel learning via RALS), which extends the simple binary relevance (BR) method, is proposed to learn from multilocation proteins in an effective and efficient way. RALS does not explicitly find the correlations among labels, but rather implicitly attempts to learn the label correlations from data by augmenting the original feature space with randomly selected labels as additional input features. Through a fivefold cross-validation test on a benchmark data set, we demonstrate that our proposed method, which considers label correlations, clearly outperforms the baseline BR method, which does not, indicating that correlations among different subcellular locations really exist and contribute to improved prediction performance. Experimental results on two benchmark data sets also show that our proposed methods achieve significantly higher performance than some other state-of-the-art methods in predicting subcellular multilocations of proteins. The prediction web server is available at http://levis.tongji.edu.cn:8080/bioinfo/MLPred-Euk/ for public use.
Measurement and validation of benchmark-quality thick-target tungsten X-ray spectra below 150 kVp.
Mercier, J R; Kopp, D T; McDavid, W D; Dove, S B; Lancaster, J L; Tucker, D M
2000-11-01
Pulse-height distributions of two constant potential X-ray tubes with fixed anode tungsten targets were measured and unfolded. The measurements employed quantitative alignment of the beam, the use of two different semiconductor detectors (high-purity germanium and cadmium-zinc-telluride), two different ion chamber systems with beam-specific calibration factors, and various filter and tube potential combinations. Monte Carlo response matrices were generated for each detector for unfolding the pulse-height distributions into spectra incident on the detectors. These response matrices were validated for the low error bars assigned to the data. A significant aspect of the validation of spectra, and a detailed characterization of the X-ray tubes, involved measuring filtered and unfiltered beams at multiple tube potentials (30-150 kVp). Full corrections to ion chamber readings were employed to convert normalized fluence spectra into absolute fluence spectra. The characterization of fixed anode pitting and its dominance over exit window plating and/or detector dead layer was determined. An Appendix of tabulated benchmark spectra with assigned error ranges was developed for future reference.
Myhrer, T; Evans, J L; Haugen, H K; Gorman, C; Kavanagh, Y; Cameron, A B
2016-08-01
Dental technology programmes of study must prepare students to practice in a broad range of contemporary workplaces. Currently, there is limited evidence to benchmark dental technology education - locally, nationally or internationally. This research aims to improve consistency, transparency and portability of dental technology qualifications across three countries. Data were accessed from open-source curriculum documents and five calibrated assessment items. Three institutions collaborated with Oslo and Akershus University College, Norway; Trinity College Dublin, Ireland; and Griffith University, Australia. From these, 29-44 students completed 174 assessments. The curricula reflect the community needs of each country and display common themes that underpin professional dental technology practice. Assessment results differed between institutions but no more than a normal distribution. Face-to-face assessment moderation was critical to achieve consistency. This collaborative research has led to the development of a set of guidelines for other dental technology education providers interested in developing or aligning courses internationally to enhance the portability of qualifications. © 2015 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.
Deng, Lei; Wu, Hongjie; Liu, Chuyao; Zhan, Weihua; Zhang, Jingpu
2018-06-01
Long non-coding RNAs (lncRNAs) are involved in many biological processes, such as immune response, development, differentiation and gene imprinting and are associated with diseases and cancers. But the functions of the vast majority of lncRNAs are still unknown. Predicting the biological functions of lncRNAs is one of the key challenges in the post-genomic era. In our work, we first build a global network including a lncRNA similarity network, a lncRNA-protein association network and a protein-protein interaction network according to the expressions and interactions, then extract the topological feature vectors of the global network. Using these features, we present an SVM-based machine learning approach, PLNRGO, to annotate human lncRNAs. In PLNRGO, we construct a training data set according to the proteins with GO annotations and train a binary classifier for each GO term. We assess the performance of PLNRGO on our manually annotated lncRNA benchmark and a protein-coding gene benchmark with known functional annotations. As a result, the performance of our method is significantly better than that of other state-of-the-art methods in terms of maximum F-measure and coverage. Copyright © 2018 Elsevier Ltd. All rights reserved.
Anekthanakul, Krittima; Hongsthong, Apiradee; Senachak, Jittisak; Ruengjitchatchawalya, Marasri
2018-04-20
Bioactive peptides, including biological sources-derived peptides with different biological activities, are protein fragments that influence the functions or conditions of organisms, in particular humans and animals. Conventional methods of identifying bioactive peptides are time-consuming and costly. To speed up the process, several bioinformatics tools have recently been used to facilitate screening of potential peptides prior to their activity assessment in vitro and/or in vivo. In this study, we developed an efficient computational method, SpirPep, which offers many advantages over the currently available tools. The SpirPep web application tool is a one-stop analysis and visualization facility to assist bioactive peptide discovery. The tool is equipped with 15 customized enzymes and 1-3 miscleavage options, which allows in silico digestion of protein sequences encoded by protein-coding genes at single-protein, multiple-protein, or genome-wide scale, and then directly classifies the peptides by bioactivity using an in-house database that contains bioactive peptides collected from 13 public databases. With this tool, the resulting peptides are categorized by each selected enzyme, and shown in a tabular format where the peptide sequences can be tracked back to their original proteins. The developed tool and webpages are coded in PHP and HTML with CSS/JavaScript. Moreover, the tool allows protein-peptide alignment visualization by Generic Genome Browser (GBrowse) to display the region and details of the proteins and peptides within each parameter, while considering digestion design for the desirable bioactivity. SpirPep is efficient; it takes less than 20 min to digest 3000 proteins (751,860 amino acids) with 15 enzymes and three miscleavages for each enzyme, and only a few seconds for single enzyme digestion. The tool identified more bioactive peptides than the benchmarked tool; an example of a validated pentapeptide (FLPIL) from LC-MS/MS was demonstrated.
The web and database server are available at http://spirpepapp.sbi.kmutt.ac.th. SpirPep, a web-based bioactive peptide discovery application, is an in silico-based tool with an overview of the results. The platform is a one-stop analysis and visualization facility, and offers advantages over the currently available tools. This tool may be useful for further bioactivity analysis and the quantitative discovery of desirable peptides.
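The in silico digestion with miscleavages described in this abstract can be illustrated with a minimal Python sketch. This is not SpirPep's PHP implementation: the function name `digest`, the trypsin-like cleavage rule (cut after K or R unless followed by P), and the example sequence are all assumptions chosen for illustration.

```python
import re

def digest(sequence, max_missed=1, min_len=1):
    """Return (start, peptide) pairs from a trypsin-like digest.

    Assumed rule: cleave after K or R unless the next residue is P.
    max_missed is the number of allowed missed cleavages (miscleavages).
    """
    # End offsets of every cleavable residue in the sequence.
    sites = [m.end() for m in re.finditer(r"[KR](?!P)", sequence)]
    # Fragment boundaries; a site at the very end adds no new boundary.
    bounds = [0] + [s for s in sites if s < len(sequence)] + [len(sequence)]
    peptides = []
    for i in range(len(bounds) - 1):
        # Joining j - i adjacent fragments means j - i - 1 missed cleavages.
        for j in range(i + 1, min(i + 2 + max_missed, len(bounds))):
            pep = sequence[bounds[i]:bounds[j]]
            if len(pep) >= min_len:
                # 1-based start lets a peptide be traced back to its protein.
                peptides.append((bounds[i] + 1, pep))
    return peptides

if __name__ == "__main__":
    for start, pep in digest("AKRPGKFLPIL", max_missed=1):
        print(start, pep)
```

Keeping the start position with each peptide mirrors the abstract's point that peptides can be tracked back to their original proteins; a real tool would also carry the protein identifier and the enzyme used.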
What can one learn from experiments about the elusive transition state?
Chang, Iksoo; Cieplak, Marek; Banavar, Jayanth R.; Maritan, Amos
2004-01-01
We present the results of an exact analysis of a model energy landscape of a protein to clarify the idea of the transition state and the physical meaning of the φ values determined in protein engineering experiments. We benchmark our findings to various theoretical approaches proposed in the literature for the identification and characterization of the transition state. PMID:15295118
Superposition and alignment of labeled point clouds.
Fober, Thomas; Glinca, Serghei; Klebe, Gerhard; Hüllermeier, Eyke
2011-01-01
Geometric objects are often represented approximately in terms of a finite set of points in three-dimensional Euclidean space. In this paper, we extend this representation to what we call labeled point clouds. A labeled point cloud is a finite set of points, where each point is not only associated with a position in three-dimensional space, but also with a discrete class label that represents a specific property. This type of model is especially suitable for modeling biomolecules such as proteins and protein binding sites, where a label may represent an atom type or a physico-chemical property. Proceeding from this representation, we address the question of how to compare two labeled point clouds in terms of their similarity. Using fuzzy modeling techniques, we develop a suitable similarity measure as well as an efficient evolutionary algorithm to compute it. Moreover, we consider the problem of establishing an alignment of the structures in the sense of a one-to-one correspondence between their basic constituents. From a biological point of view, alignments of this kind are of great interest, since mutually corresponding molecular constituents offer important information about evolution and heredity, and can also serve as a means to explain a degree of similarity. In this paper, we therefore develop a method for computing pairwise or multiple alignments of labeled point clouds. To this end, we proceed from an optimal superposition of the corresponding point clouds and construct an alignment which is as much as possible in agreement with the neighborhood structure established by this superposition. We apply our methods to the structural analysis of protein binding sites.
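As a rough illustration of the data structure and the alignment step in this abstract, the sketch below represents a labeled point cloud in Python and derives a one-to-one correspondence from two clouds that are assumed to be already superposed. It is a greedy stand-in, not the fuzzy similarity measure or evolutionary algorithm of the paper; the names `LabeledPoint` and `greedy_alignment`, the distance cutoff, and the example labels are hypothetical.

```python
from dataclasses import dataclass
from math import dist

@dataclass(frozen=True)
class LabeledPoint:
    xyz: tuple   # (x, y, z) position in three-dimensional space
    label: str   # discrete class label, e.g. an atom type

def greedy_alignment(cloud_a, cloud_b, max_dist=2.0):
    """Pair points of two already-superposed clouds, one-to-one.

    Only points with identical labels within max_dist are paired;
    the closest candidate pairs are accepted first.
    """
    candidates = sorted(
        (dist(p.xyz, q.xyz), i, j)
        for i, p in enumerate(cloud_a)
        for j, q in enumerate(cloud_b)
        if p.label == q.label and dist(p.xyz, q.xyz) <= max_dist
    )
    used_a, used_b, pairs = set(), set(), []
    for _, i, j in candidates:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            pairs.append((i, j))
    return sorted(pairs)

if __name__ == "__main__":
    a = [LabeledPoint((0, 0, 0), "donor"), LabeledPoint((5, 0, 0), "acceptor")]
    b = [LabeledPoint((5.5, 0, 0), "acceptor"), LabeledPoint((0.3, 0, 0), "donor")]
    print(greedy_alignment(a, b))
```

A greedy pass is not guaranteed to find the best overall correspondence; it only shows how labels and the neighborhood structure after superposition can jointly constrain the alignment.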
Why Is There a Glass Ceiling for Threading Based Protein Structure Prediction Methods?
Skolnick, Jeffrey; Zhou, Hongyi
2017-04-20
Despite their different implementations, comparison of the best threading approaches to the prediction of evolutionarily distant protein structures reveals that they tend to succeed or fail on the same protein targets. This is true despite the fact that the structural template library has good templates for all cases. Thus, a key question is why certain protein structures are threadable while others are not. Comparison with threading results on a set of artificial sequences selected for stability further argues that the failure of threading is due to the nature of the protein structures themselves. Using a new contact map based alignment algorithm, we demonstrate that certain folds are highly degenerate in that they can have very similar coarse grained fractions of native contacts aligned and yet differ significantly from the native structure. For threadable proteins, this is not the case. Thus, contemporary threading approaches appear to have reached a plateau, and new approaches to structure prediction are required.
RaptorX server: a resource for template-based protein structure modeling.
Källberg, Morten; Margaryan, Gohar; Wang, Sheng; Ma, Jianzhu; Xu, Jinbo
2014-01-01
Assigning functional properties to a newly discovered protein is a key challenge in modern biology. To this end, computational modeling of the three-dimensional atomic arrangement of the amino acid chain is often crucial in determining the role of the protein in biological processes. We present a community-wide web-based protocol, RaptorX server (http://raptorx.uchicago.edu), for automated protein secondary structure prediction, template-based tertiary structure modeling, and probabilistic alignment sampling. Given a target sequence, RaptorX server is able to detect even remotely related template sequences by means of a novel nonlinear context-specific alignment potential and probabilistic consistency algorithm. Using the protocol presented here it is thus possible to obtain high-quality structural models for many target protein sequences when only distantly related protein domains have experimentally solved structures. At present, RaptorX server can perform secondary and tertiary structure prediction of a 200 amino acid target sequence in approximately 30 min.
Node fingerprinting: an efficient heuristic for aligning biological networks.
Radu, Alex; Charleston, Michael
2014-10-01
With the continuing increase in availability of biological data and improvements to biological models, biological network analysis has become a promising area of research. An emerging technique for the analysis of biological networks is through network alignment. Network alignment has been used to calculate genetic distance, similarities between regulatory structures, and the effect of external forces on gene expression, and to depict conditional activity of expression modules in cancer. Network alignment is algorithmically complex, and therefore we must rely on heuristics, ideally as efficient and accurate as possible. The majority of current techniques for network alignment rely on precomputed information, such as with protein sequence alignment, or on tunable network alignment parameters, which may introduce an increased computational overhead. Our presented algorithm, which we call Node Fingerprinting (NF), is appropriate for performing global pairwise network alignment without precomputation or tuning, can be fully parallelized, and is able to quickly compute an accurate alignment between two biological networks. It has performed as well as or better than existing algorithms on biological and simulated data, and with fewer computational resources. The algorithmic validation performed demonstrates the low computational resource requirements of NF.
Muth, Thilo; García-Martín, Juan A; Rausell, Antonio; Juan, David; Valencia, Alfonso; Pazos, Florencio
2012-02-15
We have implemented in a single package all the features required for extracting, visualizing and manipulating fully conserved positions as well as those with a family-dependent conservation pattern in multiple sequence alignments. The program allows users, among other things, to run different methods for extracting these positions, combine the results and visualize them in protein 3D structures and sequence spaces. JDet is a multiplatform application written in Java. It is freely available, including the source code, at http://csbg.cnb.csic.es/JDet. The package includes two of our recently developed programs for detecting functional positions in protein alignments (Xdet and S3Det), and support for other methods can be added as plug-ins. A help file and a guided tutorial for JDet are also available.
Rebelling for a Reason: Protein Structural “Outliers”
Arumugam, Gandhimathi; Nair, Anu G.; Hariharaputran, Sridhar; Ramanathan, Sowdhamini
2013-01-01
Analysis of structural variation in domain superfamilies can reveal constraints in protein evolution which aids protein structure prediction and classification. Structure-based sequence alignment of distantly related proteins, organized in PASS2 database, provides clues about structurally conserved regions among different functional families. Some superfamily members show large structural differences which are functionally relevant. This paper analyses the impact of structural divergence on function for multi-member superfamilies, selected from the PASS2 superfamily alignment database. Functional annotations within superfamilies, with structural outliers or ‘rebels’, are discussed in the context of structural variations. Overall, these data reinforce the idea that functional similarities cannot be extrapolated from mere structural conservation. The implication for fold-function prediction is that the functional annotations can only be inherited with very careful consideration, especially at low sequence identities. PMID:24073209
Aligning nanodiscs at the air-water interface, a neutron reflectivity study.
Wadsäter, Maria; Simonsen, Jens B; Lauridsen, Torsten; Tveten, Erlend Grytli; Naur, Peter; Bjørnholm, Thomas; Wacklin, Hanna; Mortensen, Kell; Arleth, Lise; Feidenhans'l, Robert; Cárdenas, Marité
2011-12-20
Nanodiscs are self-assembled nanostructures composed of a belt protein and a small patch of lipid bilayer, which can solubilize membrane proteins in a lipid bilayer environment. We present a method for the alignment of a well-defined two-dimensional layer of nanodiscs at the air-water interface by careful design of an insoluble surfactant monolayer at the surface. We used neutron reflectivity to demonstrate the feasibility of this approach and to elucidate the structure of the nanodisc layer. The proof of concept is hereby presented with the use of nanodiscs composed of a mixture of two different lipid types (DMPC and DMPG) to obtain a net overall negative charge of the nanodiscs. We find that the nanodisc layer has a thickness of 40.9 ± 2.6 Å with a surface coverage of 66 ± 4%. This layer is located about 15 Å below a cationic surfactant layer at the air-water interface. The high level of organization within the nanodisc layer is reflected by the low interfacial roughness found (~4.5 Å). The use of the nanodisc as a biomimetic model of the cell membrane allows for studies of single membrane proteins isolated in a confined lipid environment. The 2D alignment of nanodiscs could therefore enable studies of high-density layers containing membrane proteins that, in contrast to membrane proteins reconstituted in a continuous lipid bilayer, remain isolated from influences of neighboring membrane proteins within the layer. © 2011 American Chemical Society
AllergenFP: allergenicity prediction by descriptor fingerprints.
Dimitrov, Ivan; Naneva, Lyudmila; Doytchinova, Irini; Bangov, Ivan
2014-03-15
Allergenicity, like antigenicity and immunogenicity, is a property encoded linearly and non-linearly, and therefore the alignment-based approaches are not able to identify this property unambiguously. A novel alignment-free descriptor-based fingerprint approach is presented here and applied to identify allergens and non-allergens. The approach was implemented as a four-step algorithm. Initially, the protein sequences are described by amino acid principal properties such as hydrophobicity, size, relative abundance, helix and β-strand forming propensities. Then, the generated strings of different length are converted into vectors of equal length by auto- and cross-covariance (ACC). The vectors were transformed into binary fingerprints and compared in terms of the Tanimoto coefficient. The approach was applied to a set of 2427 known allergens and 2427 non-allergens and correctly identified 88% of them with a Matthews correlation coefficient of 0.759. The descriptor fingerprint approach presented here is universal. It could be applied to any classification problem in computational biology. The set of E-descriptors is able to capture the main structural and physicochemical properties of the amino acids building the proteins. The ACC transformation overcomes the main problem in alignment-based comparative studies arising from the different lengths of the aligned protein sequences. The conversion of protein ACC values into binary descriptor fingerprints allows similarity search and classification. The algorithm described in the present study was implemented in a specially designed Web site, named AllergenFP (FP stands for FingerPrint). AllergenFP is written in Python, with a GUI in HTML. It is freely accessible at http://ddg-pharmfac.net/AllergenFP. Contact: idoytchinova@pharmfac.net or ivanbangov@shu-bg.net.
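Two steps named in this abstract, the ACC transformation and the Tanimoto comparison, can be sketched in Python as follows. The ACC formula used here (mean lagged product of two descriptor columns) is the commonly cited form and is an assumption about the paper's exact definition; the descriptor values in the example are placeholders rather than real E-descriptors, and the binarisation step is omitted because the abstract does not specify it.

```python
def acc(descriptors, j, k, lag):
    """Auto-/cross-covariance of descriptor columns j and k at a given lag.

    descriptors is a list of per-residue descriptor rows; the sequence
    must be longer than lag.
    """
    n = len(descriptors)
    total = sum(descriptors[i][j] * descriptors[i + lag][k] for i in range(n - lag))
    return total / (n - lag)

def acc_vector(descriptors, max_lag):
    """Fixed-length vector with one value per (lag, j, k) combination.

    Its length depends only on the number of descriptors and max_lag,
    not on the sequence length.
    """
    n_desc = len(descriptors[0])
    return [acc(descriptors, j, k, lag)
            for lag in range(1, max_lag + 1)
            for j in range(n_desc)
            for k in range(n_desc)]

def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two equal-length binary fingerprints."""
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    return both / either if either else 1.0

if __name__ == "__main__":
    short_seq = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0]]   # 3 residues, 2 descriptors
    long_seq = [[0.5, -1.0]] * 6                       # 6 residues, 2 descriptors
    print(len(acc_vector(short_seq, 2)), len(acc_vector(long_seq, 2)))
    print(tanimoto([1, 1, 0, 0], [1, 0, 1, 0]))
```

Both example sequences map to vectors of the same length, which is the property the abstract credits with removing the need for alignment.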
Improving pairwise comparison of protein sequences with domain co-occurrence
Gascuel, Olivier
2018-01-01
Comparing and aligning protein sequences is an essential task in bioinformatics. More specifically, local alignment tools like BLAST are widely used for identifying conserved protein sub-sequences, which likely correspond to protein domains or functional motifs. However, to limit the number of false positives, these tools are used with stringent sequence-similarity thresholds and hence can miss several hits, especially for species that are phylogenetically distant from reference organisms. A solution to this problem is to integrate additional contextual information into the procedure. Here, we propose to use domain co-occurrence to increase the sensitivity of pairwise sequence comparisons. Domain co-occurrence is a strong feature of proteins, since most protein domains tend to appear with a limited number of other domains on the same protein. We propose a method to take this information into account in a typical BLAST analysis and to construct new domain families on the basis of these results. We used Plasmodium falciparum as a case study to evaluate our method. The experimental findings showed a 14% increase in the number of significant BLAST hits and a 25% increase in the proteome area that can be covered with a domain. Our method identified 2240 new domains for which, in most cases, no model of the Pfam database could be linked. Moreover, our study of the quality of the new domains in terms of alignment and physicochemical properties shows that they are close to those of standard Pfam domains. Source code of the proposed approach and supplementary data are available at: https://gite.lirmm.fr/menichelli/pairwise-comparison-with-cooccurrence PMID:29293498
2010-01-01
Background Multiple sequence alignments are used to study gene or protein function, phylogenetic relations, genome evolution hypotheses and even gene polymorphisms. Virtually without exception, all available tools focus on conserved segments or residues. Small divergent regions, however, are biologically important for specific quantitative polymerase chain reaction, genotyping, molecular markers and preparation of specific antibodies, and yet have received little attention. As a consequence, they must be selected empirically by the researcher. AlignMiner has been developed to fill this gap in bioinformatic analyses. Results AlignMiner is a Web-based application for detection of conserved and divergent regions in alignments of conserved sequences, focusing particularly on divergence. It accepts alignments (protein or nucleic acid) obtained using any of a variety of algorithms, which does not appear to have a significant impact on the final results. AlignMiner uses different scoring methods for assessing conserved/divergent regions, Entropy being the method that provides the highest number of regions with the greatest length, and Weighted being the most restrictive. Conserved/divergent regions can be generated either with respect to the consensus sequence or to one master sequence. The resulting data are presented in a graphical interface developed in AJAX, which provides remarkable user interaction capabilities. Users do not need to wait until execution is complete and can even inspect their results on a different computer. Data can be downloaded onto a user disk, in standard formats. In silico and experimental proof-of-concept cases have shown that AlignMiner can be successfully used to design specific polymerase chain reaction primers as well as potential epitopes for antibodies. Primer design is assisted by a module that deploys several oligonucleotide parameters for designing primers "on the fly".
Conclusions AlignMiner can be used to reliably detect divergent regions via several scoring methods that provide different levels of selectivity. Its predictions have been verified by experimental means. Hence, it is expected that its usage will save researchers' time and ensure an objective selection of the best-possible divergent region when closely related sequences are analysed. AlignMiner is freely available at http://www.scbi.uma.es/alignminer. PMID:20525162
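The idea of scoring alignment columns by entropy and reporting divergent regions can be sketched in Python as below. This uses plain Shannon entropy per column and a fixed threshold, which is an assumption; AlignMiner's actual Entropy and Weighted scoring schemes are not specified in the abstract, and the function names, threshold, and toy alignment are illustrative only.

```python
from collections import Counter
from math import log2

def column_entropy(column):
    """Shannon entropy (bits) of one alignment column; 0 means fully conserved.

    Gap characters are treated as an ordinary symbol in this sketch.
    """
    counts = Counter(column)
    total = len(column)
    return sum(-(c / total) * log2(c / total) for c in counts.values())

def divergent_regions(alignment, threshold=0.5, min_len=2):
    """Return 0-based half-open (start, end) spans of high-entropy columns."""
    scores = [column_entropy(col) for col in zip(*alignment)]
    regions, start = [], None
    # The trailing 0.0 is a sentinel that closes a region at the alignment end.
    for pos, score in enumerate(scores + [0.0]):
        if score > threshold and start is None:
            start = pos
        elif score <= threshold and start is not None:
            if pos - start >= min_len:
                regions.append((start, pos))
            start = None
    return regions

if __name__ == "__main__":
    aln = ["ACGTAC",
           "ACGAGC",
           "ACGCTC"]
    print(divergent_regions(aln))
```

Lowering `threshold` or `min_len` reports more regions, which loosely parallels the abstract's remark that different scoring methods give different levels of selectivity.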
Hybrid photodetector for single-molecule spectroscopy and microscopy
Michalet, X.; Cheng, Adrian; Antelman, Joshua; Suyama, Motohiro; Arisaka, Katsushi; Weiss, Shimon
2011-01-01
We report benchmark tests of a new single-photon counting detector based on a GaAsP photocathode and an electron-bombarded avalanche photodiode developed by Hamamatsu Photonics. We compare its performance with those of standard Geiger-mode avalanche photodiodes. We show its advantages for FCS due to the absence of after-pulsing and for fluorescence lifetime measurements due to its excellent time resolution. Its large sensitive area also greatly simplifies setup alignment. Its spectral sensitivity being similar to that of recently introduced CMOS SPADs, this new detector could become a valuable tool for single-molecule fluorescence measurements, as well as for many other applications. PMID:21822361
Cloud4Psi: cloud computing for 3D protein structure similarity searching.
Mrozek, Dariusz; Małysiak-Mrozek, Bożena; Kłapciński, Artur
2014-10-01
Popular methods for 3D protein structure similarity searching, especially those that generate high-quality alignments such as Combinatorial Extension (CE) and Flexible structure Alignment by Chaining Aligned fragment pairs allowing Twists (FATCAT), are still time consuming. As a consequence, performing similarity searching against large repositories of structural data requires increased computational resources that are not always available. Cloud computing provides huge amounts of computational power that can be provisioned on a pay-as-you-go basis. We have developed a cloud-based system that allows scaling of the similarity searching process vertically and horizontally. Cloud4Psi (Cloud for Protein Similarity) was tested in the Microsoft Azure cloud environment and provided good, almost linearly proportional acceleration when scaled out onto many computational units. Cloud4Psi is available as Software as a Service for testing purposes at: http://cloud4psi.cloudapp.net/. For source code and software availability, please visit the Cloud4Psi project home page at http://zti.polsl.pl/dmrozek/science/cloud4psi.htm. © The Author 2014. Published by Oxford University Press.
Cloud4Psi: cloud computing for 3D protein structure similarity searching
Mrozek, Dariusz; Małysiak-Mrozek, Bożena; Kłapciński, Artur
2014-01-01
Summary: Popular methods for 3D protein structure similarity searching, especially those that generate high-quality alignments such as Combinatorial Extension (CE) and Flexible structure Alignment by Chaining Aligned fragment pairs allowing Twists (FATCAT), are still time consuming. As a consequence, performing similarity searching against large repositories of structural data requires increased computational resources that are not always available. Cloud computing provides huge amounts of computational power that can be provisioned on a pay-as-you-go basis. We have developed a cloud-based system that allows scaling of the similarity searching process vertically and horizontally. Cloud4Psi (Cloud for Protein Similarity) was tested in the Microsoft Azure cloud environment and provided good, almost linearly proportional acceleration when scaled out onto many computational units. Availability and implementation: Cloud4Psi is available as Software as a Service for testing purposes at: http://cloud4psi.cloudapp.net/. For source code and software availability, please visit the Cloud4Psi project home page at http://zti.polsl.pl/dmrozek/science/cloud4psi.htm. Contact: dariusz.mrozek@polsl.pl PMID:24930141
Hamilton, Heidi E; Nelson, Meaghan; Martin, Paul; Cotler, Scott J
2006-04-01
Providers need to communicate projected response rates effectively to enable patients with hepatitis C virus to make informed decisions about therapy. This study used interactional sociolinguistics (1) to evaluate how gastroenterologists and allied health professionals communicate information regarding response rates to antiviral therapy, (2) to determine how these discussions relate to where the patient is in the continuum of evaluation and treatment, (3) to assess whether patients were aligned with providers in their perceptions of response rates after office visits, and (4) to identify factors that improve provider-patient alignment. Gastroenterologists, allied health professionals, and patients with hepatitis C virus were videotaped and audiotaped during regularly scheduled visits. Postvisit interviews were conducted separately with patients and providers. Visits and postvisits were transcribed and analyzed using validated sociolinguistic techniques. The phase of hepatitis C virus treatment shaped the benchmarks of response talk, although across the treatment continuum providers overwhelmingly made strategic use of positive statistics, providing motivation. In postvisit interviews, 55% of providers and patients were aligned on response rates. Patients with a favorable outcome and patients who asked response-related questions in the visit were more likely to be aligned with providers. Areas identified for improvement included the tendency to discuss response rates before an individualized assessment could be made, balancing motivation and accuracy, and assessing the patient's perspective before delivering any bad news, if necessary. Sociolinguistic analysis provides a powerful tool to evaluate provider-patient interactions and to identify ways to improve in-office communication regarding antiviral therapy.
NASA Astrophysics Data System (ADS)
Stern, Luli
2002-11-01
Assessment influences every level of the education system and is one of the most crucial catalysts for reform in science curriculum and instruction. Teachers, administrators, and others who choose, assemble, or develop assessments face the difficulty of judging whether tasks are truly aligned with national or state standards and whether they are effective in revealing what students actually know. Project 2061 of the American Association for the Advancement of Science has developed and field-tested a procedure for analyzing curriculum materials, including their assessments, in terms of how well they are likely to contribute to the attainment of benchmarks and standards. With respect to assessment in curriculum materials, this procedure evaluates whether the assessment has the potential to reveal whether students have attained specific ideas in benchmarks and standards and whether information gained from students' responses can be used to inform subsequent instruction. Using this procedure, Project 2061 has produced a database of analytical reports on nine widely used middle school science curriculum materials. The analysis of assessments included in these materials shows that whereas currently available materials devote significant sections in their instruction to ideas included in national standards documents, students are typically not assessed on these ideas. The analysis results described in the report point to strengths and limitations of these widely used assessments and identify a range of good and poor assessment tasks that can shed light on important characteristics of good assessment.
Febres, Anthony; Vanegas, Oriana; Giammarresi, Michelle; Gomes, Carlos; Díaz, Emilia; Ponte-Sucre, Alicia
2018-07-01
The Calcitonin-Like Receptor (CLR) belongs to the classical seven-transmembrane segment molecules coupled to heterotrimeric G proteins. Its pharmacology depends on the simultaneous expression of the so-called Receptor Activity Modifier Proteins (RAMP)-1, -2 and -3. RAMP-associated proteins modulate glycosylation and cellular traffic of CLR, therefore determining its pharmacodynamics. In higher eukaryotes, the complex formed by CLR and RAMP-1 is more likely to bind Calcitonin Gene-Related Peptide (CGRP), whereas those formed by CLR and RAMP-2 or RAMP-3 preferentially bind Adrenomedullin (AM). In lower eukaryotes, RAMPs, or any homologous protein, have not been identified until now. Herein we demonstrated a negative chemotactic response elicited by CGRP (10⁻⁹ and 10⁻⁸ M) and AM (10⁻⁹ to 10⁻⁵ M). Whether or not this response is receptor mediated should be verified, as well as the expression of a 24 kDa band in Leishmania, recognized in western blot analysis using (human) RAMP-2 antibodies as detection probes. Queries with human RAMP-2 and RAMP-3 protein sequences in blastp against the Leishmania (Viannia) braziliensis predicted proteome allowed us to detect two sequence alignments in the parasite: a RAMP-2-aligned sequence corresponding to Leishmania folylpolyglutamate synthase (FPGS), and a RAMP-3-aligned protein, a hypothetical Leishmania protein of as yet unknown function. Homologs of these proteins were identified in silico in other members of the Trypanosomatidae. These preliminary and not yet complete data suggest that both CGRP and Adrenomedullin activities may be regulated by homologs of RAMP-2 and RAMP-3 in these parasites. Copyright © 2018 Elsevier B.V. All rights reserved.
Slocum, Joshua D; First, Jeremy T; Webb, Lauren J
2017-07-20
Measurement of the magnitude, direction, and functional importance of electric fields in biomolecules has been a long-standing experimental challenge. pKa shifts of titratable residues have been the most widely implemented measurements of the local electrostatic environment around the labile proton, and experimental data sets of pKa shifts in a variety of systems have been used to test and refine computational prediction capabilities of protein electrostatic fields. A more direct and increasingly popular technique to measure electric fields in proteins is Stark effect spectroscopy, where the change in absorption energy of a chromophore relative to a reference state is related to the change in electric field felt by the chromophore. While there are merits to both of these methods and they are both reporters of local electrostatic environment, they are fundamentally different measurements, and to our knowledge there has been no direct comparison of these two approaches in a single protein. We have recently demonstrated that green fluorescent protein (GFP) is an ideal model system for measuring changes in electric fields in a protein interior caused by amino acid mutations using both electronic and vibrational Stark effect chromophores. Here we report the changes in pKa of the GFP fluorophore in response to the same mutations and show that they are in excellent agreement with Stark effect measurements. This agreement in the results of orthogonal experiments reinforces our confidence in the experimental results of both Stark effect and pKa measurements and provides an excellent target data set to benchmark diverse protein electrostatics calculations. We used this experimental data set to test the pKa prediction ability of the adaptive Poisson-Boltzmann solver (APBS) and found that a simple continuum dielectric model of the GFP interior is insufficient to accurately capture the measured pKa and Stark effect shifts.
We discuss some of the limitations of this continuum-based model in this system and offer this experimentally self-consistent data set as a target benchmark for electrostatics models, which could allow for a more rigorous test of pKa prediction techniques due to the unique environment of the water-filled GFP barrel compared to traditional globular proteins.
Learning a peptide-protein binding affinity predictor with kernel ridge regression
2013-01-01
Background The cellular function of a vast majority of proteins is performed through physical interactions with other biomolecules, which, most of the time, are other proteins. Peptides represent templates of choice for mimicking a secondary structure in order to modulate protein-protein interaction. They are thus an interesting class of therapeutics since they also display strong activity, high selectivity, low toxicity and few drug-drug interactions. Furthermore, predicting peptides that would bind to a specific MHC alleles would be of tremendous benefit to improve vaccine based therapy and possibly generate antibodies with greater affinity. Modern computational methods have the potential to accelerate and lower the cost of drug and vaccine discovery by selecting potential compounds for testing in silico prior to biological validation. Results We propose a specialized string kernel for small bio-molecules, peptides and pseudo-sequences of binding interfaces. The kernel incorporates physico-chemical properties of amino acids and elegantly generalizes eight kernels, comprised of the Oligo, the Weighted Degree, the Blended Spectrum, and the Radial Basis Function. We provide a low complexity dynamic programming algorithm for the exact computation of the kernel and a linear time algorithm for it’s approximation. Combined with kernel ridge regression and SupCK, a novel binding pocket kernel, the proposed kernel yields biologically relevant and good prediction accuracy on the PepX database. For the first time, a machine learning predictor is capable of predicting the binding affinity of any peptide to any protein with reasonable accuracy. The method was also applied to both single-target and pan-specific Major Histocompatibility Complex class II benchmark datasets and three Quantitative Structure Affinity Model benchmark datasets. 
Conclusion On all benchmarks, our method significantly (p-value ≤ 0.057) outperforms the current state-of-the-art methods at predicting peptide-protein binding affinities. The proposed approach is flexible and can be applied to predict any quantitative biological activity. Moreover, generating reliable peptide-protein binding affinities will also improve systems biology modelling of interaction pathways. Lastly, the method should be of value to a large segment of the research community with the potential to accelerate the discovery of peptide-based drugs and facilitate vaccine development. The proposed kernel is freely available at http://graal.ift.ulaval.ca/downloads/gs-kernel/. PMID:23497081
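As a rough illustration of how kernel ridge regression operates on a precomputed string kernel, the sketch below solves the dual system (K + lambda I) alpha = y and predicts with the kernel between new and training peptides. The k-mer spectrum kernel used here is only a simple stand-in chosen for brevity, not the kernel proposed in the abstract, and the function names, peptides and affinity values are hypothetical.

```python
import numpy as np
from collections import Counter


def spectrum_kernel(a: str, b: str, k: int = 2) -> float:
    """Count shared k-mers between two peptides (a simple stand-in kernel)."""
    ca = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    cb = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    return float(sum(ca[m] * cb[m] for m in ca))


def gram_matrix(xs, ys, k=2):
    """Kernel values between every peptide in xs and every peptide in ys."""
    return np.array([[spectrum_kernel(x, y, k) for y in ys] for x in xs])


def fit_krr(peptides, affinities, lam=1e-3, k=2):
    """Solve (K + lam * I) alpha = y for the dual coefficients alpha."""
    K = gram_matrix(peptides, peptides, k)
    y = np.asarray(affinities, dtype=float)
    return np.linalg.solve(K + lam * np.eye(len(peptides)), y)


def predict_krr(train_peptides, alpha, new_peptides, k=2):
    """Predict as a kernel-weighted sum over the training peptides."""
    return gram_matrix(new_peptides, train_peptides, k) @ alpha


# Hypothetical toy data: three peptides with made-up affinities.
train = ["ACDK", "WYFH", "GGSA"]
alpha = fit_krr(train, [1.0, 2.0, 3.0])
print(predict_krr(train, alpha, ["ACDW"]))
```

Because the regression only sees the Gram matrix, any positive semi-definite kernel on peptides could be swapped in for `spectrum_kernel` without changing the fitting code.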
Novel Computational Approaches to Drug Discovery
NASA Astrophysics Data System (ADS)
Skolnick, Jeffrey; Brylinski, Michal
2010-01-01
New approaches to protein functional inference based on protein structure and evolution are described. First, FINDSITE, a threading based approach to protein function prediction, is summarized. Then, the results of large scale benchmarking of ligand binding site prediction, ligand screening, including applications to HIV protease, and GO molecular functional inference are presented. A key advantage of FINDSITE is its ability to use low resolution, predicted structures as well as high resolution experimental structures. Then, an extension of FINDSITE to ligand screening in GPCRs using predicted GPCR structures, FINDSITE/QDOCKX, is presented. This is a particularly difficult case as there are few experimentally solved GPCR structures. Thus, we first train on a subset of known binding ligands for a set of GPCRs; this is then followed by benchmarking against a large ligand library. For the virtual ligand screening of a number of Dopamine receptors, encouraging results are seen, with significant enrichment in identified ligands over those found in the training set. Thus, FINDSITE and its extensions represent a powerful approach to the successful prediction of a variety of molecular functions.
@TOME-2: a new pipeline for comparative modeling of protein-ligand complexes.
Pons, Jean-Luc; Labesse, Gilles
2009-07-01
@TOME 2.0 is a new web pipeline dedicated to protein structure modeling and small ligand docking based on comparative analyses. @TOME 2.0 allows fold recognition, template selection, structural alignment editing, structure comparisons, 3D-model building and evaluation. These tasks are routinely used in sequence analyses for structure prediction. In our pipeline the necessary software is efficiently interconnected in an original manner to accelerate all the processes. Furthermore, we have also connected comparative docking of small ligands that is performed using protein-protein superposition. The input is a simple protein sequence in one-letter code with no comment. The resulting 3D model, protein-ligand complexes and structural alignments can be visualized through dedicated Web interfaces or can be downloaded for further studies. These original features will aid in the functional annotation of proteins and the selection of templates for molecular modeling and virtual screening. Several examples are described to highlight some of the new functionalities provided by this pipeline. The server and its documentation are freely available at http://abcis.cbs.cnrs.fr/AT2/
Conserved thioredoxin fold is present in Pisum sativum L. sieve element occlusion-1 protein
Umate, Pavan; Tuteja, Renu
2010-01-01
Homology-based three-dimensional model for Pisum sativum sieve element occlusion 1 (Ps.SEO1) (forisomes) protein was constructed. A stretch of amino acids (residues 320 to 456) which is well conserved in all known members of forisomes proteins was used to model the 3D structure of Ps.SEO1. The structural prediction was done using Protein Homology/analogY Recognition Engine (PHYRE) web server. Based on studies of local sequence alignment, the thioredoxin-fold containing protein [Structural Classification of Proteins (SCOP) code d1o73a_], a member of the glutathione peroxidase family was selected as a template for modeling the spatial structure of Ps.SEO1. Selection was based on comparison of primary sequence, higher match quality and alignment accuracy. Motif 1 (EVF) is conserved in Ps.SEO1, Vicia faba (Vf.For1) and Medicago truncatula (MT.SEO3); motif 2 (KKED) is well conserved across all forisomes proteins and motif 3 (IGYIGNP) is conserved in Ps.SEO1 and Vf.For1. PMID:20404566
GRAMM-X public web server for protein–protein docking
Tovchigrechko, Andrey; Vakser, Ilya A.
2006-01-01
Protein docking software GRAMM-X and its web interface extend the original GRAMM Fast Fourier Transformation methodology by employing smoothed potentials, refinement stage, and knowledge-based scoring. The web server frees users from complex installation of database-dependent parallel software and maintaining large hardware resources needed for protein docking simulations. Docking problems submitted to GRAMM-X server are processed by a 320 processor Linux cluster. The server was extensively tested by benchmarking, several months of public use, and participation in the CAPRI server track. PMID:16845016
Vinuesa, Pablo; Ochoa-Sánchez, Luz E; Contreras-Moreira, Bruno
2018-01-01
The massive accumulation of genome-sequences in public databases promoted the proliferation of genome-level phylogenetic analyses in many areas of biological research. However, due to diverse evolutionary and genetic processes, many loci have undesirable properties for phylogenetic reconstruction. These, if undetected, can result in erroneous or biased estimates, particularly when estimating species trees from concatenated datasets. To deal with these problems, we developed GET_PHYLOMARKERS, a pipeline designed to identify high-quality markers to estimate robust genome phylogenies from the orthologous clusters, or the pan-genome matrix (PGM), computed by GET_HOMOLOGUES. In the first context, a set of sequential filters are applied to exclude recombinant alignments and those producing anomalous or poorly resolved trees. Multiple sequence alignments and maximum likelihood (ML) phylogenies are computed in parallel on multi-core computers. A ML species tree is estimated from the concatenated set of top-ranking alignments at the DNA or protein levels, using either FastTree or IQ-TREE (IQT). The latter is used by default due to its superior performance revealed in an extensive benchmark analysis. In addition, parsimony and ML phylogenies can be estimated from the PGM. We demonstrate the practical utility of the software by analyzing 170 Stenotrophomonas genome sequences available in RefSeq and 10 new complete genomes of Mexican environmental S. maltophilia complex (Smc) isolates reported herein. A combination of core-genome and PGM analyses was used to revise the molecular systematics of the genus. An unsupervised learning approach that uses a goodness of clustering statistic identified 20 groups within the Smc at a core-genome average nucleotide identity (cgANIb) of 95.9% that are perfectly consistent with strongly supported clades on the core- and pan-genome trees. In addition, we identified 16 misclassified RefSeq genome sequences, 14 of them labeled as S. 
maltophilia, demonstrating the broad utility of the software for phylogenomics and geno-taxonomic studies. The code, a detailed manual and tutorials are freely available for Linux/UNIX servers under the GNU GPLv3 license at https://github.com/vinuesa/get_phylomarkers. A docker image bundling GET_PHYLOMARKERS with GET_HOMOLOGUES is available at https://hub.docker.com/r/csicunam/get_homologues/, which can be easily run on any platform.
Mahmood, Khalid; Jung, Chol-Hee; Philip, Gayle; Georgeson, Peter; Chung, Jessica; Pope, Bernard J; Park, Daniel J
2017-05-16
Genetic variant effect prediction algorithms are used extensively in clinical genomics and research to determine the likely consequences of amino acid substitutions on protein function. It is vital that we better understand their accuracies and limitations because published performance metrics are confounded by serious problems of circularity and error propagation. Here, we derive three independent, functionally determined human mutation datasets, UniFun, BRCA1-DMS and TP53-TA, and employ them, alongside previously described datasets, to assess the pre-eminent variant effect prediction tools. Apparent accuracies of variant effect prediction tools were influenced significantly by the benchmarking dataset. Benchmarking with the assay-determined datasets UniFun and BRCA1-DMS yielded areas under the receiver operating characteristic curves in the modest ranges of 0.52 to 0.63 and 0.54 to 0.75, respectively, considerably lower than observed for other, potentially more conflicted datasets. These results raise concerns about how such algorithms should be employed, particularly in a clinical setting. Contemporary variant effect prediction tools are unlikely to be as accurate at the general prediction of functional impacts on proteins as reported prior. Use of functional assay-based datasets that avoid prior dependencies promises to be valuable for the ongoing development and accurate benchmarking of such tools.
Ramus, Claire; Hovasse, Agnès; Marcellin, Marlène; Hesse, Anne-Marie; Mouton-Barbosa, Emmanuelle; Bouyssié, David; Vaca, Sebastian; Carapito, Christine; Chaoui, Karima; Bruley, Christophe; Garin, Jérôme; Cianférani, Sarah; Ferro, Myriam; Van Dorssaeler, Alain; Burlet-Schiltz, Odile; Schaeffer, Christine; Couté, Yohann; Gonzalez de Peredo, Anne
2016-01-30
Proteomic workflows based on nanoLC-MS/MS data-dependent-acquisition analysis have progressed tremendously in recent years. High-resolution and fast sequencing instruments have enabled the use of label-free quantitative methods, based either on spectral counting or on MS signal analysis, which appear as an attractive way to analyze differential protein expression in complex biological samples. However, the computational processing of the data for label-free quantification still remains a challenge. Here, we used a proteomic standard composed of an equimolar mixture of 48 human proteins (Sigma UPS1) spiked at different concentrations into a background of yeast cell lysate to benchmark several label-free quantitative workflows, involving different software packages developed in recent years. This experimental design allowed us to finely assess their performance in terms of sensitivity and false discovery rate, by measuring the number of true and false positives (respectively, UPS1 or yeast background proteins found as differential). The spiked standard dataset has been deposited to the ProteomeXchange repository with the identifier PXD001819 and can be used to benchmark other label-free workflows, adjust software parameter settings, improve algorithms for extraction of the quantitative metrics from raw MS data, or evaluate downstream statistical methods. Bioinformatic pipelines for label-free quantitative analysis must be objectively evaluated in their ability to detect variant proteins with good sensitivity and low false discovery rate in large-scale proteomic studies. This can be done through the use of complex spiked samples, for which the "ground truth" of variant proteins is known, allowing a statistical evaluation of the performances of the data processing workflow.
We provide here such a controlled standard dataset and used it to evaluate the performances of several label-free bioinformatics tools (including MaxQuant, Skyline, MFPaQ, IRMa-hEIDI and Scaffold) in different workflows, for detection of variant proteins with different absolute expression levels and fold change values. The dataset presented here can be useful for tuning software tool parameters, and also testing new algorithms for label-free quantitative analysis, or for evaluation of downstream statistical methods. Copyright © 2015 Elsevier B.V. All rights reserved.
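The two evaluation quantities named in this abstract can be illustrated with a small sketch. It assumes, as the abstract describes, that spiked UPS1 proteins called differential are true positives and that yeast background proteins called differential are false positives; the function name and the protein identifiers are hypothetical placeholders.

```python
def sensitivity_and_fdr(called_differential, spiked_proteins):
    """Return (sensitivity, false discovery rate) for one workflow's calls.

    called_differential: identifiers the workflow reports as differential.
    spiked_proteins: identifiers of the spiked standard (the known ground truth).
    """
    called = set(called_differential)
    spiked = set(spiked_proteins)
    true_pos = len(called & spiked)    # spiked proteins correctly detected
    false_pos = len(called - spiked)   # background proteins wrongly detected
    sensitivity = true_pos / len(spiked) if spiked else 0.0
    fdr = false_pos / len(called) if called else 0.0
    return sensitivity, fdr


# Hypothetical identifiers, for illustration only.
spiked = {"UPS_A", "UPS_B", "UPS_C", "UPS_D"}
called = {"UPS_A", "UPS_B", "YEAST_1"}
print(sensitivity_and_fdr(called, spiked))
```

Running the same function on the output of each workflow gives directly comparable pairs of numbers, which is the kind of comparison a spiked ground truth makes possible.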
Dafforn, Timothy R; Rajendra, Jacindra; Halsall, David J; Serpell, Louise C; Rodger, Alison
2004-01-01
High-resolution structure determination of soluble globular proteins relies heavily on x-ray crystallography techniques. Such an approach is often ineffective for investigations into the structure of fibrous proteins as these proteins generally do not crystallize. Thus investigations into fibrous protein structure have relied on less direct methods such as x-ray fiber diffraction and circular dichroism. Ultraviolet linear dichroism has the potential to provide additional information on the structure of such biomolecular systems. However, existing systems are not optimized for the requirements of fibrous proteins. We have designed and built a low-volume (200 microL), low-wavelength (down to 180 nm), low-pathlength (100 microm), high-alignment flow-alignment system (couette) to perform ultraviolet linear dichroism studies on the fibers formed by a range of biomolecules. The apparatus has been tested using a number of proteins for which longer wavelength linear dichroism spectra had already been measured. The new couette cell has also been used to obtain data on two medically important protein fibers, the all-beta-sheet amyloid fibers of the Alzheimer's derived protein Abeta and the long-chain assemblies of alpha1-antitrypsin polymers.
sc-PDB-Frag: a database of protein-ligand interaction patterns for Bioisosteric replacements.
Desaphy, Jérémy; Rognan, Didier
2014-07-28
Bioisosteric replacement plays an important role in medicinal chemistry by keeping the biological activity of a molecule while changing either its core scaffold or substituents, thereby facilitating lead optimization and patenting. Bioisosteres are classically chosen in order to keep the main pharmacophoric moieties of the substructure to replace. However, notably when changing a scaffold, no attention is usually paid as to whether all atoms of the reference scaffold are equally important for binding to the desired target. We herewith propose a novel database for bioisosteric replacement (scPDBFrag), capitalizing on our recently published structure-based approach to scaffold hopping, focusing on interaction pattern graphs. Protein-bound ligands are first fragmented and the interaction of the corresponding fragments with their protein environment computed on the fly. Using an in-house developed graph alignment tool, interaction pattern graphs can be compared, aligned, and sorted by decreasing similarity to any reference. In the herein presented sc-PDB-Frag database ( http://bioinfo-pharma.u-strasbg.fr/scPDBFrag ), fragments, interaction patterns, alignments, and pairwise similarity scores have been extracted from the sc-PDB database of 8077 druggable protein-ligand complexes and further stored in a relational database. We herewith present the database, its Web implementation, and procedures for identifying true bioisosteric replacements based on conserved interaction patterns.
Tobi, Dror
2017-08-01
A new algorithm for comparison of protein dynamics is presented. Compared protein structures are superposed and their modes of motions are calculated using the anisotropic network model. The obtained modes are aligned using the dynamic programming algorithm of Needleman and Wunsch, commonly used for sequence alignment. Dynamical comparison of hemoglobin in the T and R2 states reveals that the dynamics of the allosteric effector 2,3-bisphosphoglycerate binding site is different in the two states. These differences can contribute to the selectivity of the effector to the T state. Similar comparison of the ionotropic glutamate receptor in the kainate+(R,R)-2b and ZK bound states reveals that the kainate+(R,R)-2b bound state's slow modes describe upward motions of ligand binding domain and the transmembrane domain regions. Such motions may lead to the opening of the receptor. The upper lobes of the LBDs of the ZK bound state have a smaller interface with the amino terminal domains above them and have a better ability to move together. The present study exemplifies the use of dynamics comparison as a tool to study protein function. Proteins 2017; 85:1507-1517. © 2014 Wiley Periodicals, Inc. © 2017 Wiley Periodicals, Inc.
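To make the alignment step concrete, here is a minimal sketch of Needleman-Wunsch global alignment applied to two ordered lists of mode vectors rather than to residues. The similarity measure (absolute cosine overlap between modes), the gap penalty, and all names are assumptions made for illustration; the abstract does not specify the scoring actually used.

```python
import numpy as np


def mode_overlap(modes_a, modes_b):
    """Absolute cosine overlap between every mode of A and every mode of B."""
    a = modes_a / np.linalg.norm(modes_a, axis=1, keepdims=True)
    b = modes_b / np.linalg.norm(modes_b, axis=1, keepdims=True)
    return np.abs(a @ b.T)


def needleman_wunsch(sim, gap=-0.5):
    """Globally align rows to columns of a similarity matrix.

    Returns the optimal score and a list of (i, j) pairs; None marks a gap.
    """
    n, m = sim.shape
    score = np.zeros((n + 1, m + 1))
    score[:, 0] = gap * np.arange(n + 1)
    score[0, :] = gap * np.arange(m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i, j] = max(
                score[i - 1, j - 1] + sim[i - 1, j - 1],  # pair mode i with mode j
                score[i - 1, j] + gap,                    # mode i left unmatched
                score[i, j - 1] + gap,                    # mode j left unmatched
            )
    # Trace back from the bottom-right corner to recover the pairing.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        diag = i > 0 and j > 0 and np.isclose(
            score[i, j], score[i - 1, j - 1] + sim[i - 1, j - 1])
        if diag:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and np.isclose(score[i, j], score[i - 1, j] + gap):
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    return float(score[n, m]), pairs[::-1]


# Toy example: structure B lacks a counterpart of A's second mode.
modes_a = np.eye(3)
modes_b = np.eye(3)[[0, 2]]
print(needleman_wunsch(mode_overlap(modes_a, modes_b)))
```

The dynamic programming is unchanged from the sequence case; only the substitution score is replaced by a mode-to-mode similarity.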
Template-based protein-protein docking exploiting pairwise interfacial residue restraints.
Xue, Li C; Rodrigues, João P G L M; Dobbs, Drena; Honavar, Vasant; Bonvin, Alexandre M J J
2017-05-01
Although many advanced and sophisticated ab initio approaches for modeling protein-protein complexes have been proposed in past decades, template-based modeling (TBM) remains the most accurate and widely used approach, given a reliable template is available. However, there are many different ways to exploit template information in the modeling process. Here, we systematically evaluate and benchmark a TBM method that uses conserved interfacial residue pairs as docking distance restraints [referred to as alpha carbon-alpha carbon (CA-CA)-guided docking]. We compare it with two other template-based protein-protein modeling approaches, including a conserved non-pairwise interfacial residue restrained docking approach [referred to as the ambiguous interaction restraint (AIR)-guided docking] and a simple superposition-based modeling approach. Our results show that, for most cases, the CA-CA-guided docking method outperforms both superposition with refinement and the AIR-guided docking method. We emphasize the superiority of the CA-CA-guided docking on cases with medium to large conformational changes, and interactions mediated through loops, tails or disordered regions. Our results also underscore the importance of a proper refinement of superimposition models to reduce steric clashes. In summary, we provide a benchmarked TBM protocol that uses conserved pairwise interface distance as restraints in generating realistic 3D protein-protein interaction models, when reliable templates are available. The described CA-CA-guided docking protocol is based on the HADDOCK platform, which allows users to incorporate additional prior knowledge of the target system to further improve the quality of the resulting models. © The Author 2016. Published by Oxford University Press.
2013-01-01
Background The goal of many proteomics experiments is to determine the abundance of proteins in biological samples, and the variation thereof in various physiological conditions. High-throughput quantitative proteomics, specifically label-free LC-MS/MS, allows rapid measurement of thousands of proteins, enabling large-scale studies of various biological systems. Prior to analyzing these information-rich datasets, raw data must undergo several computational processing steps. We present a method to address one of the essential steps in proteomics data processing - the matching of peptide measurements across samples. Results We describe a novel method for label-free proteomics data alignment with the ability to incorporate previously unused aspects of the data, particularly ion mobility drift times and product ion information. We compare the results of our alignment method to PEPPeR and OpenMS, and compare alignment accuracy achieved by different versions of our method utilizing various data characteristics. Our method results in increased match recall rates and similar or improved mismatch rates compared to PEPPeR and OpenMS feature-based alignment. We also show that the inclusion of drift time and product ion information results in higher recall rates and more confident matches, without increases in error rates. Conclusions Based on the results presented here, we argue that the incorporation of ion mobility drift time and product ion information are worthy pursuits. Alignment methods should be flexible enough to utilize all available data, particularly with recent advancements in experimental separation methods. PMID:24341404
Benjamin, Ashlee M; Thompson, J Will; Soderblom, Erik J; Geromanos, Scott J; Henao, Ricardo; Kraus, Virginia B; Moseley, M Arthur; Lucas, Joseph E
2013-12-16
The goal of many proteomics experiments is to determine the abundance of proteins in biological samples, and the variation thereof in various physiological conditions. High-throughput quantitative proteomics, specifically label-free LC-MS/MS, allows rapid measurement of thousands of proteins, enabling large-scale studies of various biological systems. Prior to analyzing these information-rich datasets, raw data must undergo several computational processing steps. We present a method to address one of the essential steps in proteomics data processing--the matching of peptide measurements across samples. We describe a novel method for label-free proteomics data alignment with the ability to incorporate previously unused aspects of the data, particularly ion mobility drift times and product ion information. We compare the results of our alignment method to PEPPeR and OpenMS, and compare alignment accuracy achieved by different versions of our method utilizing various data characteristics. Our method results in increased match recall rates and similar or improved mismatch rates compared to PEPPeR and OpenMS feature-based alignment. We also show that the inclusion of drift time and product ion information results in higher recall rates and more confident matches, without increases in error rates. Based on the results presented here, we argue that the incorporation of ion mobility drift time and product ion information are worthy pursuits. Alignment methods should be flexible enough to utilize all available data, particularly with recent advancements in experimental separation methods.
Embedding strategies for effective use of information from multiple sequence alignments.
Henikoff, S.; Henikoff, J. G.
1997-01-01
We describe a new strategy for utilizing multiple sequence alignment information to detect distant relationships in searches of sequence databases. A single sequence representing a protein family is enriched by replacing conserved regions with position-specific scoring matrices (PSSMs) or consensus residues derived from multiple alignments of family members. In comprehensive tests of these and other family representations, PSSM-embedded queries produced the best results overall when used with a special version of the Smith-Waterman searching algorithm. Moreover, embedding consensus residues instead of PSSMs improved performance with readily available single sequence query searching programs, such as BLAST and FASTA. Embedding PSSMs or consensus residues into a representative sequence improves searching performance by extracting multiple alignment information from motif regions while retaining single sequence information where alignment is uncertain. PMID:9070452
GASP: Gapped Ancestral Sequence Prediction for proteins
Edwards, Richard J; Shields, Denis C
2004-01-01
Background The prediction of ancestral protein sequences from multiple sequence alignments is useful for many bioinformatics analyses. Predicting ancestral sequences is not a simple procedure and relies on accurate alignments and phylogenies. Several algorithms exist based on Maximum Parsimony or Maximum Likelihood methods but many current implementations are unable to process residues with gaps, which may represent insertion/deletion (indel) events or sequence fragments. Results Here we present a new algorithm, GASP (Gapped Ancestral Sequence Prediction), for predicting ancestral sequences from phylogenetic trees and the corresponding multiple sequence alignments. Alignments may be of any size and contain gaps. GASP first assigns the positions of gaps in the phylogeny before using a likelihood-based approach centred on amino acid substitution matrices to assign ancestral amino acids. Important outgroup information is used by first working down from the tips of the tree to the root, using descendant data only to assign probabilities, and then working back up from the root to the tips using descendant and outgroup data to make predictions. GASP was tested on a number of simulated datasets based on real phylogenies. Prediction accuracy for ungapped data was similar to three alternative algorithms tested, with GASP performing better in some cases and worse in others. Adding simple insertions and deletions to the simulated data did not have a detrimental effect on GASP accuracy. Conclusions GASP (Gapped Ancestral Sequence Prediction) will predict ancestral sequences from multiple protein alignments of any size. Although not as accurate in all cases as some of the more sophisticated maximum likelihood approaches, it can process a wide range of input phylogenies and will predict ancestral sequences for gapped and ungapped residues alike. PMID:15350199
2015-01-01
Benchmarking data sets have become common in recent years for the purpose of virtual screening, though the main focus had been placed on the structure-based virtual screening (SBVS) approaches. Due to the lack of crystal structures, there is great need for unbiased benchmarking sets to evaluate various ligand-based virtual screening (LBVS) methods for important drug targets such as G protein-coupled receptors (GPCRs). To date these ready-to-apply data sets for LBVS are fairly limited, and the direct usage of benchmarking sets designed for SBVS could bring the biases to the evaluation of LBVS. Herein, we propose an unbiased method to build benchmarking sets for LBVS and validate it on a multitude of GPCRs targets. To be more specific, our methods can (1) ensure chemical diversity of ligands, (2) maintain the physicochemical similarity between ligands and decoys, (3) make the decoys dissimilar in chemical topology to all ligands to avoid false negatives, and (4) maximize spatial random distribution of ligands and decoys. We evaluated the quality of our Unbiased Ligand Set (ULS) and Unbiased Decoy Set (UDS) using three common LBVS approaches, with Leave-One-Out (LOO) Cross-Validation (CV) and a metric of average AUC of the ROC curves. Our method has greatly reduced the “artificial enrichment” and “analogue bias” of a published GPCRs benchmarking set, i.e., GPCR Ligand Library (GLL)/GPCR Decoy Database (GDD). In addition, we addressed an important issue about the ratio of decoys per ligand and found that for a range of 30 to 100 it does not affect the quality of the benchmarking set, so we kept the original ratio of 39 from the GLL/GDD. PMID:24749745
Xia, Jie; Jin, Hongwei; Liu, Zhenming; Zhang, Liangren; Wang, Xiang Simon
2014-05-27
Benchmarking data sets have become common in recent years for the purpose of virtual screening, though the main focus had been placed on the structure-based virtual screening (SBVS) approaches. Due to the lack of crystal structures, there is great need for unbiased benchmarking sets to evaluate various ligand-based virtual screening (LBVS) methods for important drug targets such as G protein-coupled receptors (GPCRs). To date these ready-to-apply data sets for LBVS are fairly limited, and the direct usage of benchmarking sets designed for SBVS could bring the biases to the evaluation of LBVS. Herein, we propose an unbiased method to build benchmarking sets for LBVS and validate it on a multitude of GPCRs targets. To be more specific, our methods can (1) ensure chemical diversity of ligands, (2) maintain the physicochemical similarity between ligands and decoys, (3) make the decoys dissimilar in chemical topology to all ligands to avoid false negatives, and (4) maximize spatial random distribution of ligands and decoys. We evaluated the quality of our Unbiased Ligand Set (ULS) and Unbiased Decoy Set (UDS) using three common LBVS approaches, with Leave-One-Out (LOO) Cross-Validation (CV) and a metric of average AUC of the ROC curves. Our method has greatly reduced the "artificial enrichment" and "analogue bias" of a published GPCRs benchmarking set, i.e., GPCR Ligand Library (GLL)/GPCR Decoy Database (GDD). In addition, we addressed an important issue about the ratio of decoys per ligand and found that for a range of 30 to 100 it does not affect the quality of the benchmarking set, so we kept the original ratio of 39 from the GLL/GDD.
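The evaluation metric mentioned above, an average AUC of ROC curves, can be sketched as follows. The AUC is computed as the probability that a randomly chosen ligand is scored above a randomly chosen decoy, which is equivalent to the area under the ROC curve. How the scores are produced, and how the cross-validation folds are formed, is left out; the function names and score values are hypothetical.

```python
def roc_auc(ligand_scores, decoy_scores):
    """Probability that a random ligand outscores a random decoy (ties count half)."""
    wins = 0.0
    for s in ligand_scores:
        for d in decoy_scores:
            if s > d:
                wins += 1.0
            elif s == d:
                wins += 0.5
    return wins / (len(ligand_scores) * len(decoy_scores))


def mean_auc(folds):
    """Average AUC over evaluation runs, each a (ligand_scores, decoy_scores) pair."""
    aucs = [roc_auc(ligands, decoys) for ligands, decoys in folds]
    return sum(aucs) / len(aucs)


# Hypothetical similarity scores from two evaluation runs.
runs = [([0.9, 0.3], [0.5, 0.1]), ([0.8, 0.7], [0.2, 0.4])]
print(mean_auc(runs))
```

An AUC close to 0.5 on a decoy set indicates that the method cannot separate ligands from decoys by score alone, while values inflated by easily distinguishable decoys are the kind of artificial enrichment the abstract aims to reduce.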
Bidargaddi, Niranjan P; Chetty, Madhu; Kamruzzaman, Joarder
2008-06-01
Profile hidden Markov models (HMMs) based on classical HMMs have been widely applied for protein sequence identification. The formulation of the forward and backward variables in profile HMMs is made under the statistical independence assumption of probability theory. We propose a fuzzy profile HMM to overcome the limitations of that assumption and to achieve an improved alignment for protein sequences belonging to a given family. The proposed model fuzzifies the forward and backward variables by incorporating Sugeno fuzzy measures and Choquet integrals, thereby further extending the generalized HMM. Based on the fuzzified forward and backward variables, we propose a fuzzy Baum-Welch parameter estimation algorithm for profiles. The strong correlations and the sequence preference involved in protein structures make this fuzzy-architecture-based model a suitable candidate for building profiles of a given family, since the fuzzy set can handle uncertainties better than classical methods.
Generate Optimized Genetic Rhythm for Enzyme Expression in Non-native systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
2016-11-03
Most amino acids are represented by more than one codon, resulting in redundancy in the genetic code. Silent codon substitutions that do not alter the amino acid sequence still have an effect on protein expression. We have developed an algorithm, GoGREEN, to enhance the expression of foreign proteins in a host organism. GoGREEN selects codons according to frequency patterns seen in the gene of interest using the codon usage table from the host organism. GoGREEN is also designed to accommodate gaps in the sequence. This software takes for input (1) the aligned protein sequences for genes the user wishes to express, (2) the codon usage table for the host organism, and (3) the DNA sequence for the target protein found in the host organism. The program will select codons based on codon usage patterns for the target DNA sequence. The program will also select codons for "gaps" found in the aligned protein sequences using the codon usage table from the host organism.
High scale impact in alignment and decoupling in two-Higgs-doublet models
NASA Astrophysics Data System (ADS)
Basler, Philipp; Ferreira, Pedro M.; Mühlleitner, Margarete; Santos, Rui
2018-05-01
The two-Higgs-doublet model (2HDM) provides an excellent benchmark to study physics beyond the Standard Model (SM). In this work, we discuss how the behavior of the model at high-energy scales causes it to have a scalar with properties very similar to those of the SM, which means the 2HDM can be seen to naturally favor a decoupling or alignment limit. For a type II 2HDM, we show that requiring the model to be theoretically valid up to a scale of 1 TeV, by studying the renormalization group equations (RGE) of the parameters of the model, causes a significant reduction in the allowed magnitude of the quartic couplings. This, combined with B-physics bounds, forces the model to be naturally decoupled. As a consequence, any nondecoupling limits in type II, like the wrong-sign scenario, are excluded. On the contrary, even with the very constraining limits for the Higgs couplings from the LHC, the type I model can deviate substantially from alignment. An RGE analysis similar to that made for type II shows, however, that requiring a single scalar to be heavier than about 500 GeV would be sufficient for the model to be decoupled. Finally, we show that the 2HDM is stable up to the Planck scale independently of which of the CP-even scalars is the discovered 125 GeV Higgs boson.
A 2D eye gaze estimation system with low-resolution webcam images
NASA Astrophysics Data System (ADS)
Ince, Ibrahim Furkan; Kim, Jin Woo
2011-12-01
In this article, a low-cost system for 2D eye gaze estimation with low-resolution webcam images is presented. Two algorithms are proposed for this purpose, one for eyeball detection with a stable approximate pupil center and the other for detecting the direction of eye movements. The eyeball is detected using the deformable angular integral search by minimum intensity (DAISMI) algorithm. The deformable template-based 2D gaze estimation (DTBGE) algorithm is employed as a noise filter for making stable movement decisions. While DTBGE employs binary images, DAISMI employs gray-scale images. Right and left eye estimates are evaluated separately. DAISMI finds the stable approximate pupil-center location by calculating the mass center of the eyeball border vertices, which is then used for the initial deformable template alignment. DTBGE starts running with the initial alignment and updates the template alignment with the resulting eye movements and eyeball size frame by frame. The horizontal and vertical deviation of eye movements relative to eyeball size is treated as directly proportional to the deviation of cursor movements for a given screen size and resolution. The core advantage of the system is that it does not employ the real pupil center as a reference point for gaze estimation, which makes it more robust against corneal reflection. Visual angle accuracy is used for the evaluation and benchmarking of the system. Effectiveness of the proposed system is presented and experimental results are shown.
Kim, Won Hwa; Chung, Moo K; Singh, Vikas
2013-01-01
The analysis of 3-D shape meshes is a fundamental problem in computer vision, graphics, and medical imaging. Frequently, the needs of the application require that our analysis take a multi-resolution view of the shape's local and global topology, and that the solution is consistent across multiple scales. Unfortunately, the preferred mathematical construct which offers this behavior in classical image/signal processing, Wavelets, is no longer applicable in this general setting (data with non-uniform topology). In particular, the traditional definition does not allow writing out an expansion for graphs that do not correspond to the uniformly sampled lattice (e.g., images). In this paper, we adapt recent results in harmonic analysis, to derive Non-Euclidean Wavelets based algorithms for a range of shape analysis problems in vision and medical imaging. We show how descriptors derived from the dual domain representation offer native multi-resolution behavior for characterizing local/global topology around vertices. With only minor modifications, the framework yields a method for extracting interest/key points from shapes, a surprisingly simple algorithm for 3-D shape segmentation (competitive with state of the art), and a method for surface alignment (without landmarks). We give an extensive set of comparison results on a large shape segmentation benchmark and derive a uniqueness theorem for the surface alignment problem.
BIOREL: the benchmark resource to estimate the relevance of the gene networks.
Antonov, Alexey V; Mewes, Hans W
2006-02-06
The progress of high-throughput methodologies in functional genomics has led to the development of statistical procedures to infer gene networks from various types of high-throughput data. However, due to the lack of common standards, the biological significance of the results of the different studies is hard to compare. To overcome this problem we propose a benchmark procedure and have developed a web resource (BIOREL), which is useful for estimating the biological relevance of any genetic network by integrating different sources of biological information. The associations of each gene from the network are classified as biologically relevant or not. The proportion of genes in the network classified as "relevant" is used as the overall network relevance score. Employing synthetic data we demonstrated that such a score ranks the networks fairly with respect to the relevance level. Using BIOREL as the benchmark resource we compared the quality of experimental and theoretically predicted protein interaction data.
ProteinWorldDB: querying radical pairwise alignments among protein sets from complete genomes.
Otto, Thomas Dan; Catanho, Marcos; Tristão, Cristian; Bezerra, Márcia; Fernandes, Renan Mathias; Elias, Guilherme Steinberger; Scaglia, Alexandre Capeletto; Bovermann, Bill; Berstis, Viktors; Lifschitz, Sergio; de Miranda, Antonio Basílio; Degrave, Wim
2010-03-01
Many analyses in modern biological research are based on comparisons between biological sequences, resulting in functional, evolutionary and structural inferences. When large numbers of sequences are compared, heuristics are often used resulting in a certain lack of accuracy. In order to improve and validate results of such comparisons, we have performed radical all-against-all comparisons of 4 million protein sequences belonging to the RefSeq database, using an implementation of the Smith-Waterman algorithm. This extremely intensive computational approach was made possible with the help of World Community Grid, through the Genome Comparison Project. The resulting database, ProteinWorldDB, which contains coordinates of pairwise protein alignments and their respective scores, is now made available. Users can download, compare and analyze the results, filtered by genomes, protein functions or clusters. ProteinWorldDB is integrated with annotations derived from Swiss-Prot, Pfam, KEGG, NCBI Taxonomy database and gene ontology. The database is a unique and valuable asset, representing a major effort to create a reliable and consistent dataset of cross-comparisons of the whole protein content encoded in hundreds of completely sequenced genomes using a rigorous dynamic programming approach. The database can be accessed through http://proteinworlddb.org
Elongational Flow Assists with the Assembly of Protein Nanofibrils
NASA Astrophysics Data System (ADS)
Mittal, Nitesh; Kamada, Ayaka; Lendel, Christofer; Lundell, Fredrik; Soderberg, Daniel
2016-11-01
Controlling the aggregation process of protein-based macromolecular structures in a confined environment using small-scale flow devices and understanding their assembly mechanisms is essential to develop bio-based materials. Whey protein, a protein mixture with β-lactoglobulin as its main component, is able to self-assemble into amyloid-like protein nanofibers, which are stabilized by hydrogen bonds. The conditions under which the fibrillation process occurs can affect the properties and morphology of the fibrils. Here, we show that the morphology of protein nanofibers greatly affects their assembly. We used an elongational-flow-based double flow-focusing device for this study. The in-situ behavior of the straight and flexible fibrils in the flow channel is determined using the small-angle X-ray scattering (SAXS) technique. Our process combines hydrodynamic alignment with a dispersion-to-gel transition that produces homogeneous and smooth fibers. Moreover, successful alignment before gelation demands a proper separation of the time scales involved, which we tried to identify in the current study. The presented approach, combining small-scale flow devices with in-situ synchrotron X-ray studies and protein engineering, is a promising route to design high-performance protein-based materials with controlled physical and chemical properties. We acknowledge the support from Wallenberg Wood Science Center.
Tamura, Takeyuki; Akutsu, Tatsuya
2007-11-30
Subcellular location prediction of proteins is an important and well-studied problem in bioinformatics. This is the problem of predicting which part of a cell a given protein is transported to, where the amino acid sequence of the protein is given as input. This problem is becoming more important since information on subcellular location is helpful for annotation of proteins and genes and the number of complete genomes is rapidly increasing. Since existing predictors are based on various heuristics, it is important to develop a simple method with high prediction accuracies. In this paper, we propose a novel and general prediction method that combines techniques for sequence alignment with feature vectors based on amino acid composition. We implemented this method with support vector machines on plant data sets extracted from the TargetP database. In fivefold cross-validation tests, the overall accuracy and average MCC obtained were 0.9096 and 0.8655, respectively. We also applied our method to other datasets, including that of WoLF PSORT. Although there is a predictor which uses gene ontology information and yields higher accuracy than ours, our accuracies are higher than those of existing predictors which use only sequence information. Since information such as gene ontology can be obtained only for known proteins, our predictor is considered to be useful for subcellular location prediction of newly discovered proteins. Furthermore, the idea of combining alignment and amino acid frequency is novel and general, so it may be applied to other problems in bioinformatics. Our method for plants is also implemented as a web system and is available at http://sunflower.kuicr.kyoto-u.ac.jp/~tamura/slpfa.html.
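To make the two quantities named in this abstract concrete, the following Python sketch computes an amino acid composition feature vector and the Matthews correlation coefficient (MCC) for a binary confusion matrix. It is an illustration only: the 20-letter alphabet ordering, the handling of non-standard residues, and the function names are assumptions, and the alignment-based features and the support vector machine used by the authors are not reproduced.

```python
import math

# The 20 standard amino acids; the ordering here is an arbitrary choice.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_vector(seq):
    """Fraction of each standard amino acid in seq (other symbols are ignored)."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    total = 0
    for ch in seq.upper():
        if ch in counts:
            counts[ch] += 1
            total += 1
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for one class versus the rest."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    vec = composition_vector("MKTAYIAKQR")
    print(len(vec), round(sum(vec), 6))
    print(mcc(tp=5, tn=5, fp=0, fn=0))
```

An "average MCC" as reported in the abstract would presumably be the mean of per-class values, one for each location class; that averaging step is left out of the sketch.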
PconsFold: improved contact predictions improve protein models.
Michel, Mirco; Hayat, Sikander; Skwark, Marcin J; Sander, Chris; Marks, Debora S; Elofsson, Arne
2014-09-01
Recently it has been shown that the quality of protein contact prediction from evolutionary information can be improved significantly if direct and indirect information is separated. Given sufficiently large protein families, the contact predictions contain sufficient information to predict the structure of many protein families. However, since the first studies, contact prediction methods have improved. Here, we ask how much the final models are improved if improved contact predictions are used. In a small benchmark of 15 proteins, we show that the TM-scores of top-ranked models are improved by on average 33% using PconsFold compared with the original version of EVfold. In a larger benchmark, we find that the quality is improved by 15-30% when using PconsC in comparison with earlier contact prediction methods. Further, using Rosetta instead of CNS does not significantly improve global model accuracy, but the chemistry of models generated with Rosetta is improved. PconsFold is a fully automated pipeline for ab initio protein structure prediction based on evolutionary information. PconsFold is based on PconsC contact prediction and uses the Rosetta folding protocol. Due to its modularity, the contact prediction tool can be easily exchanged. The source code of PconsFold is available on GitHub at https://www.github.com/ElofssonLab/pcons-fold under the MIT license. PconsC is available from http://c.pcons.net/. Supplementary data are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press.
MACSIMS : multiple alignment of complete sequences information management system
Thompson, Julie D; Muller, Arnaud; Waterhouse, Andrew; Procter, Jim; Barton, Geoffrey J; Plewniak, Frédéric; Poch, Olivier
2006-01-01
Background: In the post-genomic era, systems-level studies are being performed that seek to explain complex biological systems by integrating diverse resources from fields such as genomics, proteomics or transcriptomics. New information management systems are now needed for the collection, validation and analysis of the vast amount of heterogeneous data available. Multiple alignments of complete sequences provide an ideal environment for the integration of this information in the context of the protein family. Results: MACSIMS is a multiple alignment-based information management program that combines the advantages of both knowledge-based and ab initio sequence analysis methods. Structural and functional information is retrieved automatically from the public databases. In the multiple alignment, homologous regions are identified and the retrieved data is evaluated and propagated from known to unknown sequences with these reliable regions. In a large-scale evaluation, the specificity of the propagated sequence features is estimated to be >99%, i.e. very few false positive predictions are made. MACSIMS is then used to characterise mutations in a test set of 100 proteins that are known to be involved in human genetic diseases. The number of sequence features associated with these proteins was increased by 60%, compared to the features available in the public databases. An XML format output file allows automatic parsing of the MACSIMS results, while a graphical display using the JalView program allows manual analysis. Conclusion: MACSIMS is a new information management system that incorporates detailed analyses of protein families at the structural, functional and evolutionary levels. MACSIMS thus provides a unique environment that facilitates knowledge extraction and the presentation of the most pertinent information to the biologist. A web server and the source code are available at . PMID:16792820
MaxAlign: maximizing usable data in an alignment.
Gouveia-Oliveira, Rodrigo; Sackett, Peter W; Pedersen, Anders G
2007-08-28
The presence of gaps in an alignment of nucleotide or protein sequences is often an inconvenience for bioinformatical studies. In phylogenetic and other analyses, for instance, gapped columns are often discarded entirely from the alignment. MaxAlign is a program that optimizes the alignment prior to such analyses. Specifically, it maximizes the number of nucleotide (or amino acid) symbols that are present in gap-free columns - the alignment area - by selecting the optimal subset of sequences to exclude from the alignment. MaxAlign can be used prior to phylogenetic and bioinformatical analyses as well as in other situations where this form of alignment improvement is useful. In this work we test MaxAlign's performance in these tasks and compare the accuracy of phylogenetic estimates including and excluding gapped columns from the analysis, with and without processing with MaxAlign. In this paper we also introduce a new simple measure of tree similarity, Normalized Symmetric Similarity (NSS) that we consider useful for comparing tree topologies. We demonstrate how MaxAlign is helpful in detecting misaligned or defective sequences without requiring manual inspection. We also show that it is not advisable to exclude gapped columns from phylogenetic analyses unless MaxAlign is used first. Finally, we find that the sequences removed by MaxAlign from an alignment tend to be those that would otherwise be associated with low phylogenetic accuracy, and that the presence of gaps in any given sequence does not seem to disturb the phylogenetic estimates of other sequences. The MaxAlign web-server is freely available online at http://www.cbs.dtu.dk/services/MaxAlign where supplementary information can also be found. The program is also freely available as a Perl stand-alone package.
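The "alignment area" defined in this abstract (the number of symbols that sit in gap-free columns) is easy to state in code. The Python sketch below computes it and then searches, by brute force, for the small set of sequences whose removal maximizes it. The exhaustive search is a stand-in chosen for clarity on toy inputs; the abstract does not describe MaxAlign's own optimization procedure, and the function names and the `max_remove` parameter are hypothetical.

```python
from itertools import combinations


def alignment_area(seqs, gap="-"):
    """Symbols in gap-free columns: (number of gap-free columns) x (number of sequences)."""
    if not seqs:
        return 0
    gap_free_columns = sum(1 for col in zip(*seqs) if gap not in col)
    return gap_free_columns * len(seqs)


def best_removal_bruteforce(seqs, max_remove=2):
    """Try every removal of up to max_remove sequences; return (best area, removed indices).

    Exponential in max_remove, so suitable only for very small examples.
    """
    best = (alignment_area(seqs), ())
    for k in range(1, max_remove + 1):
        for removed in combinations(range(len(seqs)), k):
            kept = [s for i, s in enumerate(seqs) if i not in removed]
            area = alignment_area(kept)
            if area > best[0]:
                best = (area, removed)
    return best


if __name__ == "__main__":
    aln = ["ACGT-ACGT", "ACGTTACGT", "A--T-ACGT", "ACGTTACG-"]
    print(alignment_area(aln))           # 5 gap-free columns x 4 sequences = 20
    print(best_removal_bruteforce(aln))  # dropping sequence 2 gives 7 x 3 = 21
```

In this toy alignment, excluding the gappiest sequence frees two columns, so the area rises from 20 to 21 even though one sequence is lost. This is the trade-off the abstract describes.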
Tsukui, Shu; Kimura, Fumiko; Kusaka, Katsuhiro; Baba, Seiki; Mizuno, Nobuhiro; Kimura, Tsunehisa
2016-07-01
Protein microcrystals magnetically aligned in D2O hydrogels were subjected to neutron diffraction measurements, and reflections were observed for the first time to a resolution of 3.4 Å from lysozyme microcrystals (∼10 × 10 × 50 µm). This result demonstrated the possibility that magnetically oriented microcrystals consolidated in D2O gels may provide a promising means to obtain single-crystal neutron diffraction from proteins that do not crystallize at the sizes required for neutron diffraction structure determination. In addition, lysozyme microcrystals aligned in H2O hydrogels allowed structure determination at a resolution of 1.76 Å at room temperature by X-ray diffraction. The use of gels has advantages since the microcrystals are measured under hydrated conditions.
Craig, George D.; Glass, Robert; Rupp, Bernhard
1997-01-01
A method for forming synthetic crystals of proteins in a carrier fluid by use of the dipole moments of protein macromolecules that self-align in the Helmholtz layer adjacent to an electrode. The voltage gradients of such layers easily exceed 10^6 V/m. The synthetic protein crystals are subjected to x-ray crystallography to determine the conformational structure of the protein involved.
Combining Rosetta with molecular dynamics (MD): A benchmark of the MD-based ensemble protein design.
Ludwiczak, Jan; Jarmula, Adam; Dunin-Horkawicz, Stanislaw
2018-07-01
Computational protein design is a set of procedures for computing amino acid sequences that will fold into a specified structure. Rosetta Design, a commonly used software for protein design, allows for the effective identification of sequences compatible with a given backbone structure, while molecular dynamics (MD) simulations can thoroughly sample near-native conformations. We benchmarked a procedure in which Rosetta design is started on MD-derived structural ensembles and showed that such a combined approach generates 20-30% more diverse sequences than currently available methods with only a slight increase in computation time. Importantly, the increase in diversity is achieved without a loss in the quality of the designed sequences assessed by their resemblance to natural sequences. We demonstrate that the MD-based procedure is also applicable to de novo design tasks started from backbone structures without any sequence information. In addition, we implemented a protocol that can be used to assess the stability of designed models and to select the best candidates for experimental validation. In sum our results demonstrate that the MD ensemble-based flexible backbone design can be a viable method for protein design, especially for tasks that require a large pool of diverse sequences. Copyright © 2018 Elsevier Inc. All rights reserved.
HPEPDOCK: a web server for blind peptide-protein docking based on a hierarchical algorithm.
Zhou, Pei; Jin, Bowen; Li, Hao; Huang, Sheng-You
2018-05-09
Protein-peptide interactions are crucial in many cellular functions. Therefore, determining the structure of protein-peptide complexes is important for understanding the molecular mechanism of related biological processes and developing peptide drugs. HPEPDOCK is a novel web server for blind protein-peptide docking through a hierarchical algorithm. Instead of running lengthy simulations to refine peptide conformations, HPEPDOCK considers the peptide flexibility through an ensemble of peptide conformations generated by our MODPEP program. For blind global peptide docking, HPEPDOCK obtained a success rate of 33.3% in binding mode prediction on a benchmark of 57 unbound cases when the top 10 models were considered, compared to 21.1% for pepATTRACT server. HPEPDOCK also performed well in docking against homology models and obtained a success rate of 29.8% within top 10 predictions. For local peptide docking, HPEPDOCK achieved a high success rate of 72.6% on a benchmark of 62 unbound cases within top 10 predictions, compared to 45.2% for HADDOCK peptide protocol. Our HPEPDOCK server is computationally efficient and consumed an average of 29.8 mins for a global peptide docking job and 14.2 mins for a local peptide docking job. The HPEPDOCK web server is available at http://huanglab.phys.hust.edu.cn/hpepdock/.
Strong Keratin-like Nanofibers Made of Globular Protein
NASA Astrophysics Data System (ADS)
Dror, Yael; Makarov, Vadim; Admon, Arie; Zussman, Eyal
2008-03-01
Protein fibers as elementary structural and functional elements in nature inspire the engineering of protein-based products for versatile bio-medical applications. We have recently used the electrospinning process to fabricate strong sub-micron fibers made solely of serum albumin (SA). This raises the challenges of turning a globular non-viscous protein solution into a polymer-like spinnable solution and producing keratin-like fibers enriched in inter S-S bridges. A stable spinning process was achieved by using SA solution in a rich trifluoroethanol-water mixture with β-mercaptoethanol. The breakage of the intra disulfide bridges, as identified by mass spectrometry, together with the denaturing alcohol, enabled a pronounced expansion of the protein. This, in turn, affects the rheological properties of the solution. The X-ray diffraction pattern of the fibers revealed equatorial orientation, indicating the alignment of structures along the fiber axis. The mechanical properties reached remarkable average values (Young's modulus of 1.6 GPa and maximum stress of 36 MPa) as compared to other fibrous protein nanofibers. These significant results are attributed to both the alignment and the inter disulfide bonds (cross-linking) that were formed by spontaneous post-spinning oxidation.
Use of conserved key amino acid positions to morph protein folds.
Reddy, Boojala V B; Li, Wilfred W; Bourne, Philip E
2002-07-15
By using three-dimensional (3D) structure alignments and a previously published method to determine Conserved Key Amino Acid Positions (CKAAPs) we propose a theoretical method to design mutations that can be used to morph the protein folds. The original Paracelsus challenge, met by several groups, called for the engineering of a stable but different structure by modifying less than 50% of the amino acid residues. We have used the sequences from the Protein Data Bank (PDB) identifiers 1ROP, and 2CRO, which were previously used in the Paracelsus challenge by those groups, and suggest mutation to CKAAPs to morph the protein fold. The total number of mutations suggested is less than 40% of the starting sequence theoretically improving the challenge results. From secondary structure prediction experiments of the proposed mutant sequence structures, we observe that each of the suggested mutant protein sequences likely folds to a different, non-native potentially stable target structure. These results are an early indicator that analyses using structure alignments leading to CKAAPs of a given structure are of value in protein engineering experiments. Copyright 2002 Wiley Periodicals, Inc.
Identifying functionally informative evolutionary sequence profiles.
Gil, Nelson; Fiser, Andras
2018-04-15
Multiple sequence alignments (MSAs) can provide essential input to many bioinformatics applications, including protein structure prediction and functional annotation. However, the optimal selection of sequences to obtain biologically informative MSAs for such purposes is poorly explored, and has traditionally been performed manually. We present Selection of Alignment by Maximal Mutual Information (SAMMI), an automated, sequence-based approach to objectively select an optimal MSA from a large set of alternatives sampled from a general sequence database search. The hypothesis of this approach is that the mutual information among MSA columns will be maximal for those MSAs that contain the most diverse set possible of the most structurally and functionally homogeneous protein sequences. SAMMI was tested to select MSAs for functional site residue prediction by analysis of conservation patterns on a set of 435 proteins obtained from protein-ligand (peptides, nucleic acids and small substrates) and protein-protein interaction databases. Availability and implementation: A freely accessible program, including source code, implementing SAMMI is available at https://github.com/nelsongil92/SAMMI.git. Contact: andras.fiser@einstein.yu.edu. Supplementary data are available at Bioinformatics online.
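As a small illustration of the quantity this abstract relies on, the Python sketch below computes the Shannon mutual information between two columns of an MSA from their empirical symbol frequencies. It is a textbook estimator, not the SAMMI implementation: how SAMMI aggregates column pairs, treats gaps, or corrects for small samples is not stated in the abstract, and the function names are hypothetical.

```python
import math
from collections import Counter


def column(msa, j):
    """Return column j of an MSA given as a list of equal-length strings."""
    return [seq[j] for seq in msa]


def mutual_information(col_a, col_b):
    """Mutual information in bits between two equally long columns.

    Gap characters are treated as ordinary symbols (an assumption).
    """
    if len(col_a) != len(col_b) or not col_a:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(col_a)
    p_a = Counter(col_a)
    p_b = Counter(col_b)
    p_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), count in p_ab.items():
        joint = count / n
        mi += joint * math.log2(joint / ((p_a[a] / n) * (p_b[b] / n)))
    return mi


if __name__ == "__main__":
    msa = ["AKL", "AKV", "GEL", "GEV"]
    print(mutual_information(column(msa, 0), column(msa, 1)))  # perfectly coupled columns
    print(mutual_information(column(msa, 0), column(msa, 2)))  # independent columns
```

In the example, columns 0 and 1 always change together and carry one bit of mutual information, while columns 0 and 2 vary independently and carry none.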
Simultaneous phylogeny reconstruction and multiple sequence alignment
Yue, Feng; Shi, Jian; Tang, Jijun
2009-01-01
Background: A phylogeny is the evolutionary history of a group of organisms. To date, sequence data is still the most used data type for phylogenetic reconstruction. Before any sequences can be used for phylogeny reconstruction, they must be aligned, and the quality of the multiple sequence alignment has been shown to affect the quality of the inferred phylogeny. At the same time, all the current multiple sequence alignment programs use a guide tree to produce the alignment, and experiments have shown that good guide trees can significantly improve the multiple alignment quality. Results: We devise a new algorithm to simultaneously align multiple sequences and search for the phylogenetic tree that leads to the best alignment. We also implemented the algorithm as a C program package, which can handle both DNA and protein data and can take a simple cost model as well as complex substitution matrices, such as PAM250 or BLOSUM62. The performance of the new method is compared with that of other popular multiple sequence alignment tools, including widely used programs such as ClustalW and T-Coffee. Experimental results suggest that this method has good performance in terms of both phylogeny accuracy and alignment quality. Conclusion: We present an algorithm to align multiple sequences and reconstruct the phylogenies that minimize the alignment score, which is based on an efficient algorithm to solve the median problems for three sequences. Our extensive experiments suggest that this method is very promising and can produce high quality phylogenies and alignments. PMID:19208110
NASA Astrophysics Data System (ADS)
Sahu, Indra D.; Hustedt, Eric J.; Ghimire, Harishchandra; Inbaraj, Johnson J.; McCarrick, Robert M.; Lorigan, Gary A.
2014-12-01
An EPR membrane alignment technique was applied to measure distance and relative orientations between two spin labels on a protein oriented along the surface of the membrane. Previously we demonstrated an EPR membrane alignment technique for measuring distances and relative orientations between two spin labels using a dual TOAC-labeled integral transmembrane peptide (M2δ segment of Acetylcholine receptor) as a test system. In this study we further utilized this technique and successfully measured the distance and relative orientations between two spin labels on a membrane peripheral peptide (antimicrobial peptide magainin-2). The TOAC-labeled magainin-2 peptides were mechanically aligned using DMPC lipids on a planar quartz support, and CW-EPR spectra were recorded at specific orientations. Global analysis in combination with rigorous spectral simulation was used to simultaneously analyze data from two different sample orientations for both single- and double-labeled peptides. We measured an internitroxide distance of 15.3 Å from a dual TOAC-labeled magainin-2 peptide at positions 8 and 14 that closely matches with the 13.3 Å distance obtained from a model of the labeled magainin peptide. In addition, the angles determining the relative orientations of the two nitroxides have been determined, and the results compare favorably with molecular modeling. This study demonstrates the utility of the technique for proteins oriented along the surface of the membrane in addition to the previous results for proteins situated within the membrane bilayer.
Community detection in sequence similarity networks based on attribute clustering
Chowdhary, Janamejaya; Loeffler, Frank E.; Smith, Jeremy C.
2017-07-24
Networks are powerful tools for the presentation and analysis of interactions in multi-component systems. A commonly studied mesoscopic feature of networks is their community structure, which arises from grouping together similar nodes into one community and dissimilar nodes into separate communities. In this paper, the community structure of protein sequence similarity networks is determined with a new method: Attribute Clustering Dependent Communities (ACDC). Sequence similarity has hitherto typically been quantified by the alignment score or its expectation value. However, pair alignments with the same score or expectation value cannot thus be differentiated. To overcome this deficiency, the method constructs, for pair alignments, an extended alignment metric, the link attribute vector, which includes the score and other alignment characteristics. Rescaling components of the attribute vectors qualitatively identifies a systematic variation of sequence similarity within protein superfamilies. The problem of community detection is then mapped to clustering the link attribute vectors, selection of an optimal subset of links and community structure refinement based on the partition density of the network. ACDC-predicted communities are found to be in good agreement with gold standard sequence databases for which the "ground truth" community structures (or families) are known. ACDC is therefore a community detection method for sequence similarity networks based entirely on pair similarity information. A serial implementation of ACDC is available from https://cmb.ornl.gov/resources/developments
Raghunathan, VijayKrishna; McKee, Clayton; Cheung, Wai; Naik, Rachel; Nealey, Paul F.; Russell, Paul
2013-01-01
The basement membrane (BM) of the corneal epithelium presents biophysical cues in the form of topography and compliance that can impact the phenotype and behaviors of cells and their nuclei through modulation of cytoskeletal dynamics. In addition, it is also well known that the intrinsic biochemical attributes of BMs can modulate cell behaviors. In this study, we investigated how combining an exogenous coating of extracellular matrix (ECM) proteins (fibronectin-collagen [FNC]) with substratum topography influences cytoskeletal architecture as well as the alignment and migration of immortalized corneal epithelial cells. In the absence of FNC coating, a significantly greater percentage of cells aligned parallel with the long axis of the underlying anisotropically ordered topographic features; however, their ability to migrate was impaired. Additionally, changes in the surface area, elongation, and orientation of cytoskeletal elements were differentially influenced by the presence or absence of FNC. These results suggest that the effects of topographic cues on cells are modulated by the presence of surface-associated ECM proteins. These findings have relevance to experiments using cell cultureware with biomimetic biophysical attributes as well as the integration of biophysical cues in tissue-engineering strategies and the development of improved prosthetics. PMID:23488816
AlexSys: a knowledge-based expert system for multiple sequence alignment construction and analysis
Aniba, Mohamed Radhouene; Poch, Olivier; Marchler-Bauer, Aron; Thompson, Julie Dawn
2010-01-01
Multiple sequence alignment (MSA) is a cornerstone of modern molecular biology and represents a unique means of investigating the patterns of conservation and diversity in complex biological systems. Many different algorithms have been developed to construct MSAs, but previous studies have shown that no single aligner consistently outperforms the rest. This has led to the development of a number of ‘meta-methods’ that systematically run several aligners and merge the output into one single solution. Although these methods generally produce more accurate alignments, they are inefficient because all the aligners need to be run first and the choice of the best solution is made a posteriori. Here, we describe the development of a new expert system, AlexSys, for the multiple alignment of protein sequences. AlexSys incorporates an intelligent inference engine to automatically select an appropriate aligner a priori, depending only on the nature of the input sequences. The inference engine was trained on a large set of reference multiple alignments, using a novel machine learning approach. Applying AlexSys to a test set of 178 alignments, we show that the expert system represents a good compromise between alignment quality and running time, making it suitable for high throughput projects. AlexSys is freely available from http://alnitak.u-strasbg.fr/~aniba/alexsys. PMID:20530533
Effect of Electromechanical Stimulation on the Maturation of Myotubes on Aligned Electrospun Fibers
Liao, I-Chien; Liu, Jason B.; Bursac, Nenad; Leong, Kam W.
2009-01-01
Tissue engineering may provide an alternative to cell injection as a therapeutic solution for myocardial infarction. A tissue-engineered muscle patch may offer better host integration and higher functional performance. This study examined the differentiation of skeletal myoblasts on aligned electrospun polyurethane (PU) fibers and in the presence of electromechanical stimulation. Skeletal myoblasts cultured on aligned PU fibers showed more pronounced elongation, better alignment, higher level of transient receptor potential cation channel-1 (TRPC-1) expression, upregulation of contractile proteins and higher percentage of striated myotubes compared to those cultured on random PU fibers and film. The resulting tissue constructs generated tetanus forces of 1.1 mN with a 10-ms time to tetanus. Additional mechanical, electrical, or synchronized electromechanical stimuli applied to myoblasts cultured on PU fibers increased the percentage of striated myotubes from 70 to 85% under optimal stimulation conditions, which was accompanied by an upregulation of contractile proteins such as α-actinin and myosin heavy chain. In describing how electromechanical cues can be combined with topographical cue, this study helped move towards the goal of generating a biomimetic microenvironment for engineering of functional skeletal muscle. PMID:19774099
SW#db: GPU-Accelerated Exact Sequence Similarity Database Search.
Korpar, Matija; Šošić, Martin; Blažeka, Dino; Šikić, Mile
2015-01-01
In recent years we have witnessed growth in sequencing yield and in the number of samples sequenced and, as a result, growth of publicly maintained sequence databases. This increase in data has placed high demands on protein similarity search algorithms, which face two opposing goals: keeping running times acceptable while maintaining a high enough level of sensitivity. The most time-consuming step of similarity search is the local alignment of query and database sequences. This step is usually performed using exact local alignment algorithms such as Smith-Waterman. Because of its quadratic time complexity, aligning a query to the whole database is usually too slow. Therefore, most protein similarity search methods apply heuristics before the exact local alignment to reduce the number of candidate sequences in the database. However, there is still a need for the alignment of a query sequence to a reduced database. In this paper we present the SW#db tool and a library for fast exact similarity search. Although its running times, as a standalone tool, are comparable to the running times of BLAST, it is primarily intended to be used for the exact local alignment phase, in which the database of sequences has already been reduced. It uses both GPU and CPU parallelization and was 4-5 times faster than SSEARCH, 6-25 times faster than CUDASW++ and more than 20 times faster than SSW at the time of writing, using multiple queries on Swiss-prot and Uniref90 databases.
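The quadratic cost mentioned in the abstract above comes from filling a score matrix with one cell per pair of query and target positions. The following is a minimal CPU sketch of the Smith-Waterman recurrence, not the SW#db implementation: the function name, the linear gap penalty and the match/mismatch values are assumptions made for illustration (real protein searches normally use a substitution matrix and affine gaps).

```python
def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of two strings (linear gap penalty).

    Illustrative scoring values; time is O(len(query) * len(target)).
    Only two rows of the score matrix are kept in memory.
    """
    prev = [0] * (len(target) + 1)
    best = 0
    for i in range(1, len(query) + 1):
        curr = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            sub = match if query[i - 1] == target[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a new local alignment here
                prev[j - 1] + sub,  # extend with a match or mismatch
                prev[j] + gap,      # gap in the target
                curr[j - 1] + gap,  # gap in the query
            )
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Because every database sequence needs its own matrix, the total work grows with the summed length of the database, which is why reducing the candidate set first matters.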
Borozan, Ivan; Watt, Stuart; Ferretti, Vincent
2015-05-01
Alignment-based sequence similarity searches, while accurate for some types of sequences, can produce incorrect results when used on more divergent but functionally related sequences that have undergone the sequence rearrangements observed in many bacterial and viral genomes. Here, we propose a classification model that exploits the complementary nature of alignment-based and alignment-free similarity measures with the aim of improving the accuracy with which DNA and protein sequences are characterized. Our model classifies sequences using a combined sequence similarity score calculated by adaptively weighting the contribution of different sequence similarity measures. Weights are determined independently for each sequence in the test set and reflect the discriminatory ability of individual similarity measures in the training set. Because the similarity between some sequences is determined more accurately with one type of measure rather than another, our classifier allows different sets of weights to be associated with different sequences. Using five different similarity measures, we show that our model significantly improves the classification accuracy over the current composition- and alignment-based models, when predicting the taxonomic lineage for both short viral sequence fragments and complete viral sequences. We also show that our model can be used effectively for the classification of reads from a real metagenome dataset as well as protein sequences. All the datasets and the code used in this study are freely available at https://collaborators.oicr.on.ca/vferretti/borozan_csss/csss.html. Contact: ivan.borozan@gmail.com. Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press.
Borozan, Ivan; Watt, Stuart; Ferretti, Vincent
2015-01-01
Motivation: Alignment-based sequence similarity searches, while accurate for some types of sequences, can produce incorrect results when used on more divergent but functionally related sequences that have undergone the sequence rearrangements observed in many bacterial and viral genomes. Here, we propose a classification model that exploits the complementary nature of alignment-based and alignment-free similarity measures with the aim of improving the accuracy with which DNA and protein sequences are characterized. Results: Our model classifies sequences using a combined sequence similarity score calculated by adaptively weighting the contribution of different sequence similarity measures. Weights are determined independently for each sequence in the test set and reflect the discriminatory ability of individual similarity measures in the training set. Because the similarity between some sequences is determined more accurately with one type of measure rather than another, our classifier allows different sets of weights to be associated with different sequences. Using five different similarity measures, we show that our model significantly improves the classification accuracy over the current composition- and alignment-based models, when predicting the taxonomic lineage for both short viral sequence fragments and complete viral sequences. We also show that our model can be used effectively for the classification of reads from a real metagenome dataset as well as protein sequences. Availability and implementation: All the datasets and the code used in this study are freely available at https://collaborators.oicr.on.ca/vferretti/borozan_csss/csss.html. Contact: ivan.borozan@gmail.com Supplementary information: Supplementary data are available at Bioinformatics online. PMID:25573913
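The idea of a combined score built from several weighted similarity measures can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how the weights are derived, so they are taken here as given inputs, and the measure names, the function names and the normalized weighted sum are hypothetical choices rather than the authors' model.

```python
def combined_score(scores, weights):
    """Weighted average of per-measure similarity scores.

    scores and weights are dicts keyed by measure name (hypothetical names).
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[m] * scores[m] for m in weights) / total


def classify(candidate_scores, weights):
    """Return the candidate label with the highest combined score.

    candidate_scores maps each label to its {measure: score} dict.
    weights is specific to the query sequence being classified.
    """
    return max(
        candidate_scores,
        key=lambda label: combined_score(candidate_scores[label], weights),
    )


if __name__ == "__main__":
    candidates = {
        "taxon_A": {"alignment": 0.9, "kmer": 0.2},
        "taxon_B": {"alignment": 0.4, "kmer": 0.8},
    }
    # A query for which the alignment-free measure is trusted more.
    print(classify(candidates, {"alignment": 0.25, "kmer": 0.75}))
```

Passing a different weight dictionary for each query mirrors the statement that different sets of weights can be associated with different sequences.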
Adhikari, Badri; Hou, Jie; Cheng, Jianlin
2018-03-01
In this study, we report the evaluation of the residue-residue contacts predicted by our three different methods in the CASP12 experiment, focusing on studying the impact of multiple sequence alignment, residue coevolution, and machine learning on contact prediction. The first method (MULTICOM-NOVEL) uses only traditional features (sequence profile, secondary structure, and solvent accessibility) with deep learning to predict contacts and serves as a baseline. The second method (MULTICOM-CONSTRUCT) uses our new alignment algorithm to generate deep multiple sequence alignments to derive coevolution-based features, which are integrated by a neural network method to predict contacts. The third method (MULTICOM-CLUSTER) is a consensus combination of the predictions of the first two methods. We evaluated our methods on 94 CASP12 domains. On a subset of 38 free-modeling domains, our methods achieved an average precision of up to 41.7% for top L/5 long-range contact predictions. The comparison of the three methods shows that the quality and effective depth of multiple sequence alignments, coevolution-based features, and machine learning integration of coevolution-based features and traditional features drive the quality of predicted protein contacts. On the full CASP12 dataset, the coevolution-based features alone can improve the average precision from 28.4% to 41.6%, and the machine learning integration of all the features further raises the precision to 56.3%, when top L/5 predicted long-range contacts are evaluated. The correlation between the precision of contact prediction and the logarithm of the number of effective sequences in alignments is 0.66. © 2017 Wiley Periodicals, Inc.
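The "top L/5 long-range" precision quoted in the abstract above can be computed as in the sketch below. It assumes the common convention that a long-range contact has a sequence separation of at least 24 residues and that L is the length of the domain; the function name and the input layout are hypothetical, and the denominator used here is the number of predictions actually selected.

```python
def top_l5_long_range_precision(predicted, true_contacts, seq_len, min_sep=24):
    """Precision of the top L/5 long-range predicted contacts.

    predicted: iterable of (i, j, probability) residue-pair predictions.
    true_contacts: set of (i, j) pairs with i < j that are real contacts.
    seq_len: L, the number of residues in the domain.
    min_sep: minimum |i - j| for a pair to count as long-range (assumed 24).
    """
    long_range = [
        (prob, min(i, j), max(i, j))
        for i, j, prob in predicted
        if abs(i - j) >= min_sep
    ]
    long_range.sort(reverse=True)        # highest probability first
    top = long_range[: max(1, seq_len // 5)]
    if not top:
        return 0.0
    hits = sum((i, j) in true_contacts for _, i, j in top)
    return hits / len(top)


if __name__ == "__main__":
    preds = [(0, 30, 0.9), (1, 40, 0.8), (2, 50, 0.7), (3, 5, 0.99)]
    truth = {(0, 30), (2, 50)}
    print(top_l5_long_range_precision(preds, truth, seq_len=60))
```

The short-range pair (3, 5) is ignored despite its high probability, which is the point of restricting the evaluation to long-range contacts.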
A Benchmark Study on Error Assessment and Quality Control of CCS Reads Derived from the PacBio RS
Jiao, Xiaoli; Zheng, Xin; Ma, Liang; Kutty, Geetha; Gogineni, Emile; Sun, Qiang; Sherman, Brad T.; Hu, Xiaojun; Jones, Kristine; Raley, Castle; Tran, Bao; Munroe, David J.; Stephens, Robert; Liang, Dun; Imamichi, Tomozumi; Kovacs, Joseph A.; Lempicki, Richard A.; Huang, Da Wei
2013-01-01
PacBio RS, a newly emerging third-generation DNA sequencing platform, is based on a real-time, single-molecule, nano-nitch sequencing technology that can generate very long reads (up to 20-kb) in contrast to the shorter reads produced by the first and second generation sequencing technologies. Because it is a new platform, it is important to assess the sequencing error rate, as well as the quality control (QC) parameters associated with the PacBio sequence data. In this study, a mixture of 10 previously known, closely related DNA amplicons was sequenced using the PacBio RS sequencing platform. After aligning Circular Consensus Sequence (CCS) reads derived from the above sequencing experiment to the known reference sequences, we found that the median error rate was 2.5% without read QC, and improved to 1.3% with an SVM-based multi-parameter QC method. In addition, a de novo assembly was used as a downstream application to evaluate the effects of different QC approaches. This benchmark study indicates that even though CCS reads are post error-corrected, it is still necessary to perform appropriate QC on CCS reads in order to produce successful downstream bioinformatics analytical results. PMID:24179701
A Benchmark Study on Error Assessment and Quality Control of CCS Reads Derived from the PacBio RS.
Jiao, Xiaoli; Zheng, Xin; Ma, Liang; Kutty, Geetha; Gogineni, Emile; Sun, Qiang; Sherman, Brad T; Hu, Xiaojun; Jones, Kristine; Raley, Castle; Tran, Bao; Munroe, David J; Stephens, Robert; Liang, Dun; Imamichi, Tomozumi; Kovacs, Joseph A; Lempicki, Richard A; Huang, Da Wei
2013-07-31
PacBio RS, a newly emerging third-generation DNA sequencing platform, is based on a real-time, single-molecule, nano-nitch sequencing technology that can generate very long reads (up to 20-kb) in contrast to the shorter reads produced by the first and second generation sequencing technologies. Because it is a new platform, it is important to assess the sequencing error rate, as well as the quality control (QC) parameters associated with the PacBio sequence data. In this study, a mixture of 10 previously known, closely related DNA amplicons was sequenced using the PacBio RS sequencing platform. After aligning Circular Consensus Sequence (CCS) reads derived from the above sequencing experiment to the known reference sequences, we found that the median error rate was 2.5% without read QC, and improved to 1.3% with an SVM-based multi-parameter QC method. In addition, a de novo assembly was used as a downstream application to evaluate the effects of different QC approaches. This benchmark study indicates that even though CCS reads are post error-corrected, it is still necessary to perform appropriate QC on CCS reads in order to produce successful downstream bioinformatics analytical results.
NASA Astrophysics Data System (ADS)
Govin, A.; Capron, E.; Tzedakis, P. C.; Verheyden, S.; Ghaleb, B.; Hillaire-Marcel, C.; St-Onge, G.; Stoner, J. S.; Bassinot, F.; Bazin, L.; Blunier, T.; Combourieu-Nebout, N.; El Ouahabi, A.; Genty, D.; Gersonde, R.; Jimenez-Amat, P.; Landais, A.; Martrat, B.; Masson-Delmotte, V.; Parrenin, F.; Seidenkrantz, M.-S.; Veres, D.; Waelbroeck, C.; Zahn, R.
2015-12-01
The Last Interglacial (LIG) represents an invaluable case study to investigate the response of components of the Earth system to global warming. However, the scarcity of absolute age constraints in most archives leads to extensive use of various stratigraphic alignments to different reference chronologies. This feature sets limitations to the accuracy of the stratigraphic assignment of the climatic sequence of events across the globe during the LIG. Here, we review the strengths and limitations of the methods that are commonly used to date or develop chronologies in various climatic archives for the time span (∼140-100 ka) encompassing the penultimate deglaciation, the LIG and the glacial inception. Climatic hypotheses underlying record alignment strategies and the interpretation of tracers are explicitly described. Quantitative estimates of the associated absolute and relative age uncertainties are provided. Recommendations are subsequently formulated on how best to define absolute and relative chronologies. Future climato-stratigraphic alignments should provide (1) a clear statement of climate hypotheses involved, (2) a detailed understanding of environmental parameters controlling selected tracers and (3) a careful evaluation of the synchronicity of aligned paleoclimatic records. We underscore the need to (1) systematically report quantitative estimates of relative and absolute age uncertainties, (2) assess the coherence of chronologies when comparing different records, and (3) integrate these uncertainties in paleoclimatic interpretations and comparisons with climate simulations. Finally, we provide a sequence of major climatic events with associated age uncertainties for the period 140-105 ka, which should serve as a new benchmark to disentangle mechanisms of the Earth system's response to orbital forcing and evaluate transient climate simulations.
Protein-protein docking using region-based 3D Zernike descriptors
2009-01-01
Background: Protein-protein interactions are a pivotal component of many biological processes and mediate a variety of functions. Knowing the tertiary structure of a protein complex is therefore essential for understanding the interaction mechanism. However, experimental techniques to solve the structure of the complex are often found to be difficult. To this end, computational protein-protein docking approaches can provide a useful alternative to address this issue. Prediction of docking conformations relies on methods that effectively capture shape features of the participating proteins while giving due consideration to conformational changes that may occur. Results: We present a novel protein docking algorithm based on the use of 3D Zernike descriptors as regional features of molecular shape. The key motivation of using these descriptors is their invariance to transformation, in addition to a compact representation of local surface shape characteristics. Docking decoys are generated using geometric hashing, which are then ranked by a scoring function that incorporates a buried surface area and a novel geometric complementarity term based on normals associated with the 3D Zernike shape description. Our docking algorithm was tested on both bound and unbound cases in the ZDOCK benchmark 2.0 dataset. In 74% of the bound docking predictions, our method was able to find a near-native solution (interface Cα RMSD ≤ 2.5 Å) within the top 1000 ranks. For unbound docking, among the 60 complexes for which our algorithm returned at least one hit, 60% of the cases were ranked within the top 2000. Comparison with existing shape-based docking algorithms shows that our method has a better performance than the others in unbound docking while remaining competitive for bound docking cases. Conclusion: We show for the first time that the 3D Zernike descriptors are adept in capturing shape complementarity at the protein-protein interface and useful for protein docking prediction. Rigorous benchmark studies show that our docking approach has a superior performance compared to existing methods. PMID:20003235
Protein-protein docking using region-based 3D Zernike descriptors.
Venkatraman, Vishwesh; Yang, Yifeng D; Sael, Lee; Kihara, Daisuke
2009-12-09
Protein-protein interactions are a pivotal component of many biological processes and mediate a variety of functions. Knowing the tertiary structure of a protein complex is therefore essential for understanding the interaction mechanism. However, experimental techniques to solve the structure of the complex are often found to be difficult. To this end, computational protein-protein docking approaches can provide a useful alternative to address this issue. Prediction of docking conformations relies on methods that effectively capture shape features of the participating proteins while giving due consideration to conformational changes that may occur. We present a novel protein docking algorithm based on the use of 3D Zernike descriptors as regional features of molecular shape. The key motivation of using these descriptors is their invariance to transformation, in addition to a compact representation of local surface shape characteristics. Docking decoys are generated using geometric hashing, which are then ranked by a scoring function that incorporates a buried surface area and a novel geometric complementarity term based on normals associated with the 3D Zernike shape description. Our docking algorithm was tested on both bound and unbound cases in the ZDOCK benchmark 2.0 dataset. In 74% of the bound docking predictions, our method was able to find a near-native solution (interface Cα RMSD ≤ 2.5 Å) within the top 1000 ranks. For unbound docking, among the 60 complexes for which our algorithm returned at least one hit, 60% of the cases were ranked within the top 2000. Comparison with existing shape-based docking algorithms shows that our method has a better performance than the others in unbound docking while remaining competitive for bound docking cases. We show for the first time that the 3D Zernike descriptors are adept in capturing shape complementarity at the protein-protein interface and useful for protein docking prediction. Rigorous benchmark studies show that our docking approach has a superior performance compared to existing methods.
Monitoring the delivery of cancer care: Commission on Cancer and National Cancer Data Base.
Williams, Richelle T; Stewart, Andrew K; Winchester, David P
2012-07-01
The primary objective of the Commission on Cancer (CoC) is to ensure the delivery of comprehensive, high-quality care that improves survival while maintaining quality of life for patients with cancer. This article examines the initiatives of the CoC toward achieving this goal, utilizing data from the National Cancer Data Base (NCDB) to monitor treatment patterns and outcomes, to develop quality measures, and to benchmark hospital performance. The article also highlights how these initiatives align with the Institute of Medicine's recommendations for improving the quality of cancer care and briefly explores future projects of the CoC and NCDB. Copyright © 2012 Elsevier Inc. All rights reserved.
Consumer Views on Plug-in Electric Vehicles -- National Benchmark Report (Second Edition)
DOE Office of Scientific and Technical Information (OSTI.GOV)
Singer, Mark
2016-12-01
Vehicle manufacturers, government agencies, universities, private researchers, and organizations worldwide are pursuing advanced vehicle technologies that aim to reduce the consumption of petroleum in the forms of gasoline and diesel. Plug-in electric vehicles (PEVs) are one such technology. This report, an update to the version published in January 2016, details findings from a study in February 2015 of broad American public sentiments toward issues that surround PEVs. This report is supported by the U.S. Department of Energy's Vehicle Technologies Office in alignment with its mission to develop and deploy these technologies to improve energy security, enhance mobility flexibility, reduce transportation costs, and increase environmental sustainability.
Measuring and monitoring IT using a balanced scorecard approach.
Gash, Deborah J; Hatton, Todd
2007-01-01
Ensuring that the information technology department is aligned with the overall health system strategy and is performing at a consistently high level is a priority at Saint Luke's Health System in Kansas City, Mo. The information technology department of Saint Luke's Health System has been using the balanced scorecard approach described in this article to measure and monitor its performance for four years. This article will review the structure of the IT department's scorecard; the categories and measures used; how benchmarks are determined; how linkage to the organizational scorecard is made; how results are reported; how changes are made to the scorecard; and tips for using a scorecard in other IT departments.
Automatic Data Traffic Control on DSM Architecture
NASA Technical Reports Server (NTRS)
Frumkin, Michael; Jin, Hao-Qiang; Yan, Jerry; Kwak, Dochan (Technical Monitor)
2000-01-01
We study data traffic on distributed shared memory machines and conclude that data placement and grouping improve performance of scientific codes. We present several methods that users can employ to improve data traffic in their code. We report on the implementation of a tool that detects the code fragments causing data congestion and advises the user on improvements to data routing in these fragments. The capabilities of the tool include deduction of data alignment and affinity from the source code; detection of code constructs with abnormally high cache or TLB misses; and generation of data placement constructs. We demonstrate the capabilities of the tool in experiments with the NAS parallel benchmarks and with a simple computational fluid dynamics application, ARC3D.
Craig, G.D.; Glass, R.; Rupp, B.
1997-01-28
A method is disclosed for forming synthetic crystals of proteins in a carrier fluid by use of the dipole moments of protein macromolecules that self-align in the Helmholtz layer adjacent to an electrode. The voltage gradients of such layers easily exceed 10^6 V/m. The synthetic protein crystals are subjected to x-ray crystallography to determine the conformational structure of the protein involved. 2 figs.
Estimation of relative effectiveness of phylogenetic programs by machine learning.
Krivozubov, Mikhail; Goebels, Florian; Spirin, Sergei
2014-04-01
Reconstruction of the phylogeny of a protein family from a sequence alignment can produce results of varying quality. Our goal is to predict the quality of phylogeny reconstruction based on features that can be extracted from the input alignment. We used the Fitch-Margoliash (FM) method of phylogeny reconstruction and a random forest as the predictor. For training and testing the predictor, alignments of orthologous series (OS) were used, for which the result of phylogeny reconstruction can be evaluated by comparison with the trees of the corresponding organisms. Our results show that the quality of phylogeny reconstruction can be predicted with more than 80% precision. We also tried to predict which phylogeny reconstruction method, FM or UPGMA, is better for a particular alignment. With the set of features used, among alignments for which the obtained predictor predicts a better performance of UPGMA, 56% really give a better result with UPGMA. Given that UPGMA performs better for only 34% of the alignments in our testing set, this result shows that it is possible in principle to predict the better phylogeny reconstruction method based on features of a sequence alignment.
Hal: an automated pipeline for phylogenetic analyses of genomic data.
Robbertse, Barbara; Yoder, Ryan J; Boyd, Alex; Reeves, John; Spatafora, Joseph W
2011-02-07
The rapid increase in genomic and genome-scale data is resulting in unprecedented levels of discrete sequence data available for phylogenetic analyses. Major analytical impasses exist, however, prior to analyzing these data with existing phylogenetic software. Obstacles include the management of large data sets without standardized naming conventions, identification and filtering of orthologous clusters of proteins or genes, and the assembly of alignments of orthologous sequence data into individual and concatenated super alignments. Here we report the production of an automated pipeline, Hal, which produces multiple alignments and trees from genomic data. These alignments can be produced by a choice of four alignment programs and analyzed by a variety of phylogenetic programs. In short, the Hal pipeline connects the programs BLASTP, MCL, user-specified alignment programs, GBlocks, ProtTest and user-specified phylogenetic programs to produce species trees. The script is available at SourceForge (http://sourceforge.net/projects/bio-hal/). The results from an example analysis of Kingdom Fungi are briefly discussed.
A Linked Series of Laboratory Exercises in Molecular Biology Utilizing Bioinformatics and GFP
ERIC Educational Resources Information Center
Medin, Carey L.; Nolin, Katie L.
2011-01-01
Molecular biologists commonly use bioinformatics to map and analyze DNA and protein sequences and to align different DNA and protein sequences for comparison. Additionally, biologists can create and view 3D models of protein structures to further understand intramolecular interactions. The primary goal of this 10-week laboratory was to introduce…
Automated batch fiducial-less tilt-series alignment in Appion using Protomo
Noble, Alex J.; Stagg, Scott M.
2015-01-01
The field of electron tomography has benefited greatly from manual and semi-automated approaches to marker-based tilt-series alignment that have allowed for the structural determination of multitudes of in situ cellular structures as well as macromolecular structures of individual protein complexes. The emergence of complementary metal-oxide semiconductor detectors capable of detecting individual electrons has enabled the collection of low dose, high contrast images, opening the door for reliable correlation-based tilt-series alignment. Here we present a set of automated, correlation-based tilt-series alignment, contrast transfer function (CTF) correction, and reconstruction workflows for use in conjunction with the Appion/Leginon package that are primarily targeted at automating structure determination with cryogenic electron microscopy. PMID:26455557
Validation of adenosine triphosphate to audit manual cleaning of flexible endoscope channels.
Alfa, Michelle J; Fatima, Iram; Olson, Nancy
2013-03-01
Compliance with cleaning of flexible endoscope channels cannot be verified using visual inspection. Adenosine triphosphate (ATP) has been suggested as a possible rapid cleaning monitor for flexible endoscope channels. There have not been published validation studies to specify the level of ATP that indicates inadequate cleaning has been achieved. The objective of this study was to validate the Clean-Trace (3M Inc, St. Paul, MN) ATP water test method for monitoring manual cleaning of flexible endoscopes. This was a simulated use study using a duodenoscope as the test device. Artificial test soil containing 10^6 colony-forming units of Pseudomonas aeruginosa and Enterococcus faecalis was used to perfuse all channels. The flush sample method for the suction-biopsy (L1) or air-water channel (L2) using 40 and 20 mL sterile reverse osmosis water, respectively, was validated. Residuals of ATP, protein, hemoglobin, and bioburden were quantitated from channel samples taken from uncleaned, partially cleaned, and fully cleaned duodenoscopes. The benchmarks for clean were as follows: <6.4 μg/cm² protein, <2.2 μg/cm² hemoglobin, and <4-log10 colony-forming units/cm² bioburden. The average ATP in clean channel samples was 27.7 RLUs and 154 RLUs for L1 and L2, respectively (<200 RLUs for all channels). The average protein, hemoglobin, and bioburden benchmarks were achieved if <200 RLUs were detected. If the channel sample was >200 RLUs, the residual organic and bioburden levels would exceed the acceptable benchmarks. Our data validated that flexible endoscopes that have complete manual cleaning will have <200 RLUs by the Clean-Trace ATP test. Copyright © 2013 Association for Professionals in Infection Control and Epidemiology, Inc. Published by Mosby, Inc. All rights reserved.
Fee, Timothy; Surianarayanan, Swetha; Downs, Crawford; Zhou, Yong; Berry, Joel
2016-01-01
To examine the influence of substrate topology on the behavior of fibroblasts, tissue engineering scaffolds were electrospun from polycaprolactone (PCL) and a blend of PCL and gelatin (PCL+Gel) to produce matrices with both random and aligned nanofibrous orientations. The addition of gelatin to the scaffold was shown to increase the hydrophilicity of the PCL matrix and to increase the proliferation of NIH3T3 cells compared to scaffolds of PCL alone. The orientation of nanofibers within the matrix did not have an effect on the proliferation of adherent cells, but cells on aligned substrates were shown to elongate and align parallel to the direction of substrate fiber alignment. A microarray of cytoskeleton regulators was probed to examine differences in gene expression between cells grown on aligned and randomly oriented substrates. It was found that transcriptional expression of eight genes was statistically different between the two conditions, with all of them being upregulated in the aligned condition. The proteins encoded by these genes are linked to production and polymerization of actin microfilaments, as well as focal adhesion assembly. Taken together, the data indicate that NIH3T3 fibroblasts on aligned substrates align themselves parallel with their substrate and increase production of actin and focal adhesion related genes.
Yu, Jia; Blom, Jochen; Sczyrba, Alexander; Goesmann, Alexander
2017-09-10
The introduction of next generation sequencing has caused a steady increase in the amounts of data that have to be processed in modern life science. Sequence alignment plays a key role in the analysis of sequencing data e.g. within whole genome sequencing or metagenome projects. BLAST is a commonly used alignment tool that was the standard approach for more than two decades, but in the last years faster alternatives have been proposed including RapSearch, GHOSTX, and DIAMOND. Here we introduce HAMOND, an application that uses Apache Hadoop to parallelize DIAMOND computation in order to scale-out the calculation of alignments. HAMOND is fault tolerant and scalable by utilizing large cloud computing infrastructures like Amazon Web Services. HAMOND has been tested in comparative genomics analyses and showed promising results both in efficiency and accuracy. Copyright © 2017 The Author(s). Published by Elsevier B.V. All rights reserved.
Wang, Hsin-Wei; Hsu, Yen-Chu; Hwang, Jenn-Kang; Lyu, Ping-Chiang; Pai, Tun-Wen; Tang, Chuan Yi
2010-01-01
This work presents a novel detection method for three-dimensional domain swapping (DS), a mechanism for forming protein quaternary structures that can be visualized as if monomers had “opened” their “closed” structures and exchanged the opened portion to form intertwined oligomers. Since the first report of DS in the mid 1990s, an increasing number of identified cases has led to the postulation that DS might occur in a protein with an unconstrained terminus under appropriate conditions. DS may play important roles in the molecular evolution and functional regulation of proteins and the formation of depositions in Alzheimer's and prion diseases. Moreover, it is promising for designing auto-assembling biomaterials. Despite the increasing interest in DS, related bioinformatics methods are rarely available. Owing to a dramatic conformational difference between the monomeric/closed and oligomeric/open forms, conventional structural comparison methods are inadequate for detecting DS. Hence, there is also a lack of comprehensive datasets for studying DS. Based on angle-distance (A-D) image transformations of secondary structural elements (SSEs), specific patterns within A-D images can be recognized and classified for structural similarities. In this work, a matching algorithm to extract corresponding SSE pairs from A-D images and a novel DS score have been designed and demonstrated to be applicable to the detection of DS relationships. The Matthews correlation coefficient (MCC) and sensitivity of the proposed DS-detecting method were higher than 0.81 even when the sequence identities of the proteins examined were lower than 10%. On average, the alignment percentage and root-mean-square distance (RMSD) computed by the proposed method were 90% and 1.8 Å for a set of 1,211 DS-related pairs of proteins. The performance of the structural alignments remains high and stable for DS-related homologs with less than 10% sequence identity. In addition, the quality of its hinge loop determination is comparable to that of manual inspection. This method has been implemented as a web-based tool, which takes two protein structures as input and determines the existence and/or type of DS relationship between them according to the A-D image-based structural alignments and the DS score. The proposed method is expected to trigger large-scale studies of this interesting structural phenomenon and facilitate related applications. PMID:20976204
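The abstract above reports a Matthews correlation coefficient (MCC) and a sensitivity for the DS detector. As a reminder of how these two figures are derived from a confusion matrix, here is a minimal Python sketch. The function names and the example counts are illustrative assumptions and are not taken from the paper.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC from confusion-matrix counts; returns 0.0 when it is undefined."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


def sensitivity(tp, fn):
    """Fraction of true DS pairs that the detector recovers (recall)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the formulas.
    print(round(matthews_corrcoef(tp=90, tn=80, fp=20, fn=10), 4))
    print(sensitivity(tp=90, fn=10))
```

MCC uses all four cells of the confusion matrix, so it stays informative when DS-related pairs are much rarer than unrelated pairs, which is presumably why it is reported alongside sensitivity.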
ProteinWorldDB: querying radical pairwise alignments among protein sets from complete genomes
Otto, Thomas Dan; Catanho, Marcos; Tristão, Cristian; Bezerra, Márcia; Fernandes, Renan Mathias; Elias, Guilherme Steinberger; Scaglia, Alexandre Capeletto; Bovermann, Bill; Berstis, Viktors; Lifschitz, Sergio; de Miranda, Antonio Basílio; Degrave, Wim
2010-01-01
Motivation: Many analyses in modern biological research are based on comparisons between biological sequences, resulting in functional, evolutionary and structural inferences. When large numbers of sequences are compared, heuristics are often used resulting in a certain lack of accuracy. In order to improve and validate results of such comparisons, we have performed radical all-against-all comparisons of 4 million protein sequences belonging to the RefSeq database, using an implementation of the Smith–Waterman algorithm. This extremely intensive computational approach was made possible with the help of World Community Grid™, through the Genome Comparison Project. The resulting database, ProteinWorldDB, which contains coordinates of pairwise protein alignments and their respective scores, is now made available. Users can download, compare and analyze the results, filtered by genomes, protein functions or clusters. ProteinWorldDB is integrated with annotations derived from Swiss-Prot, Pfam, KEGG, NCBI Taxonomy database and gene ontology. The database is a unique and valuable asset, representing a major effort to create a reliable and consistent dataset of cross-comparisons of the whole protein content encoded in hundreds of completely sequenced genomes using a rigorous dynamic programming approach. Availability: The database can be accessed through http://proteinworlddb.org Contact: otto@fiocruz.br PMID:20089515
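The all-against-all comparison described above rests on Smith–Waterman local alignment. The following is a minimal Python sketch of the scoring recurrence and of a pairwise loop over a small sequence set. The match, mismatch and linear gap values are placeholder assumptions; a production run on protein sequences would more plausibly use a substitution matrix and affine gap penalties, and the function names here are hypothetical.

```python
from itertools import combinations


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at zero lets an alignment restart anywhere (local alignment).
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def all_against_all(seqs):
    """Score every unordered pair of named sequences."""
    return {
        (x, y): smith_waterman_score(seqs[x], seqs[y])
        for x, y in combinations(seqs, 2)
    }


if __name__ == "__main__":
    print(all_against_all({"p1": "ACDE", "p2": "ACDE", "p3": "WWWW"}))
```

The number of pairs grows quadratically with the number of sequences, which illustrates why a comparison of 4 million proteins needed grid-scale computing.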
NASA Astrophysics Data System (ADS)
Nagy, Julia; Eilert, Tobias; Michaelis, Jens
2018-03-01
Modern hybrid structural analysis methods have opened new possibilities to analyze and resolve flexible protein complexes where conventional crystallographic methods have reached their limits. Here, the Fast-Nano-Positioning System (Fast-NPS), a Bayesian parameter estimation-based analysis method and software, is an interesting method since it allows for the localization of unknown fluorescent dye molecules attached to macromolecular complexes based on single-molecule Förster resonance energy transfer (smFRET) measurements. However, the precision, accuracy, and reliability of structural models derived from results based on such complex calculation schemes are oftentimes difficult to evaluate. Therefore, we present two proof-of-principle benchmark studies where we use smFRET data to localize supposedly unknown positions on a DNA as well as on a protein-nucleic acid complex. Since we use complexes where structural information is available, we can compare Fast-NPS localization to the existing structural data. In particular, we compare different dye models and discuss how both accuracy and precision can be optimized.
NASA Astrophysics Data System (ADS)
Thompson, Brianna C.; Chen, Jun; Moulton, Simon E.; Wallace, Gordon G.
2010-04-01
An aligned CNT array membrane electrode has been used as a nanostructured supporting platform for polypyrrole (PPy) films, exhibiting significant improvement in the controlled release of neurotrophin. In terms of linearity of release, stimulated to unstimulated control of NT-3 release and increased mass and % release of incorporated NT-3, the nanostructured material performed more favourably than the flat PPy film.
Bioinformatic prediction and in vivo validation of residue-residue interactions in human proteins
NASA Astrophysics Data System (ADS)
Jordan, Daniel; Davis, Erica; Katsanis, Nicholas; Sunyaev, Shamil
2014-03-01
Identifying residue-residue interactions in protein molecules is important for understanding both protein structure and function in the context of evolutionary dynamics and medical genetics. Such interactions can be difficult to predict using existing empirical or physical potentials, especially when residues are far from each other in sequence space. Using a multiple sequence alignment of 46 diverse vertebrate species we explore the space of allowed sequences for orthologous protein families. Amino acid changes that are known to damage protein function allow us to identify specific changes that are likely to have interacting partners. We fit the parameters of the continuous-time Markov process used in the alignment to conclude that these interactions are primarily pairwise, rather than higher order. Candidates for sites under pairwise epistasis are predicted, which can then be tested by experiment. We report the results of an initial round of in vivo experiments in a zebrafish model that verify the presence of multiple pairwise interactions predicted by our model. These experimentally validated interactions are novel, distant in sequence, and are not readily explained by known biochemical or biophysical features.
Xu, Dong; Zhang, Jian; Roy, Ambrish; Zhang, Yang
2011-01-01
I-TASSER is an automated pipeline for protein tertiary structure prediction using multiple threading alignments and iterative structure assembly simulations. In CASP9 experiments, two new algorithms, QUARK and FG-MD, were added to the I-TASSER pipeline for improving the structural modeling accuracy. QUARK is a de novo structure prediction algorithm used for structure modeling of proteins that lack detectable template structures. For distantly homologous targets, QUARK models are found useful as a reference structure for selecting good threading alignments and guiding the I-TASSER structure assembly simulations. FG-MD is an atomic-level structural refinement program that uses structural fragments collected from the PDB structures to guide molecular dynamics simulation and improve the local structure of predicted model, including hydrogen-bonding networks, torsion angles and steric clashes. Despite considerable progress in both the template-based and template-free structure modeling, significant improvements on protein target classification, domain parsing, model selection, and ab initio folding of beta-proteins are still needed to further improve the I-TASSER pipeline. PMID:22069036
SnapDock—template-based docking by Geometric Hashing
Estrin, Michael; Wolfson, Haim J.
2017-01-01
Motivation: A highly efficient template-based protein–protein docking algorithm, nicknamed SnapDock, is presented. It employs a Geometric Hashing-based structural alignment scheme to align the target proteins to the interfaces of non-redundant protein–protein interface libraries. Docking of a pair of proteins utilizing the 22 600 interface PIFACE library is performed in < 2 min on average. A flexible version of the algorithm allowing hinge motion in one of the proteins is presented as well. Results: To evaluate the performance of the algorithm, a blind re-modelling of 3547 PDB complexes that were uploaded after the PIFACE publication was performed, with a success ratio of about 35%. Interestingly, a similar experiment with the template-free PatchDock docking algorithm yielded a success rate of about 23%, with roughly 1/3 of the solutions different from those of SnapDock. Consequently, the combination of the two methods gave a 42% success ratio. Availability and implementation: A web server of the application is under development. Contact: michaelestrin@gmail.com or wolfson@tau.ac.il PMID:28881968
Nisius, Britta; Gohlke, Holger
2012-09-24
Analyzing protein binding sites provides detailed insights into the biological processes proteins are involved in, e.g., into drug-target interactions, and so is of crucial importance in drug discovery. Herein, we present novel alignment-independent binding site descriptors based on DrugScore potential fields. The potential fields are transformed to a set of information-rich descriptors using a series expansion in 3D Zernike polynomials. The resulting Zernike descriptors show a promising performance in detecting similarities among proteins with low pairwise sequence identities that bind identical ligands, as well as within subfamilies of one target class. Furthermore, the Zernike descriptors are robust against structural variations among protein binding sites. Finally, the Zernike descriptors show a high data compression power, and computing similarities between binding sites based on these descriptors is highly efficient. Consequently, the Zernike descriptors are a useful tool for computational binding site analysis, e.g., to predict the function of novel proteins, off-targets for drug candidates, or novel targets for known drugs.
Ma, Maggie P. C.; Robinson, Phillip J.; Chircop, Megan
2013-01-01
Sorting nexin 9 (SNX9) and clathrin heavy chain (CHC) each have roles in mitosis during metaphase. Since the two proteins directly interact for their other cellular function in endocytosis we investigated whether they also interact for metaphase and operate on the same pathway. We report that SNX9 and CHC functionally interact during metaphase in a specific molecular pathway that contributes to stabilization of mitotic spindle kinetochore (K)-fibres for chromosome alignment and segregation. This function is independent of their endocytic role. SNX9 residues in the clathrin-binding low complexity domain are required for CHC association and for targeting both CHC and transforming acidic coiled-coil protein 3 (TACC3) to the mitotic spindle. Mutation of these sites to serine increases the metaphase plate width, indicating inefficient chromosome congression. Therefore SNX9 and CHC function in the same molecular pathway for chromosome alignment and segregation, which is dependent on their direct association. PMID:23861900
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shvartsburg, Alexandre A.
2014-11-04
Biomacromolecules tend to assume numerous structures in solution or the gas phase. It has been possible to resolve disparate conformational families but not unique geometries within each, and drastic peak broadening has been the bane of protein analyses by chromatography, electrophoresis, and ion mobility spectrometry (IMS). The new differential IMS (FAIMS) approach using hydrogen-rich gases was recently found to separate conformers of the small protein ubiquitin with the same peak width and resolving power, up to ~400, as for peptides. The present work explores the reach of this approach for larger proteins, exemplified by cytochrome c and myoglobin. Resolution similar to that for ubiquitin was largely achieved with longer separations, while the onset of peak broadening and coalescence with shorter separations suggests the limitation of the present technique to proteins under ~20 kDa. This capability may enable distinguishing whole proteins with differing residue sequences or localizations of posttranslational modifications. Small features at negative compensation voltages that markedly grow from cytochrome c to myoglobin indicate the dipole alignment of rare conformers in accord with theory, further supporting the concept of pendular macroions in FAIMS.
Fauzi, M B; Lokanathan, Y; Aminuddin, B S; Ruszymah, B H I; Chowdhury, S R
2016-11-01
Collagen is the most abundant extracellular matrix (ECM) protein in the human body and is thus widely used in tissue engineering and subsequent clinical applications. This study aimed to extract collagen from ovine (Ovis aries) Achilles tendon (OTC), and to evaluate its physicochemical properties and its potential to fabricate thin films with collagen fibrils in a random or aligned orientation. Acid-solubilized protein was extracted from ovine Achilles tendon using 0.35 M acetic acid, and 80% of the extracted protein was measured as collagen. SDS-PAGE and mass spectrometry analysis revealed the presence of the alpha 1 and alpha 2 chains of collagen type I (col I). Further analysis with Fourier transform infrared spectrometry (FTIR), X-ray diffraction (XRD) and energy dispersive X-ray spectroscopy (EDS) confirmed the presence of the triple helix structure of col I, similar to commercially available rat tail col I. Drying the OTC solution at 37°C resulted in formation of a thin film with randomly orientated collagen fibrils (random collagen film; RCF). Introduction of unidirectional mechanical intervention using a platform rocker prior to drying facilitated the fabrication of a film with aligned orientation of collagen fibrils (aligned collagen film; ACF). Both RCF and ACF significantly enhanced human dermal fibroblast (HDF) attachment and proliferation compared with a plastic surface. Moreover, cells were distributed randomly on RCF, but aligned with the direction of mechanical intervention on ACF. In conclusion, ovine tendon could be an alternative source of col I to fabricate scaffolds for tissue engineering applications. Copyright © 2016 Elsevier B.V. All rights reserved.
Zhou, Ren-Bin; Lu, Hui-Meng; Liu, Jie; Shi, Jian-Yu; Zhu, Jing; Lu, Qin-Qin; Yin, Da-Chuan
2016-01-01
Recombinant expression of proteins has become an indispensable tool in modern day research. The large yields of recombinantly expressed proteins accelerate the structural and functional characterization of proteins. Nevertheless, some reports in the literature indicate that recombinant proteins show differences in structure and function compared with the native ones. More than 100,000 structures (from both recombinant and native sources) are now publicly available in the Protein Data Bank (PDB) archive, which makes it possible to investigate whether any proteins in the RCSB PDB archive have identical sequences but differ in structure. In this paper, we present the results of a systematic comparative study of the 3D structures of identical naturally purified versus recombinantly expressed proteins. The structural data and sequence information of the proteins were mined from the RCSB PDB archive. The combinatorial extension (CE), FATCAT-flexible and TM-Align methods were employed to align the protein structures. The root-mean-square distance (RMSD), TM-score, P-value, Z-score, secondary structural elements and hydrogen bonds were used to assess the structure similarity. A thorough analysis of the PDB archive generated 517 pairs of native and recombinant proteins that have identical sequences. There were no pairs of proteins that had the same sequence and a significantly different structural fold, which supports the hypothesis that proteins expressed in a heterologous host usually fold correctly into their native forms. PMID:27517583
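RMSD is the first similarity measure listed in the abstract above. A minimal Python sketch of the calculation follows, under the assumption that the two structures have already been superimposed by an aligner (such as CE or TM-Align) and reduced to equal-length lists of matched atom coordinates. The function name and the coordinates are illustrative only.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square distance between matched, pre-superimposed 3D points."""
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


if __name__ == "__main__":
    native = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
    recombinant = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0)]  # hypothetical 1 Å shift
    print(rmsd(native, recombinant))
```

This sketch does not perform the superposition itself; finding the optimal rotation and translation is a separate step handled by the alignment programs named in the abstract.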
On the relationship between tumour growth rate and survival in non-small cell lung cancer.
Mistry, Hitesh B
2017-01-01
A recurrent question within oncology drug development is predicting phase III outcome for a new treatment using early clinical data. One approach to tackle this problem has been to derive metrics from mathematical models that describe tumour size dynamics, termed re-growth rate and time to tumour re-growth. They have been shown to be strong predictors of overall survival in numerous studies, but there is debate about how these metrics are derived and whether they are more predictive than empirical end-points. This work explores the issues raised in using model-derived metrics as predictors for survival analyses. Re-growth rate and time to tumour re-growth were calculated for three large clinical studies by forward and reverse alignment. The latter involves re-aligning patients to their time of progression. Hence, it accounts for the time taken to estimate re-growth rate and time to tumour re-growth but also assesses whether these predictors correlate with survival from the time of progression. I found that neither re-growth rate nor time to tumour re-growth correlated with survival using reverse alignment. This suggests that the dynamics of tumours up until disease progression have no relationship to survival post progression. For prediction of a phase III trial I found the metrics performed no better than empirical end-points. These results highlight that care must be taken when relating dynamics of tumour imaging to survival and that bench-marking new approaches against existing ones is essential.
Fast-SG: an alignment-free algorithm for hybrid assembly.
Di Genova, Alex; Ruz, Gonzalo A; Sagot, Marie-France; Maass, Alejandro
2018-05-01
Long-read sequencing technologies are the ultimate solution for genome repeats, allowing near reference-level reconstructions of large genomes. However, long-read de novo assembly pipelines are computationally intense and require a considerable amount of coverage, thereby hindering their broad application to the assembly of large genomes. Alternatively, hybrid assembly methods that combine short- and long-read sequencing technologies can reduce the time and cost required to produce de novo assemblies of large genomes. Here, we propose a new method, called Fast-SG, that uses a new ultrafast alignment-free algorithm specifically designed for constructing a scaffolding graph using light-weight data structures. Fast-SG can construct the graph from either short or long reads. This allows the reuse of efficient algorithms designed for short-read data and permits the definition of novel modular hybrid assembly pipelines. Using comprehensive standard datasets and benchmarks, we show how Fast-SG outperforms the state-of-the-art short-read aligners when building the scaffolding graph and can be used to extract linking information from either raw or error-corrected long reads. We also show how a hybrid assembly approach using Fast-SG with shallow long-read coverage (5X) and moderate computational resources can produce long-range and accurate reconstructions of the genomes of Arabidopsis thaliana (Ler-0) and human (NA12878). Fast-SG opens a door to achieve accurate hybrid long-range reconstructions of large genomes with low effort, high portability, and low cost.
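The abstract does not spell out how the alignment-free step works. One common way to place reads on contigs without computing alignments is exact lookup of k-mers that occur only once in the assembly. The Python sketch below illustrates that general idea under this assumption; it is not Fast-SG's actual implementation, the function names are hypothetical, and reverse-complement handling is omitted for brevity.

```python
from collections import Counter


def index_unique_kmers(contigs, k):
    """Map each k-mer that occurs exactly once across all contigs to (contig, pos)."""
    hits = {}
    for name, seq in contigs.items():
        for pos in range(len(seq) - k + 1):
            hits.setdefault(seq[pos:pos + k], []).append((name, pos))
    return {kmer: locs[0] for kmer, locs in hits.items() if len(locs) == 1}


def place_read(read, index, k):
    """Return the contig supported by the most unique k-mers of the read, or None."""
    votes = Counter()
    for i in range(len(read) - k + 1):
        loc = index.get(read[i:i + k])
        if loc is not None:
            votes[loc[0]] += 1
    return votes.most_common(1)[0][0] if votes else None


if __name__ == "__main__":
    contigs = {"c1": "ACGTACGGTCA", "c2": "TTGACCAGTAG"}  # toy sequences
    idx = index_unique_kmers(contigs, k=4)
    print(place_read("ACGGTC", idx, k=4), place_read("GACCAG", idx, k=4))
```

In a scaffolding context, a read pair or long read whose two ends land on different contigs would contribute a candidate edge between those contigs in the scaffolding graph.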
Automated batch fiducial-less tilt-series alignment in Appion using Protomo.
Noble, Alex J; Stagg, Scott M
2015-11-01
The field of electron tomography has benefited greatly from manual and semi-automated approaches to marker-based tilt-series alignment that have allowed for the structural determination of multitudes of in situ cellular structures as well as macromolecular structures of individual protein complexes. The emergence of complementary metal-oxide semiconductor detectors capable of detecting individual electrons has enabled the collection of low dose, high contrast images, opening the door for reliable correlation-based tilt-series alignment. Here we present a set of automated, correlation-based tilt-series alignment, contrast transfer function (CTF) correction, and reconstruction workflows for use in conjunction with the Appion/Leginon package that are primarily targeted at automating structure determination with cryogenic electron microscopy. Copyright © 2015 Elsevier Inc. All rights reserved.
Directory of Useful Decoys, Enhanced (DUD-E): Better Ligands and Decoys for Better Benchmarking
2012-01-01
A key metric to assess molecular docking remains ligand enrichment against challenging decoys. Whereas the directory of useful decoys (DUD) has been widely used, clear areas for optimization have emerged. Here we describe an improved benchmarking set that includes more diverse targets such as GPCRs and ion channels, totaling 102 proteins with 22886 clustered ligands drawn from ChEMBL, each with 50 property-matched decoys drawn from ZINC. To ensure chemotype diversity, we cluster each target’s ligands by their Bemis–Murcko atomic frameworks. We add net charge to the matched physicochemical properties and include only the most dissimilar decoys, by topology, from the ligands. An online automated tool (http://decoys.docking.org) generates these improved matched decoys for user-supplied ligands. We test this data set by docking all 102 targets, using the results to improve the balance between ligand desolvation and electrostatics in DOCK 3.6. The complete DUD-E benchmarking set is freely available at http://dude.docking.org. PMID:22716043
A global optimization algorithm for protein surface alignment
2010-01-01
Background: A relevant problem in drug design is the comparison and recognition of protein binding sites. Binding sites recognition is generally based on geometry often combined with physico-chemical properties of the site since the conformation, size and chemical composition of the protein surface are all relevant for the interaction with a specific ligand. Several matching strategies have been designed for the recognition of protein-ligand binding sites and of protein-protein interfaces but the problem cannot be considered solved. Results: In this paper we propose a new method for local structural alignment of protein surfaces based on continuous global optimization techniques. Given the three-dimensional structures of two proteins, the method finds the isometric transformation (rotation plus translation) that best superimposes active regions of two structures. We draw our inspiration from the well-known Iterative Closest Point (ICP) method for three-dimensional (3D) shapes registration. Our main contribution is in the adoption of a controlled random search as a more efficient global optimization approach along with a new dissimilarity measure. The reported computational experience and comparison show viability of the proposed approach. Conclusions: Our method performs well to detect similarity in binding sites when this in fact exists. In the future we plan to do a more comprehensive evaluation of the method by considering large datasets of non-redundant proteins and applying a clustering technique to the results of all comparisons to classify binding sites. PMID:20920230
Light-induced quantitative microprinting of biomolecules
NASA Astrophysics Data System (ADS)
Strale, Pierre-Olivier; Azioune, Ammar; Bugnicourt, Ghislain; Lecomte, Yohan; Chahid, Makhlad; Studer, Vincent
2017-02-01
Printing of biomolecules on substrates has developed tremendously in the past few years. The existing methods rely either on slow serial writing processes or on parallelized photolithographic techniques where cumbersome mask alignment procedures usually impair the ability to generate multi-protein patterns. We recently developed a new technology allowing for high-resolution multi-protein micro-patterning. This technology, named "Light-Induced Molecular Adsorption of Proteins (LIMAP)", is based on a water-soluble photo-initiator able to reverse the antifouling property of polymer brushes when exposed to UV light. We developed a wide-field pattern projection system based on a DMD coupled to a conventional microscope, which can generate arbitrary grayscale patterns of UV light at the micron scale. Interestingly, the density of adsorbed molecules scales with the dose of UV light, thus allowing the quantitative patterning of biomolecules. The very low nonspecific background of biomolecules outside of the UV-exposed areas allows for the sequential printing of multiple proteins without alignment procedures. Protein patterns ranging from 500 nm up to 1 mm can be produced within seconds, as well as gradients of arbitrary shapes. The range of applications of the LIMAP approach extends from the single molecule up to the multicellular scale with an exquisite control over local protein density. We show that it can be used to generate complex protein landscapes useful to study protein-protein, cell-cell and cell-matrix interactions.
Sousa, Filipa L; Parente, Daniel J; Hessman, Jacob A; Chazelle, Allen; Teichmann, Sarah A; Swint-Kruse, Liskin
2016-09-01
The AlloRep database (www.AlloRep.org) (Sousa et al., 2016) [1] compiles extensive sequence, mutagenesis, and structural information for the LacI/GalR family of transcription regulators. Sequence alignments are presented for >3000 proteins in 45 paralog subfamilies and as a subsampled alignment of the whole family. Phenotypic and biochemical data on almost 6000 mutants have been compiled from an exhaustive search of the literature; citations for these data are included herein. These data include information about oligomerization state, stability, DNA binding and allosteric regulation. Protein structural data for 65 proteins are presented as easily-accessible, residue-contact networks. Finally, this article includes example queries to enable the use of the AlloRep database. See the related article, "AlloRep: a repository of sequence, structural and mutagenesis data for the LacI/GalR transcription regulators" (Sousa et al., 2016) [1].
Using naturally occurring polysaccharides to align molecules with nonlinear optical activity
NASA Technical Reports Server (NTRS)
Prasthofer, Thomas
1996-01-01
The Biophysics and Advanced Materials Branch of the Microgravity Science and Applications Division at Marshall Space Flight Center has been investigating polymers with the potential for nonlinear optical (NLO) applications for a number of years. Some of the potential applications for NLO materials include optical communications, computing, and switching. To this point the branch's research has involved polydiacetylenes, phthalocyanins, and other synthetic polymers which have inherent NLO properties. The aim of the present research is to investigate the possibility of using naturally occurring polymers such as polysaccharides or proteins to trap and align small organic molecules with useful NLO properties. Ordering molecules with NLO properties enhances 3rd order nonlinear effects and is required for 2nd order nonlinear effects. Potential advantages of such a system are the flexibility to use different small molecules with varying chemical and optical properties, the stability and cost of the polymers, and the ability to form thin, optically transparent films. Since the quality of any polymer films depends on optimizing ordering and minimizing defects, this work is particularly well suited for microgravity experiments. Polysaccharide and protein polymers form microscopic crystallites which must align to form ordered arrays. The ordered association of crystallites is disrupted by gravity effects and NASA research on protein crystal growth has demonstrated that low gravity conditions can improve crystal quality.
Lei, Lei; Li, Shundai; Gu, Ying
2012-01-01
Cellulose is synthesized at the plasma membrane by protein complexes known as cellulose synthase complexes (CSCs). The cellulose-microtubule alignment hypothesis states that there is a causal link between the orientation of cortical microtubules and orientation of nascent cellulose microfibrils. The mechanism behind the alignment hypothesis is largely unknown. CESA interactive protein 1 (CSI1) interacts with CSCs and potentially links CSCs to the cytoskeleton. CSI1 not only co-localizes with CSCs but also travels bi-directionally at a speed indistinguishable from that of CSCs. The linear trajectories of CSI1-RFP coincide with the underlying microtubules labeled by YFP-TUA5. In the absence of CSI1, both the distribution and the motility of CSCs are defective and the alignment of CSCs and microtubules is disrupted. These observations led to the hypothesis that CSI1 directly mediates the interaction between CSCs and microtubules. In support of this hypothesis, CSI1 binds to microtubules directly, as shown by an in vitro microtubule-binding assay. In addition to a role in serving as a messenger from microtubules to CSCs, CSI1 labels SmaCCs/MASCs, a compartment that has been proposed to be involved in CESA trafficking and/or delivery to the plasma membrane. PMID:22751327