Yachdav, Guy; Wilzbach, Sebastian; Rauscher, Benedikt; Sheridan, Robert; Sillitoe, Ian; Procter, James; Lewis, Suzanna E; Rost, Burkhard; Goldberg, Tatyana
Martin, Andrew C R
Martin, Andrew C. R.
Gille, Christoph; Birgit, Weyand; Gille, Andreas
Troshin, Peter V; Procter, James B; Barton, Geoffrey J
JABAWS is a web services framework that simplifies the deployment of web services for bioinformatics. JABAWS:MSA provides services for five multiple sequence alignment (MSA) methods (Probcons, T-coffee, Muscle, Mafft and ClustalW), and is the system employed by the Jalview multiple sequence analysis workbench since version 2.6. A fully functional, easy to set up server is provided as a Virtual Appliance (VA), which can be run on most operating systems that support a virtualization environment such as VMware or Oracle VirtualBox. JABAWS is also distributed as a Web Application aRchive (WAR) and can be configured to run on a single computer and/or a cluster managed by Grid Engine, LSF or other queuing systems that support DRMAA. JABAWS:MSA provides clients full access to each application's parameters, and allows administrators to specify named parameter preset combinations and execution limits for each application through simple configuration files. The JABAWS command-line client allows integration of JABAWS services into conventional scripts. JABAWS is made freely available under the Apache 2 license and can be obtained from: http://www.compbio.dundee.ac.uk/jabaws.
Yang, Ye; Liu, Juan
We developed JVM (Java Visual Mapping), a program for mapping next-generation sequencing reads to a reference sequence. The program is implemented in Java and is designed to handle millions of short reads generated by the Illumina sequencing technology. It employs a seed index strategy and octal encoding operations for sequence alignment. JVM is useful for DNA-Seq and RNA-Seq when dealing with single-end resequencing. JVM is a desktop application that supports read data sets from 1 MB to 10 GB.
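The seed index strategy mentioned in this abstract can be illustrated with a minimal sketch. It is not JVM's implementation: the function names, the seed length `k`, the mismatch threshold and the gap-free verification step are all assumptions made for illustration, and the octal encoding is not reproduced here.

```python
from collections import defaultdict

def build_seed_index(reference, k):
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def map_read(read, reference, index, k, max_mismatches=1):
    """Return sorted reference positions where the read fits without gaps.

    Every k-mer of the read is used as a seed (hypothetical strategy);
    each seed hit is verified by counting mismatches over the full read.
    """
    hits = set()
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        for pos in index.get(seed, []):
            start = pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue  # the read would hang over the reference ends
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.add(start)
    return sorted(hits)

reference = "ACGTACGTTAGCCGATAGGCTA"
index = build_seed_index(reference, k=4)
print(map_read("TAGCCGAT", reference, index, k=4))
```

Indexing the reference once and looking up seeds keeps the per-read work proportional to the number of seed hits instead of the reference length, which is the usual motivation for seed-based mappers.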
Perry, William L
Bawono, Punto; Dijkstra, Maurits; Pirovano, Walter; Feenstra, Anton; Abeln, Sanne; Heringa, Jaap
The increasing importance of Next Generation Sequencing (NGS) techniques has highlighted the key role of multiple sequence alignment (MSA) in comparative structure and function analysis of biological sequences. MSA often leads to fundamental biological insight into sequence-structure-function relationships of nucleotide or protein sequence families. Significant advances have been achieved in this field, and many useful tools have been developed for constructing alignments, although many biological and methodological issues are still open. This chapter first provides some background information and considerations associated with MSA techniques, concentrating on the alignment of protein sequences. Then, a practical overview of currently available methods and a description of their specific advantages and limitations are given, to serve as a helpful guide or starting point for researchers who aim to construct a reliable MSA.
Jeff Daily, PNNL
Vector extensions, such as SSE, have been part of the x86 CPU since the 1990s, with applications in graphics, signal processing, and scientific applications. Although many algorithms and applications can naturally benefit from automatic vectorization techniques, there are still many that are difficult to vectorize due to their dependence on irregular data structures, dense branch operations, or data dependencies. Sequence alignment, one of the most widely used operations in bioinformatics workflows, has a computational footprint that features complex data dependencies. The trend of widening vector registers adversely affects the state-of-the-art sequence alignment algorithm based on striped data layouts. Therefore, a novel SIMD implementation of a parallel scan-based sequence alignment algorithm that can better exploit wider SIMD units was implemented as part of the Parallel Sequence Alignment Library (parasail). Parasail features: Reference implementations of all known vectorized sequence alignment approaches. Implementations of Smith Waterman (SW), semi-global (SG), and Needleman Wunsch (NW) sequence alignment algorithms. Implementations across all modern CPU instruction sets including AVX2 and KNC. Language interfaces for C/C++ and Python.
Gille, Christoph; Fähling, Michael; Weyand, Birgit; Wieland, Thomas; Gille, Andreas
Alignment-Annotator is a novel web service designed to generate interactive views of annotated nucleotide and amino acid sequence alignments (i) de novo and (ii) embedded in other software. All computations are performed at server side. Interactivity is implemented in HTML5, a language native to web browsers. The alignment is initially displayed using default settings and can be modified with the graphical user interfaces. For example, individual sequences can be reordered or deleted using drag and drop, amino acid color code schemes can be applied and annotations can be added. Annotations can be made manually or imported (BioDAS servers, the UniProt, the Catalytic Site Atlas and the PDB). Some edits take immediate effect while others require server interaction and may take a few seconds to execute. The final alignment document can be downloaded as a zip-archive containing the HTML files. Because of the use of HTML the resulting interactive alignment can be viewed on any platform including Windows, Mac OS X, Linux, Android and iOS in any standard web browser. Importantly, neither plugins nor Java are required and therefore Alignment-Annotator represents the first interactive browser-based alignment visualization. http://www.bioinformatics.org/strap/aa/ and http://strap.charite.de/aa/. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
Gille, Christoph; Fähling, Michael; Weyand, Birgit; Wieland, Thomas; Gille, Andreas
Alignment-Annotator is a novel web service designed to generate interactive views of annotated nucleotide and amino acid sequence alignments (i) de novo and (ii) embedded in other software. All computations are performed at server side. Interactivity is implemented in HTML5, a language native to web browsers. The alignment is initially displayed using default settings and can be modified with the graphical user interfaces. For example, individual sequences can be reordered or deleted using drag and drop, amino acid color code schemes can be applied and annotations can be added. Annotations can be made manually or imported (BioDAS servers, the UniProt, the Catalytic Site Atlas and the PDB). Some edits take immediate effect while others require server interaction and may take a few seconds to execute. The final alignment document can be downloaded as a zip-archive containing the HTML files. Because of the use of HTML the resulting interactive alignment can be viewed on any platform including Windows, Mac OS X, Linux, Android and iOS in any standard web browser. Importantly, neither plugins nor Java are required and therefore Alignment-Annotator represents the first interactive browser-based alignment visualization. Availability: http://www.bioinformatics.org/strap/aa/ and http://strap.charite.de/aa/. PMID:24813445
Guo, Tao; Li, Guiyang; Deaton, Russel
Sequence similarity and alignment are among the most important operations in computational biology. However, analyzing large sets of DNA sequences is impractical on a regular PC. Using multiple threads with the JavaParty mechanism, this project successfully extended the capabilities of regular Java to a distributed environment for the simulation of DNA computation. With the aid of JavaParty and a multithreaded design, the results of this study demonstrate that the modified Java program can perform parallel computing without using RMI or socket communication. In this paper, an efficient method for modeling and comparing DNA sequences with dynamic programming and JavaParty is proposed for the first time. Additionally, results of this method in a distributed environment are discussed.
Lu, Yue; Sze, Sing-Hoi
Despite considerable efforts, it remains difficult to obtain accurate multiple sequence alignments. By using additional hits from database search of the input sequences, a few strategies have been proposed to significantly improve alignment accuracy, including the construction of profiles from the hits while performing profile alignment, the inclusion of high scoring hits into the input sequences, the use of intermediate sequence search to link distant homologs, and the use of secondary structure information. We develop an algorithm that integrates these strategies to further improve alignment accuracy by modifying the pair-Hidden Markov Model (HMM) approach in ProbCons to incorporate profiles of intermediate sequences from database search and utilize secondary structure predictions as in SPEM. We test our algorithm on a few sets of benchmark multiple alignments, including BAliBASE, HOMSTRAD, PREFAB, and SABmark, and show that it significantly outperforms MAFFT and ProbCons, which are among the best multiple alignment algorithms that do not utilize additional information, and SPEM, which is among the best multiple alignment algorithms that utilize additional hits from database search. The improvement in accuracy over SPEM can be as much as 5-10% when aligning divergent sequences. A software program that implements this approach (ISPAlign) is available at http://faculty.cs.tamu.edu/shsze/ispalign.
García-Alcalde, Fernando; Okonechnikov, Konstantin; Carbonell, José; Cruz, Luis M; Götz, Stefan; Tarazona, Sonia; Dopazo, Joaquín; Meyer, Thomas F; Conesa, Ana
The sequence alignment/map (SAM) and the binary alignment/map (BAM) formats have become the standard method of representation of nucleotide sequence alignments for next-generation sequencing data. SAM/BAM files usually contain information from tens to hundreds of millions of reads. Often, the sequencing technology, protocol and/or the selected mapping algorithm introduce some unwanted biases in these data. The systematic detection of such biases is a non-trivial task that is crucial to drive appropriate downstream analyses. We have developed Qualimap, a Java application that supports user-friendly quality control of mapping data, by considering sequence features and their genomic properties. Qualimap takes sequence alignment data and provides graphical and statistical analyses for the evaluation of data. Such quality-control data are vital for highlighting problems in the sequencing and/or mapping processes, which must be addressed prior to further analyses. Qualimap is freely available from http://www.qualimap.org.
Algorithm development for comparing and aligning biological sequences has, until recently, been based on the SI model of mutational events, which assumes that modification of sequences proceeds through any of the operations of substitution, insertion or deletion (the latter two collectively termed indels). While this model has worked fairly well, it has long been apparent that other mutational events occur. In this paper, we introduce a new model, the DSI model, which includes another common mutational event, tandem duplication. Tandem duplication produces tandem repeats, which are common in DNA, making up perhaps 10% of the human genome. They are responsible for some human diseases and may serve a multitude of functions in DNA regulation and evolution. Using the DSI model, we develop new exact and heuristic algorithms for comparing and aligning DNA sequences when they contain tandem repeats. 30 refs., 3 figs.
Bawono, Punto; Heringa, Jaap
Profile ALIgNmEnt (PRALINE) is a versatile multiple sequence alignment toolkit. In its main alignment protocol, PRALINE follows the global progressive alignment algorithm. It provides various alignment optimization strategies to address the different situations that call for protein multiple sequence alignment: global profile preprocessing, homology-extended alignment, secondary structure-guided alignment, and transmembrane aware alignment. A number of combinations of these strategies are enabled as well. PRALINE is accessible via the online server http://www.ibi.vu.nl/programs/PRALINEwww/. The server facilitates extensive visualization possibilities aiding the interpretation of alignments generated, which can be written out in pdf format for publication purposes. PRALINE also allows the sequences in the alignment to be represented in a dendrogram to show their mutual relationships according to the alignment. The chapter ends with a discussion of various issues occurring in multiple sequence alignment.
Torarinsson, Elfar; Havgaard, Jakob H; Gorodkin, Jan
An apparent paradox in computational RNA structure prediction is that many methods, in advance, require a multiple alignment of a set of related sequences, when searching for a common structure between them. However, such a multiple alignment is hard to obtain even for few sequences with low sequence similarity without simultaneously folding and aligning them. Furthermore, it is of interest to conduct a multiple alignment of RNA sequence candidates found from searching as few as two genomic sequences. Here, based on the PMcomp program, we present a global multiple alignment program, foldalignM, which performs especially well on few sequences with low sequence similarity, and is comparable in performance with state of the art programs in general. In addition, it can cluster sequences based on sequence and structure similarity and output a multiple alignment for each cluster. Furthermore, preliminary results with local datasets indicate that the program is useful for post processing foldalign pairwise scans. The program foldalignM is implemented in JAVA and is, along with some accompanying PERL scripts, available at http://foldalign.ku.dk/
Wang, Shu; Gutell, Robin R; Miranker, Daniel P
Biclustering is a clustering method that simultaneously clusters both the domain and range of a relation. A challenge in multiple sequence alignment (MSA) is that the alignment of sequences is often intended to reveal groups of conserved functional subsequences. Simultaneously, the grouping of the sequences can impact the alignment; precisely the kind of dual situation biclustering is intended to address. We define a representation of the MSA problem enabling the application of biclustering algorithms. We develop a computer program for local MSA, BlockMSA, that combines biclustering with divide-and-conquer. BlockMSA simultaneously finds groups of similar sequences and locally aligns subsequences within them. Further alignment is accomplished by dividing both the set of sequences and their contents. The net result is both a multiple sequence alignment and a hierarchical clustering of the sequences. BlockMSA was tested on the subsets of the BRAliBase 2.1 benchmark suite that display high variability and on an extension to that suite to larger problem sizes. Also, alignments were evaluated of two large datasets of current biological interest, T box sequences and Group IC1 Introns. The results were compared with alignments computed by the ClustalW, MAFFT, MUSCLE and PROBCONS alignment programs using Sum of Pairs (SPS) and Consensus Count. Results for the benchmark suite are sensitive to problem size. On problems of 15 or greater sequences, BlockMSA is consistently the best. On none of the problems in the test suite are there appreciable differences in scores among BlockMSA, MAFFT and PROBCONS. On the T box sequences, BlockMSA does the most faithful job of reproducing known annotations. MAFFT and PROBCONS do not. On the Intron sequences, BlockMSA, MAFFT and MUSCLE are comparable at identifying conserved regions. BlockMSA is implemented in Java. Source code and supplementary datasets are available at http://aug.csres.utexas.edu/msa/
An algorithm is presented for the multiple alignment of sequences, either proteins or nucleic acids, that is both accurate and easy to use on microcomputers. The approach is based on the conventional dynamic-programming method of pairwise alignment. Initially, a hierarchical clustering of the sequences is performed using the matrix of the pairwise alignment scores. The closest sequences are aligned creating groups of aligned sequences. Then close groups are aligned until all sequences are aligned in one group. The pairwise alignments included in the multiple alignment form a new matrix that is used to produce a hierarchical clustering. If it is different from the first one, iteration of the process can be performed. The method is illustrated by an example: a global alignment of 39 sequences of cytochrome c. PMID:2849754
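The first two steps described in this abstract (pairwise dynamic-programming scores, then a hierarchical clustering that fixes the order in which sequences and groups are aligned) can be sketched as follows. This is an illustration only: the scoring values, the average-linkage rule and the function names are assumptions, since the abstract does not specify them, and the group-to-group alignment and the iteration step are omitted.

```python
from itertools import combinations

def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score by dynamic programming (linear gap cost)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def merge_order(seqs):
    """Cluster sequences on their pairwise scores and return the merge order.

    Average linkage is an assumed choice; each merge is a pair of clusters,
    and each cluster is a tuple of sequence indices.
    """
    score = {frozenset(pair): nw_score(seqs[pair[0]], seqs[pair[1]])
             for pair in combinations(range(len(seqs)), 2)}

    def linkage(c1, c2):
        total = sum(score[frozenset((i, j))] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    clusters = [(i,) for i in range(len(seqs))]
    merges = []
    while len(clusters) > 1:
        # Join the two closest (highest-scoring) groups first.
        c1, c2 = max(combinations(clusters, 2), key=lambda p: linkage(*p))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(tuple(sorted(c1 + c2)))
        merges.append((c1, c2))
    return merges

print(merge_order(["ACGT", "ACGA", "TTTT"]))
```

In a full implementation each merge would be carried out as an alignment of the two groups, and the scores implied by the resulting multiple alignment would feed a second clustering round, as the abstract describes.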
Kim, J.; Pramanik, S.
Multiple sequence alignment has been a useful method in the study of molecular evolution and sequence-structure relationships. This paper presents a new method for multiple sequence alignment based on the simulated annealing technique. Dynamic programming has been widely used to find an optimal alignment. However, dynamic programming has several limitations in obtaining an optimal alignment. It requires long computation time and cannot be applied with certain types of cost functions. We describe the detailed mechanisms of simulated annealing for the multiple sequence alignment problem. It is shown that simulated annealing can be an effective approach to overcome the limitations of dynamic programming in the multiple sequence alignment problem.
Boyce, Kieran; Sievers, Fabian; Higgins, Desmond G
Progressive alignment is the standard approach used to align large numbers of sequences. As with all heuristics, this involves a tradeoff between alignment accuracy and computation time. We examine this tradeoff and find that, because of a loss of information in the early steps of the approach, the alignments generated by the most common multiple sequence alignment programs are inherently unstable, and simply reversing the order of the sequences in the input file will cause a different alignment to be generated. Although this effect is more obvious with larger numbers of sequences, it can also be seen with data sets in the order of one hundred sequences. We also outline the means to determine the number of sequences in a data set beyond which the probability of instability will become more pronounced. This has major ramifications for both the designers of large-scale multiple sequence alignment algorithms, and for the users of these alignments.
Chakraborty, Angana; Bandyopadhyay, Sanghamitra
In this article we propose a Fast Optimal Global Sequence Alignment Algorithm, FOGSAA, which aligns a pair of nucleotide/protein sequences faster than any optimal global alignment method including the widely used Needleman-Wunsch (NW) algorithm. FOGSAA is applicable for all types of sequences, with any scoring scheme, and with or without affine gap penalty. Compared to NW, FOGSAA achieves a time gain of (70-90)% for highly similar nucleotide sequences (> 80% similarity), and (54-70)% for sequences having (30-80)% similarity. For other sequences, it terminates with an approximate score. For protein sequences, the average time gain is between (25-40)%. Compared to three heuristic global alignment methods, the quality of alignment is improved by about 23%-53%. FOGSAA is, in general, suitable for aligning any two sequences defined over a finite alphabet set, where the quality of the global alignment is of supreme importance.
Lyras, Dimitrios P; Metzler, Dirk
Obtaining an accurate sequence alignment is fundamental for consistently analyzing biological data. Although this problem may be efficiently solved when only two sequences are considered, the exact inference of the optimal alignment easily becomes computationally intractable for the multiple sequence alignment case. To cope with the high computational expenses, approximate heuristic methods have been proposed that address the problem indirectly by progressively aligning the sequences in pairs according to their relatedness. These methods, however, are not flexible enough to change the alignment of an already aligned group of sequences in view of new data, which compromises the quality of the resulting alignment. In this paper we present ReformAlign, a novel meta-alignment approach that may significantly improve on the quality of the alignments derived from popular aligners. We call ReformAlign a meta-aligner as it requires an initial alignment, for which a variety of alignment programs can be used. The main idea behind ReformAlign is quite straightforward: at first, an existing alignment is used to construct a standard profile which summarizes the initial alignment and then all sequences are individually re-aligned against the formed profile. From each sequence-profile comparison, the alignment of each sequence against the profile is recorded and the final alignment is indirectly inferred by merging all the individual sub-alignments into a unified set. The employment of ReformAlign may often result in alignments which are significantly more accurate than the starting alignments. We evaluated the effect of ReformAlign on the generated alignments from ten leading alignment methods using real data of variable size and sequence identity. The experimental results suggest that the proposed meta-aligner approach may often lead to statistically significantly more accurate alignments. Furthermore, we show that ReformAlign results in more substantial improvement in
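The profile-then-realign idea described in this abstract can be sketched in a few lines. This is a simplified illustration, not ReformAlign's algorithm: the frequency profile, the column score `2 * frequency - 1`, the linear gap cost and the function names are all assumptions, and the final merging of sub-alignments is left out.

```python
from collections import Counter

def build_profile(alignment):
    """Per-column residue frequencies of an existing alignment.

    Gap characters are not counted as residues but still contribute to the
    column depth, so gappy columns get lower frequencies.
    """
    depth = len(alignment)
    profile = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c != "-")
        profile.append({res: n / depth for res, n in counts.items()})
    return profile

def align_to_profile(seq, profile, gap=-1.0):
    """Globally align one sequence to the profile; return its gapped row.

    Residues with no matching profile column are shown in lower case.
    """
    n, m = len(seq), len(profile)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def col_score(i, j):
        # Assumed scoring: +1 for a fully conserved match, -1 for an unseen residue.
        return 2.0 * profile[j - 1].get(seq[i - 1], 0.0) - 1.0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + col_score(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    row, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + col_score(i, j):
            row.append(seq[i - 1])
            i, j = i - 1, j - 1
        elif i == 0 or (j > 0 and score[i][j] == score[i][j - 1] + gap):
            row.append("-")  # profile column left empty for this sequence
            j -= 1
        else:
            row.append(seq[i - 1].lower())  # insertion relative to the profile
            i -= 1
    return "".join(reversed(row))

initial = ["ACGT", "ACGT", "A-GT"]
profile = build_profile(initial)
print(align_to_profile("AGT", profile))
```

Re-aligning every sequence against the same profile lets each one be placed using information from the whole initial alignment, which is the flexibility the abstract says progressive methods lack.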
Background The generation of multiple sequence alignments (MSAs) is a crucial step for many bioinformatic analyses. Thus improving MSA accuracy and identifying potential errors in MSAs is important for a wide range of post-genomic research. We present a novel method called MergeAlign which constructs consensus MSAs from multiple independent MSAs and assigns an alignment precision score to each column. Results Using conventional benchmark tests we demonstrate that on average MergeAlign MSAs are more accurate than MSAs generated using any single matrix of sequence substitution. We show that MergeAlign column scores are related to alignment precision and hence provide an ab initio method of estimating alignment precision in the absence of curated reference MSAs. Using two novel and independent alignment performance tests that utilise a large set of orthologous gene families we demonstrate that increasing MSA performance leads to an increase in the performance of downstream phylogenetic analyses. Conclusion Using multiple tests of alignment performance we demonstrate that this novel method has broad general application in biological research. PMID:22646090
Kruspe, Matthias; Stadler, Peter F
Background The quality of progressive sequence alignments strongly depends on the accuracy of the individual pairwise alignment steps since gaps that are introduced at one step cannot be removed at later aggregation steps. Adjacent insertions and deletions necessarily appear in arbitrary order in pairwise alignments and hence form an unavoidable source of errors. Research Here we present a modified variant of progressive sequence alignments that addresses both issues. Instead of pairwise alignments we use exact dynamic programming to align sequence or profile triples. This avoids a large fraction of the ambiguities arising in pairwise alignments. In the subsequent aggregation steps we follow the logic of the Neighbor-Net algorithm, which constructs a phylogenetic network by step-wisely replacing triples by pairs instead of combining pairs to singletons. To this end the three-way alignments are subdivided into two partial alignments, at which stage all-gap columns are naturally removed. This alleviates the "once a gap, always a gap" problem of progressive alignment procedures. Conclusion The three-way Neighbor-Net based alignment program aln3nn is shown to compare favorably on both protein sequences and nucleic acid sequences to other progressive alignment tools. In the latter case one can easily include scoring terms that consider secondary structure features. Overall, the quality of resulting alignments in general exceeds that of clustalw or other multiple alignment tools even though our software does not include heuristics for context dependent (mis)match scores. PMID:17631683
Background While the pairwise alignments produced by sequence similarity searches are a powerful tool for identifying homologous proteins (proteins that share a common ancestor and a similar structure), pairwise sequence alignments often fail to represent accurately the structural alignments inferred from three-dimensional coordinates. Since sequence alignment algorithms produce optimal alignments, the best structural alignments must reflect suboptimal sequence alignment scores. Thus, we have examined a range of suboptimal sequence alignments and a range of scoring parameters to understand better which sequence alignments are likely to be more structurally accurate. Results We compared near-optimal protein sequence alignments produced by the Zuker algorithm and a set of probabilistic alignments produced by the probA program with structural alignments produced by four different structure alignment algorithms. There is significant overlap between the solution spaces of structural alignments and both the near-optimal sequence alignments produced by commonly used scoring parameters for sequences that share significant sequence similarity (E-values < 10^-5) and the ensemble of probA alignments. We constructed a logistic regression model incorporating three input variables derived from sets of near-optimal alignments: robustness, edge frequency, and maximum bits-per-position. A ROC analysis shows that this model more accurately classifies amino acid pairs (edges in the alignment path graph) according to the likelihood of appearance in structural alignments than the robustness score alone. We investigated various trimming protocols for removing incorrect edges from the optimal sequence alignment; the most effective protocol is to remove matches from the semi-global optimal alignment that are outside the boundaries of the local alignment, although trimming according to the model-generated probabilities achieves a similar level of improvement. The model can also be used to
Ayad, Lorraine A K; Pissis, Solon P
A fundamental assumption of all widely-used multiple sequence alignment techniques is that the left- and right-most positions of the input sequences are relevant to the alignment. However, the position where a sequence starts or ends can be totally arbitrary due to a number of reasons: arbitrariness in the linearisation (sequencing) of a circular molecular structure; or inconsistencies introduced into sequence databases due to different linearisation standards. These scenarios are relevant, for instance, in the process of multiple sequence alignment of mitochondrial DNA, viroid, viral or other genomes, which have a circular molecular structure. A solution for these inconsistencies would be to identify a suitable rotation (cyclic shift) for each sequence; these refined sequences may in turn lead to improved multiple sequence alignments using the preferred multiple sequence alignment program. We present MARS, a new heuristic method for improving Multiple circular sequence Alignment using Refined Sequences. MARS was implemented in the C++ programming language as a program to compute the rotations (cyclic shifts) required to best align a set of input sequences. Experimental results, using real and synthetic data, show that MARS improves the alignments, with respect to standard genetic measures and the inferred maximum-likelihood-based phylogenies, and outperforms state-of-the-art methods both in terms of accuracy and efficiency. Our results show, among others, that the average pairwise distance in the multiple sequence alignment of a dataset of widely-studied mitochondrial DNA sequences is reduced by around 5% when MARS is applied before a multiple sequence alignment is performed. Analysing multiple sequences simultaneously is fundamental in biological research and multiple sequence alignment has been found to be a popular method for this task. Conventional alignment techniques cannot be used effectively when the position where sequences start is arbitrary. We present
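The notion of a suitable rotation (cyclic shift) used in this abstract can be made concrete with a brute-force sketch. It is not the MARS heuristic: it assumes equal-length sequences, uses plain Hamming distance to a chosen reference as the criterion, and the function names are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def best_rotation(seq, reference):
    """Return (shift, rotated_seq) minimising Hamming distance to reference.

    Tries every cyclic shift, so the cost is quadratic in the sequence length.
    Ties are broken in favour of the smallest shift.
    """
    if len(seq) != len(reference) or not seq:
        raise ValueError("sequences must be non-empty and of equal length")
    shift = min(range(len(seq)),
                key=lambda i: hamming(seq[i:] + seq[:i], reference))
    return shift, seq[shift:] + seq[:shift]

reference = "ACGTTGCA"
linearised_elsewhere = "TGCAACGT"  # same circular molecule, different start
print(best_rotation(linearised_elsewhere, reference))
```

Once each sequence has been rotated to a common starting point, the refined sequences can be passed to any conventional multiple sequence alignment program, which is the workflow the abstract describes.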
Khafizov, Kamil; Forrest, Lucy R.
Few sequence alignment methods have been designed specifically for integral membrane proteins, even though these important proteins have distinct evolutionary and structural properties that might affect their alignments. Existing approaches typically consider membrane-related information either by using membrane-specific substitution matrices or by assigning distinct penalties for gap creation in transmembrane and non-transmembrane regions. Here, we ask whether favoring matching of predicted transmembrane segments within a standard dynamic programming algorithm can improve the accuracy of pairwise membrane protein sequence alignments. We tested various strategies using a specifically designed program called AlignMe. An updated set of homologous membrane protein structures, called HOMEP2, was used as a reference for optimizing the gap penalties. The best of the membrane-protein optimized approaches were then tested on an independent reference set of membrane protein sequence alignments from the BAliBASE collection. When secondary structure (S) matching was combined with evolutionary information (using a position-specific substitution matrix (P)), in an approach we called AlignMePS, the resultant pairwise alignments were typically among the most accurate over a broad range of sequence similarities when compared to available methods. Matching transmembrane predictions (T), in addition to evolutionary information, and secondary-structure predictions, in an approach called AlignMePST, generally reduces the accuracy of the alignments of closely-related proteins in the BAliBASE set relative to AlignMePS, but may be useful in cases of extremely distantly related proteins for which sequence information is less informative. The open source AlignMe code is available at https://sourceforge.net/projects/alignme/, and at http://www.forrestlab.org, along with an online server and the HOMEP2 data set. PMID:23469223
Waldispühl, Jérôme; O'Donnell, Charles W.; Will, Sebastian; Devadas, Srinivas; Backofen, Rolf
Accurate comparative analysis tools for low-homology proteins remain a difficult challenge in computational biology, especially sequence alignment and consensus folding problems. We present partiFold-Align, the first algorithm for simultaneous alignment and consensus folding of unaligned protein sequences; the algorithm's complexity is polynomial in time and space. Algorithmically, partiFold-Align exploits sparsity in the set of super-secondary structure pairings and alignment candidates to achieve an effectively cubic running time for simultaneous pairwise alignment and folding. We demonstrate the efficacy of these techniques on transmembrane β-barrel proteins, an important yet difficult class of proteins with few known three-dimensional structures. Testing against structurally derived sequence alignments, partiFold-Align significantly outperforms state-of-the-art pairwise and multiple sequence alignment tools in the most difficult low-sequence homology case. It also improves secondary structure prediction where current approaches fail. Importantly, partiFold-Align requires no prior training. These general techniques are widely applicable to many more protein families (partiFold-Align is available at http://partifold.csail.mit.edu/). PMID:24766258
Cooper, Gregory M; Singaravelu, Senthil A G; Sidow, Arend
Alignment and comparison of related genome sequences is a powerful method to identify regions likely to contain functional elements. Such analyses are data intensive, requiring the inclusion of genomic multiple sequence alignments, sequence annotations, and scores describing regional attributes of columns in the alignment. Visualization and browsing of results can be difficult, and there are currently limited software options for performing this task. The Application for Browsing Constraints (ABC) is interactive Java software for intuitive and efficient exploration of multiple sequence alignments and data typically associated with alignments. It is used to move quickly from a summary view of the entire alignment via arbitrary levels of resolution to individual alignment columns. It allows for the simultaneous display of quantitative data (e.g., sequence similarity or evolutionary rates) and annotation data (e.g., the locations of genes, repeats, and constrained elements). It can be used to facilitate basic comparative sequence tasks, such as export of data in plain-text formats, visualization of phylogenetic trees, and generation of alignment summary graphics. The ABC is a lightweight, stand-alone, and flexible graphical user interface for browsing genomic multiple sequence alignments of specific loci, up to hundreds of kilobases or a few megabases in length. It is coded in Java for cross-platform use and the program and source code are freely available under the General Public License. Documentation and a sample data set are also available at http://mendel.stanford.edu/sidowlab/downloads.html.
Vishnepolsky, Boris; Pirtskhalava, Malak
The presented program, ALIGN_MTX, aligns two textual sequences, allowing each sequence element to be designated by any number of characters and allowing arbitrary user-defined substitution matrices. It can be used not only for aligning amino acid and nucleotide sequences but also for the sequence-structure alignment used in threading, for amino acid sequence alignment using a previously computed PSSM, and in other cases where biological or non-biological textual sequences must be aligned. This distinguishes it from the majority of similar alignment programs, which as a rule align only amino acid or nucleotide sequences represented as strings of single alphabetic characters. ALIGN_MTX is provided as a downloadable zip archive at http://www.imbbp.org/software/ALIGN_MTX/ and is free to use. As an application of the program, different types of substitution matrix were compared for alignment quality on sets of distantly related protein pairs. Alignment accuracy was tested with the threading matrix SORDIS (based on side-chain orientation relative to hydrophobic core centers), the evolutionary change-based substitution matrix BLOSUM, and position-specific score matrices (PSSM) derived from multiple sequence alignment information. The PSSM performed best, but on the reduced set with lower sequence similarity the threading matrix SORDIS performed equally well, and a combined potential using SORDIS and PSSM was shown to improve alignment quality for evolutionarily distantly related protein pairs.
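The idea of aligning sequences whose elements are multi-character tokens, scored by a user-supplied matrix, can be illustrated with a minimal sketch. This is not the ALIGN_MTX implementation: the function name `align_tokens`, the linear gap penalty, and the toy matrix are all assumptions made for illustration, and the algorithm shown is plain Needleman-Wunsch global alignment.

```python
def align_tokens(a, b, score, gap=-2):
    """Global (Needleman-Wunsch) alignment of two token lists.

    a, b  : lists of tokens; a token may be any string, e.g. "ALA" or "H1"
    score : dict mapping (token_a, token_b) -> substitution score
    gap   : linear gap penalty (an assumed value, not taken from ALIGN_MTX)
    """
    n, m = len(a), len(b)
    # dp[i][j] = best score for aligning a[:i] with b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score[(a[i - 1], b[j - 1])],  # substitute
                dp[i - 1][j] + gap,                              # gap in b
                dp[i][j - 1] + gap,                              # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + score[(a[i - 1], b[j - 1])]:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return dp[n][m], out_a[::-1], out_b[::-1]


# Toy substitution matrix over three-letter residue tokens (hypothetical values).
tokens = ["ALA", "GLY", "SER"]
toy_matrix = {(x, y): (3 if x == y else -1) for x in tokens for y in tokens}
total, row_a, row_b = align_tokens(["ALA", "GLY", "SER"], ["ALA", "SER"], toy_matrix)
```

Because the matrix is just a dictionary keyed by token pairs, the same routine would accept structural-state labels or any other alphabet, which is the flexibility the abstract emphasizes.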
Hou, Minmei; Riemer, Cathy; Berman, Piotr; Hardison, Ross C.; Miller, Webb
It is difficult to properly align genomic sequences that contain intra-species duplications. With this goal in mind, we have developed a tool, called TOAST (two-way orthologous alignment selection tool), for predicting whether two aligned regions from different species are orthologous, i.e., separated by a speciation event, as opposed to a duplication event. The advantage of restricting alignment to orthologous pairs is that they constitute the aligning regions that are most likely to share the same biological function, and most easily analyzed for evidence of selection. We evaluate TOAST on 12 human/mouse gene clusters.
Campagne, F; Maigret, B
Protein sequence alignments are widely used in protein structure prediction, protein engineering, modeling of proteins, etc. This type of representation is useful at different stages of scientific activity: looking at previous results, working on a research project, and presenting the results. There is a need to make it available through a network (intranet or WWW), in a way that allows biologists, chemists, and noncomputer specialists to look at the data and carry on research, possibly collaboratively. Previous methods (text-based, Java-based) are reported and their advantages are discussed. We have developed two novel approaches to represent the alignments as colored, hyper-linked HTML pages. The first method creates an HTML page that makes efficient use of the image cache mechanism of a WWW browser, so that the user waits for images to be loaded through the network only for the first viewed alignment and can then browse different alignments without further delay. The generated pages can be browsed with any HTML2.0-compliant browser. The second method that we propose uses W3C-CSS1 style sheets to render alignments. This new method generates pages that require recent browsers to be viewed. We implemented these methods in the Viseur program and made a WWW service available that allows a user to convert an MSF alignment file to HTML for WWW publishing. The latter service is available at http://www.lctn.u-nancy.fr/viseur/services.html.
Ogden, T Heath; Rosenberg, Michael S
Phylogenies are often thought to be more dependent upon the specifics of the sequence alignment rather than on the method of reconstruction. Simulation of sequences containing insertion and deletion events was performed in order to determine the role that alignment accuracy plays during phylogenetic inference. Data sets were simulated for pectinate, balanced, and random tree shapes under different conditions (ultrametric equal branch length, ultrametric random branch length, nonultrametric random branch length). Comparisons between hypothesized alignments and true alignments enabled determination of two measures of alignment accuracy, that of the total data set and that of individual branches. In general, our results indicate that as alignment error increases, topological accuracy decreases. This trend was much more pronounced for data sets derived from more pectinate topologies. In contrast, for balanced, ultrametric, equal branch length tree shapes, alignment inaccuracy had little average effect on tree reconstruction. These conclusions are based on average trends of many analyses under different conditions, and any one specific analysis, independent of the alignment accuracy, may recover very accurate or inaccurate topologies. Maximum likelihood and Bayesian, in general, outperformed neighbor joining and maximum parsimony in terms of tree reconstruction accuracy. Results also indicated that as the length of the branch and of the neighboring branches increase, alignment accuracy decreases, and the length of the neighboring branches is the major factor in topological accuracy. Thus, multiple-sequence alignment can be an important factor in downstream effects on topological reconstruction.
Abbasi, Maryam; Paquete, Luís; Pereira, Francisco B
Aligning multiple sequences arises in many tasks in Bioinformatics. However, the alignments produced by the current software packages are highly dependent on the parameters setting, such as the relative importance of opening gaps with respect to the increase of similarity. Choosing only one parameter setting may provide an undesirable bias in further steps of the analysis and give too simplistic interpretations. In this work, we reformulate multiple sequence alignment from a multiobjective point of view. The goal is to generate several sequence alignments that represent a trade-off between maximizing the substitution score and minimizing the number of indels/gaps in the sum-of-pairs score function. This trade-off gives to the practitioner further information about the similarity of the sequences, from which she could analyse and choose the most plausible alignment. We introduce several heuristic approaches, based on local search procedures, that compute a set of sequence alignments, which are representative of the trade-off between the two objectives (substitution score and indels). Several algorithm design options are discussed and analysed, with particular emphasis on the influence of the starting alignment and neighborhood search definitions on the overall performance. A perturbation technique is proposed to improve the local search, which provides a wide range of high-quality alignments. The proposed approach is tested experimentally on a wide range of instances. We performed several experiments with sequences obtained from the benchmark database BAliBASE 3.0. To evaluate the quality of the results, we calculate the hypervolume indicator of the set of score vectors returned by the algorithms. The results obtained allow us to identify reasonably good choices of parameters for our approach. Further, we compared our method in terms of correctly aligned pairs ratio and columns correctly aligned ratio with respect to reference alignments. Experimental results show
Notredame, C; Higgins, D G
We describe a new approach to multiple sequence alignment using genetic algorithms and an associated software package called SAGA. The method involves evolving a population of alignments in a quasi evolutionary manner and gradually improving the fitness of the population as measured by an objective function which measures multiple alignment quality. SAGA uses an automatic scheduling scheme to control the usage of 22 different operators for combining alignments or mutating them between generations. When used to optimise the well known sums of pairs objective function, SAGA performs better than some of the widely used alternative packages. This is seen with respect to the ability to achieve an optimal solution and with regard to the accuracy of alignment by comparison with reference alignments based on sequences of known tertiary structure. The general attraction of the approach is the ability to optimise any objective function that one can invent. PMID:8628686
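The objective function that a genetic algorithm of this kind optimizes can be sketched briefly. The following is a simplified sum-of-pairs score, assuming flat match, mismatch, and gap values; these values and the function name `sum_of_pairs` are illustrative assumptions, and SAGA's actual scoring scheme is not described in the abstract.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Score a multiple alignment by summing pairwise scores in every column.

    alignment : list of equal-length strings, with "-" marking a gap
    The three score values are assumed placeholders, not SAGA's parameters.
    """
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue            # a gap paired with a gap contributes nothing
            elif x == "-" or y == "-":
                total += gap        # residue paired with a gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


aln = ["AC-T", "ACGT", "A-GT"]
fitness = sum_of_pairs(aln)
```

A genetic algorithm would call such a function as the fitness of each candidate alignment; since the optimizer treats it as a black box, any other scoring function could be swapped in, which is the generality the abstract highlights.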
Lunter, Gerton; Miklós, István; Drummond, Alexei; Jensen, Jens Ledet; Hein, Jotun
Two central problems in computational biology are the determination of the alignment and phylogeny of a set of biological sequences. The traditional approach to this problem is to first build a multiple alignment of these sequences, followed by a phylogenetic reconstruction step based on this multiple alignment. However, alignment and phylogenetic inference are fundamentally interdependent, and ignoring this fact leads to biased and overconfident estimations. Whether the main interest be in sequence alignment or phylogeny, a major goal of computational biology is the co-estimation of both. We developed a fully Bayesian Markov chain Monte Carlo method for coestimating phylogeny and sequence alignment, under the Thorne-Kishino-Felsenstein model of substitution and single nucleotide insertion-deletion (indel) events. In our earlier work, we introduced a novel and efficient algorithm, termed the "indel peeling algorithm", which includes indels as phylogenetically informative evolutionary events, and resembles Felsenstein's peeling algorithm for substitutions on a phylogenetic tree. For a fixed alignment, our extension analytically integrates out both substitution and indel events within a proper statistical model, without the need for data augmentation at internal tree nodes, allowing for efficient sampling of tree topologies and edge lengths. To additionally sample multiple alignments, we here introduce an efficient partial Metropolized independence sampler for alignments, and combine these two algorithms into a fully Bayesian co-estimation procedure for the alignment and phylogeny problem. Our approach results in estimates for the posterior distribution of evolutionary rate parameters, for the maximum a-posteriori (MAP) phylogenetic tree, and for the posterior decoding alignment. Estimates for the evolutionary tree and multiple alignment are augmented with confidence estimates for each node height and alignment column. Our results indicate that the patterns in
Rizk, Guillaume; Lavenier, Dominique
Motivation: The rapid development of next-generation sequencing technologies able to produce huge amounts of sequence data is leading to a wide range of new applications. This triggers the need for fast and accurate alignment software. Common techniques often restrict indels in the alignment to improve speed, whereas more flexible aligners are too slow for large-scale applications. Moreover, many current aligners are becoming inefficient as generated reads grow ever larger. Our goal with our new aligner GASSST (Global Alignment Short Sequence Search Tool) is thus 2-fold—achieving high performance with no restrictions on the number of indels with a design that is still effective on long reads. Results: We propose a new efficient filtering step that discards most alignments coming from the seed phase before they are checked by the costly dynamic programming algorithm. We use a carefully designed series of filters of increasing complexity and efficiency to quickly eliminate most candidate alignments in a wide range of configurations. The main filter uses a precomputed table containing the alignment score of short four base words aligned against each other. This table is reused several times by a new algorithm designed to approximate the score of the full dynamic programming algorithm. We compare the performance of GASSST against BWA, BFAST, SSAHA2 and PASS. We found that GASSST achieves high sensitivity in a wide range of configurations and faster overall execution time than other state-of-the-art aligners. Availability: GASSST is distributed under the CeCILL software license at http://www.irisa.fr/symbiose/projects/gassst/ Contact: email@example.com; firstname.lastname@example.org Supplementary information: Supplementary data are available at Bioinformatics online. PMID:20739310
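The precomputed-table idea can be sketched as follows. This is only an illustration under stated assumptions, not the GASSST filter itself: unit-cost edit distance stands in for the alignment score, and the block-by-block sum in `blockwise_cost` is a deliberately crude approximation chosen for clarity. All function and variable names are hypothetical.

```python
from itertools import product


def edit_distance(s, t):
    """Unit-cost edit distance by row-wise dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (cs != ct)))    # match or substitution
        prev = cur
    return prev[-1]


# Precompute the cost of aligning every four-base word against every other
# (256 x 256 = 65,536 entries), so later lookups avoid running the DP again.
WORDS = ["".join(p) for p in product("ACGT", repeat=4)]
TABLE = {(u, v): edit_distance(u, v) for u in WORDS for v in WORDS}


def blockwise_cost(read, window):
    """Approximate the alignment cost by summing table lookups over
    consecutive four-base blocks. Aligning block by block is itself a valid
    alignment, so this sum is never below the true edit distance."""
    assert len(read) == len(window) and len(read) % 4 == 0
    return sum(TABLE[(read[i:i + 4], window[i:i + 4])]
               for i in range(0, len(read), 4))


approx = blockwise_cost("ACGTACGT", "ACGTACGA")
```

The design point the sketch tries to convey is the trade: a one-time cost to fill the table buys constant-time scoring of short words, so many candidate alignments can be screened cheaply before the full dynamic programming step.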
Bucur, Anca; van Leeuwen, Jasper; Dimitrova, Nevenka; Mittal, Chetan
DNA spectrograms express the periodicities of each of the four nucleotides A, T, C, and G in one or several genomic sequences to be analyzed. DNA spectral analysis can be applied to systematically investigate DNA patterns, which may correspond to relevant biological features. As opposed to looking at nucleotide sequences, spectrogram analysis may detect structural characteristics in very long sequences that are not identifiable by sequence alignment. Alignment of DNA spectrograms can be used to facilitate analysis of very long sequences or entire genomes at different resolutions. Standard clustering algorithms have been used in spectral analysis to find strong patterns in spectra. However, as they use a global distance metric, these algorithms can only detect strong patterns coexisting in several frequencies. In this paper, we propose a new method and several algorithms for aligning spectra suitable for efficient spectral analysis and allowing for the easy detection of strong patterns in both single frequencies and multiple frequencies.
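How a per-nucleotide periodicity spectrum can be obtained is sketched below, assuming the common construction in which each nucleotide is turned into a 0/1 indicator sequence and a discrete Fourier transform is taken over sliding windows. The window and step sizes and the function names are assumptions for illustration; the abstract does not specify the authors' parameters or their alignment algorithms, which are not shown here.

```python
import cmath


def power_spectrum(seq, base):
    """Squared DFT magnitude of the 0/1 indicator sequence of `base`.

    Index k of the result corresponds to a period of len(seq) / k positions.
    """
    n = len(seq)
    u = [1.0 if c == base else 0.0 for c in seq]
    return [
        abs(sum(u[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))) ** 2
        for k in range(n)
    ]


def spectrogram(seq, window, step):
    """One spectrum per nucleotide for each sliding window along the sequence."""
    rows = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        rows.append({b: power_spectrum(chunk, b) for b in "ATCG"})
    return rows


# A perfectly 3-periodic toy sequence: "A" occurs at every third position.
seq = "ATG" * 20
rows = spectrogram(seq, window=30, step=15)
```

With a window of 30 positions, a period of 3 shows up at frequency index 10, so `rows[0]["A"][10]` carries the peak for this toy sequence while most other indices are near zero.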
Ranwez, Vincent; Harispe, Sébastien; Delsuc, Frédéric; Douzery, Emmanuel J P
Until now the most efficient solution to align nucleotide sequences containing open reading frames was to use indirect procedures that align amino acid translation before reporting the inferred gap positions at the codon level. There are two important pitfalls with this approach. Firstly, any premature stop codon impedes using such a strategy. Secondly, each sequence is translated with the same reading frame from beginning to end, so that the presence of a single additional nucleotide leads to both aberrant translation and alignment. We present an algorithm that has the same space and time complexity as the classical Needleman-Wunsch algorithm while accommodating sequencing errors and other biological deviations from the coding frame. The resulting pairwise coding sequence alignment method was extended to a multiple sequence alignment (MSA) algorithm implemented in a program called MACSE (Multiple Alignment of Coding SEquences accounting for frameshifts and stop codons). MACSE is the first automatic solution to align protein-coding gene datasets containing non-functional sequences (pseudogenes) without disrupting the underlying codon structure. It has also proved useful in detecting undocumented frameshifts in public database sequences and in aligning next-generation sequencing reads/contigs against a reference coding sequence. MACSE is distributed as an open-source Java executable file with freely available source code and can be used via a web interface at: http://mbb.univ-montp2.fr/macse.
Ren, Jie; Song, Kai; Sun, Fengzhu; Deng, Minghua; Reinert, Gesine
Motivation: Recently, a range of new statistics have become available for the alignment-free comparison of two sequences based on k-tuple word content. Here, we extend these statistics to the simultaneous comparison of more than two sequences. Our suite of statistics contains, first, and , extensions of statistics for pairwise comparison of the joint k-tuple content of all the sequences, and second, , and , averages of sums of pairwise comparison statistics. The two tasks we consider are, first, to identify sequences that are similar to a set of target sequences, and, second, to measure the similarity within a set of sequences. Results: Our investigation uses both simulated data as well as cis-regulatory module data where the task is to identify cis-regulatory modules with similar transcription factor binding sites. We find that although for real data, all of our statistics show a similar performance, on simulated data the Shepp-type statistics are in some instances outperformed by star-type statistics. The multiple alignment-free statistics are more sensitive to contamination in the data than the pairwise average statistics. Availability: Our implementation of the five statistics is available as R package named ‘multiAlignFree’ at http://www-rcf.usc.edu/∼fsun/Programs/multiAlignFree/multiAlignFreemain.html. Contact: email@example.com Supplementary information: Supplementary data are available at Bioinformatics online. PMID:23990418
Haygood, M G
This article describes a set of Microsoft Excel macros designed to color amino acid and nucleotide sequence alignments for review and preparation of visual aids. The colored alignments can then be modified to emphasize features of interest. Procedures for importing and coloring sequences are described. The macro file adds a new menu to the menu bar containing sequence-related commands to enable users unfamiliar with Excel to use the macros more readily. The macros were designed for use with Macintosh computers but will also run with the DOS version of Excel.
Naznin, Farhana; Sarker, Ruhul; Essam, Daryl
In order to design life-saving drugs, such as cancer drugs, the design of protein or DNA structures has to be accurate. These structures depend on multiple sequence alignment (MSA), which is used to find the accurate structure of protein and DNA sequences from existing, approximately correct sequences. To overcome the overly greedy nature of the well-known global progressive alignment method for MSA, we propose two algorithms in this paper: one uses an iterative approach with a progressive alignment method (PAMIM), and the other uses a genetic algorithm with a progressive alignment method (PAMGA). Both methods start from a k-mer distance table to generate a single guide tree. In the iterative approach, we introduce two new techniques: the first generates guide trees from randomly selected sequences, and the second shuffles the sequences inside the tree. The output of the tree is a multiple sequence alignment, which is evaluated by the Sum of Pairs Method (SPM) using the real-valued scores from PAM250. In the GA approach, these two techniques are used to generate an initial population, and two different approaches to genetic operators are implemented for crossover and mutation. To test the performance of our two algorithms, we compared them with the existing well-known methods T-Coffee, MUSCLE, MAFFT and ProbCons, using BAliBASE benchmarks. The experimental results show that the first algorithm works well in some situations where existing methods have difficulty obtaining better solutions. The second method works well compared to the existing methods in all situations and outperforms the first.
Perissinotto, Andrea; Queirós, Sandro; Morais, Pedro; Baptista, Maria J.; Monaghan, Mark; Rodrigues, Nuno F.; D'hooge, Jan; Vilaça, João. L.; Barbosa, Daniel
Given the dynamic nature of cardiac function, correct temporal alignment of pre-operative models and intraoperative images is crucial for augmented reality in cardiac image-guided interventions. As such, the current study focuses on the development of an image-based strategy for temporal alignment of multimodal cardiac imaging sequences, such as cine Magnetic Resonance Imaging (MRI) or 3D Ultrasound (US). First, we derive a robust, modality-independent signal from the image sequences, estimated by computing the normalized cross-correlation between each frame in the temporal sequence and the end-diastolic frame. This signal is a surrogate for the left-ventricle (LV) volume curve over time, whose variation indicates different temporal landmarks of the cardiac cycle. We then perform the temporal alignment of these surrogate signals derived from MRI and US sequences of the same patient through Dynamic Time Warping (DTW), allowing both sequences to be synchronized. The proposed framework was evaluated in 98 patients who had undergone both 3D+t MRI and US scans. The end-systolic frame could be accurately estimated as the minimum of the image-derived surrogate signal, presenting a relative error of 1.6 +/- 1.9% and 4.0 +/- 4.2% for the MRI and US sequences, respectively, thus supporting its association with key temporal instants of the cardiac cycle. The use of DTW reduces the desynchronization of the cardiac events in MRI and US sequences, allowing multimodal cardiac imaging sequences to be temporally aligned. Overall, a generic, fast and accurate method for temporal synchronization of MRI and US sequences of the same patient was introduced. This approach could be straightforwardly used for the correct temporal alignment of pre-operative MRI information and intra-operative US images.
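The Dynamic Time Warping step can be sketched on two one-dimensional signals. This is the textbook DTW recurrence with an absolute-difference local cost, offered as an assumption about the general technique; the abstract does not state which local cost or path constraints the authors used, and the signal values below are invented for illustration.

```python
def dtw(x, y):
    """Classic dynamic time warping between two numeric sequences.

    Returns the accumulated cost and the warping path as a list of
    (index_in_x, index_in_y) pairs.
    """
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of warping x[:i] onto y[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Walk back from the end to recover which frames were matched.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        best = min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
        if best == cost[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif best == cost[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return cost[n][m], path[::-1]


# Invented surrogate signals: the second is sampled at a higher frame rate.
signal_a = [0, 1, 2]
signal_b = [0, 1, 1, 2]
distance, path = dtw(signal_a, signal_b)
```

Each pair in `path` says which frame of the first sequence corresponds to which frame of the second, which is the information needed to display the two modalities in synchrony.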
Roy, Aparna; Taddese, Bruck; Vohra, Shabana; Thimmaraju, Phani K; Illingworth, Christopher J R; Simpson, Lisa M; Mukherjee, Keya; Reynolds, Christopher A; Chintapalli, Sree V
Multiple sequence alignment (MSA) accuracy is important, but there is no widely accepted method of judging the accuracy that different alignment algorithms give. We present a simple approach to detecting two types of error, namely block shifts and the misplacement of residues within a gap. Given a MSA, subsets of very similar sequences are generated through the use of a redundancy filter, typically using a 70-90% sequence identity cut-off. Subsets thus produced are typically small and degenerate, and errors can be easily detected even by manual examination. The errors, albeit minor, are inevitably associated with gaps in the alignment, and so the procedure is particularly relevant to homology modelling of protein loop regions. The usefulness of the approach is illustrated in the context of the universal but little known [K/R]KLH motif that occurs in intracellular loop 1 of G protein coupled receptors (GPCR); other issues relevant to GPCR modelling are also discussed.
Curilem Saldías, Millaray; Villarroel Sassarini, Felipe; Muñoz Poblete, Carlos; Vargas Vásquez, Asticio; Maureira Butler, Iván
The complexity of searches and the volume of genomic data make sequence alignment one of bioinformatics' most active research areas. New alignment approaches have incorporated digital signal processing techniques. Among these, correlation methods are highly sensitive. This paper proposes a novel sequence alignment method based on 2-dimensional images, where each nucleic acid base is represented as a fixed gray intensity pixel. Query and known database sequences are coded to their pixel representation and sequence alignment is handled as an object-recognition-in-a-scene problem. Query and database become object and scene, respectively. An image correlation process is carried out in order to search for the best match between them. Given that this procedure can be implemented in an optical correlator, the correlation could eventually be accomplished at light speed. This paper shows an initial research stage where results were “digitally” obtained by simulating an optical correlation of DNA sequences represented as images. A total of 303 queries (variable lengths from 50 to 4500 base pairs) and 100 scenes represented by 100 x 100 images each (in total, a one million base pair database) were considered for the image correlation analysis. The results showed that correlations reached very high sensitivity (99.01%) and specificity (98.99%), and outperformed BLAST when mutation numbers increased. However, digital correlation processes were a hundred times slower than BLAST. We are currently starting an initiative to evaluate the correlation speed of a real experimental optical correlator. By doing this, we expect to fully exploit the properties of optical correlation. As the optical correlator works jointly with the computer, digital algorithms should also be optimized. The results presented in this paper are encouraging and support the study of image correlation methods for sequence alignment. PMID:22761742
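The correlation idea can be illustrated in one dimension. This sketch makes several simplifying assumptions that are not from the paper: the gray levels in `GRAY` are hypothetical, the "scene" is a single row of pixels instead of a 100 x 100 image, and a normalized correlation coefficient is computed directly at each offset in place of an optical or FFT-based correlator.

```python
from math import sqrt

# Hypothetical gray intensity assigned to each base (0 = black, 255 = white).
GRAY = {"A": 0.0, "C": 85.0, "G": 170.0, "T": 255.0}


def encode(seq):
    """Turn a DNA string into a row of gray-level pixels."""
    return [GRAY[c] for c in seq]


def normalized_correlation(a, b):
    """Pearson-style normalized correlation of two equal-length pixel rows."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    den = sqrt(sum((p - ma) ** 2 for p in a) * sum((q - mb) ** 2 for q in b))
    return num / den if den else 0.0


def best_match(query, scene):
    """Slide the query (object) along the scene and keep the best correlation."""
    q, s = encode(query), encode(scene)
    best_r, best_offset = -2.0, -1
    for offset in range(len(s) - len(q) + 1):
        r = normalized_correlation(q, s[offset:offset + len(q)])
        if r > best_r:
            best_r, best_offset = r, offset
    return best_r, best_offset


r, offset = best_match("CAGTACG", "TTGACCAGTACGGATCA")
```

An exact occurrence of the query gives a correlation of 1 at its offset, and mutated occurrences give values somewhat below 1, which is why a correlation threshold can serve as a similarity detector.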
Parsons, J D; Buehler, E; Hillier, L
DNA sequence chromatograms (traces) are the primary data source for all large-scale genomic and expressed sequence tags (ESTs) sequencing projects. Access to the sequencing trace assists many later analyses, for example contig assembly and polymorphism detection, but obtaining and using traces is problematic. Traces are not collected and published centrally, they are much larger than the base calls derived from them, and viewing them requires the interactivity of a local graphical client with local data. To provide efficient global access to DNA traces, we developed a client/server system based on flexible Java components integrated into other applications including an applet for use in a WWW browser and a stand-alone trace viewer. Client/server interaction is facilitated by CORBA middleware which provides a well-defined interface, a naming service, and location independence. [The software is packaged as a Jar file available from the following URL: http://www.ebi.ac.uk/jparsons. Links to working examples of the trace viewers can be found at http://corba.ebi.ac.uk/EST. All the Washington University mouse EST traces are available for browsing at the same URL.]
Chan, Cheong Xin; Bernard, Guillaume; Poirion, Olivier; Hogan, James M.; Ragan, Mark A.
Alignment-free methods, in which shared properties of sub-sequences (e.g. identity or match length) are extracted and used to compute a distance matrix, have recently been explored for phylogenetic inference. However, the scalability and robustness of these methods to key evolutionary processes remain to be investigated. Here, using simulated sequence sets of various sizes in both nucleotides and amino acids, we systematically assess the accuracy of phylogenetic inference using an alignment-free approach, based on D2 statistics, under different evolutionary scenarios. We find that compared to a multiple sequence alignment approach, D2 methods are more robust against among-site rate heterogeneity, compositional biases, genetic rearrangements and insertions/deletions, but are more sensitive to recent sequence divergence and sequence truncation. Across diverse empirical datasets, the alignment-free methods perform well for sequences sharing low divergence, at greater computation speed. Our findings provide strong evidence for the scalability and the potential use of alignment-free methods in large-scale phylogenomics. PMID:25266120
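The basic D2 statistic underlying this family of methods is simple to state: it is the sum, over all words of length k, of the product of that word's counts in the two sequences. The sketch below computes only this raw statistic. The study's specific D2 variants and the way scores were converted into a distance matrix are not given in the abstract and are not reproduced here; the function names are illustrative.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping word of length k in the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(x, y, k):
    """Raw D2 statistic: sum over shared k-mers of count_x * count_y."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    return sum(cx[w] * cy[w] for w in cx if w in cy)


score = d2("ACGTACGT", "ACGTTT", 2)
```

No alignment is computed at any point; the cost is linear in sequence length for counting, which is the source of the speed advantage mentioned in the abstract. A larger score indicates more shared word content, so a normalization step would be needed before using it as a distance.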
Chiang, Jason; Studniberg, Michael; Shaw, Jack; Seto, Shaw; Truong, Kevin
To infer homology and subsequently gene function, the Smith-Waterman algorithm is used to find the optimal local alignment between two sequences. When searching sequence databases that may contain billions of sequences, this algorithm becomes computationally expensive. Consequently, in this paper, we focused on accelerating the Smith-Waterman algorithm by modifying the computationally repeated portion of the algorithm by FPGA hardware custom instructions. These simple modifications accelerated the algorithm runtime by an average of 287% compared to the pure software implementation. Therefore, further design of FPGA accelerated hardware offers a promising direction to seeking runtime improvement of genomic database searching.
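For reference, the Smith-Waterman recurrence in plain software can be sketched as follows. The scoring values are assumed placeholders with a linear gap penalty, and the function returns only the best local score. The inner cell update is the kind of heavily repeated computation that hardware acceleration would target, although the abstract does not say exactly which portion the authors moved to custom instructions.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b.

    Uses two rolling rows, so memory is proportional to len(b).
    Score values are illustrative assumptions.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            # Cell update: executed len(a) * len(b) times per comparison.
            s = max(
                0,                                               # start a new local alignment
                prev[j - 1] + (match if ca == cb else mismatch), # extend diagonally
                prev[j] + gap,                                   # gap in b
                cur[j - 1] + gap,                                # gap in a
            )
            cur.append(s)
            best = max(best, s)
        prev = cur
    return best


score = smith_waterman("TTACGTTT", "GGACGTCC")
```

Every cell depends only on its three neighbors, so the work per cell is a handful of additions and comparisons repeated across the whole matrix; against a database of many sequences this quadratic loop dominates the runtime.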
Jeon, Yoon-Seong; Lee, Kihyun; Park, Sang-Cheol; Kim, Bong-Soo; Cho, Yong-Joon; Ha, Sung-Min; Chun, Jongsik
EzEditor is a Java-based molecular sequence editor allowing manipulation of both DNA and protein sequence alignments for phylogenetic analysis. It has multiple features optimized to connect initial computer-generated multiple alignment and subsequent phylogenetic analysis by providing manual editing with reference to biological information specific to the genes under consideration. It provides various functionalities for editing rRNA alignments using secondary structure information. In addition, it supports simultaneous editing of both DNA sequences and their translated protein sequences for protein-coding genes. EzEditor is, to our knowledge, the first sequence editing software designed for both rRNA- and protein-coding genes with the visualization of biologically relevant information and should be useful in molecular phylogenetic studies. EzEditor is based on Java, can be run on all major computer operating systems and is freely available from http://sw.ezbiocloud.net/ezeditor/.
Kucherov, Gregory; Pinhas, Tamar; Ziv-Ukelson, Michal
Imposing constraints in the form of a finite automaton or a regular expression is an effective way to incorporate additional a priori knowledge into sequence alignment procedures. With this motivation, Arslan introduced the Regular Language Constrained Sequence Alignment Problem and proposed an O(n^2 t^4) time and O(n^2 t^2) space algorithm for solving it, where n is the length of the input strings and t is the number of states in the non-deterministic automaton, which is given as input. Chung et al. proposed a faster O(n^2 t^3) time algorithm for the same problem. In this paper, we further speed up the algorithms for Regular Language Constrained Sequence Alignment by reducing their worst case time complexity bound to O(n^2 t^3 / log t). This is done by establishing an optimal bound on the size of Straight-Line Programs solving the maxima computation subproblem of the basic dynamic programming algorithm. We also study another solution based on a Steiner Tree computation. While it does not improve the run time complexity in the worst case, our simulations show that both approaches are efficient in practice, especially when the input automata are dense.
Borozan, Ivan; Watt, Stuart; Ferretti, Vincent
Alignment-based sequence similarity searches, while accurate for some types of sequences, can produce incorrect results when used on more divergent but functionally related sequences that have undergone the sequence rearrangements observed in many bacterial and viral genomes. Here, we propose a classification model that exploits the complementary nature of alignment-based and alignment-free similarity measures with the aim to improve the accuracy with which DNA and protein sequences are characterized. Our model classifies sequences using a combined sequence similarity score calculated by adaptively weighting the contribution of different sequence similarity measures. Weights are determined independently for each sequence in the test set and reflect the discriminatory ability of individual similarity measures in the training set. Because the similarity between some sequences is determined more accurately with one type of measure rather than another, our classifier allows different sets of weights to be associated with different sequences. Using five different similarity measures, we show that our model significantly improves the classification accuracy over the current composition- and alignment-based models, when predicting the taxonomic lineage for both short viral sequence fragments and complete viral sequences. We also show that our model can be used effectively for the classification of reads from a real metagenome dataset as well as protein sequences. All the datasets and the code used in this study are freely available at https://collaborators.oicr.on.ca/vferretti/borozan_csss/csss.html. Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press.
Borozan, Ivan; Watt, Stuart; Ferretti, Vincent
Motivation: Alignment-based sequence similarity searches, while accurate for some types of sequences, can produce incorrect results when used on more divergent but functionally related sequences that have undergone the sequence rearrangements observed in many bacterial and viral genomes. Here, we propose a classification model that exploits the complementary nature of alignment-based and alignment-free similarity measures with the aim to improve the accuracy with which DNA and protein sequences are characterized. Results: Our model classifies sequences using a combined sequence similarity score calculated by adaptively weighting the contribution of different sequence similarity measures. Weights are determined independently for each sequence in the test set and reflect the discriminatory ability of individual similarity measures in the training set. Because the similarity between some sequences is determined more accurately with one type of measure rather than another, our classifier allows different sets of weights to be associated with different sequences. Using five different similarity measures, we show that our model significantly improves the classification accuracy over the current composition- and alignment-based models, when predicting the taxonomic lineage for both short viral sequence fragments and complete viral sequences. We also show that our model can be used effectively for the classification of reads from a real metagenome dataset as well as protein sequences. Availability and implementation: All the datasets and the code used in this study are freely available at https://collaborators.oicr.on.ca/vferretti/borozan_csss/csss.html. Supplementary information: Supplementary data are available at Bioinformatics online. PMID:25573913
Grice, J A; Hughey, R; Speck, D
Sequence comparison with affine gap costs is a problem that is readily parallelizable on simple single-instruction, multiple-data stream (SIMD) parallel processors using only constant space per processing element. Unfortunately, the twin problem of sequence alignment, finding the optimal character-by-character correspondence between two sequences, is more complicated. While the innovative O(n^2)-time and O(n)-space serial algorithm has been parallelized for multiple-instruction, multiple-data stream (MIMD) computers with only a communication-time slowdown, typically O(log n), it is not suitable for hardware-efficient SIMD parallel processors with only local communication. This paper proposes several methods of computing sequence alignments with limited memory per processing element. The algorithms are also well-suited to serial implementation. The simpler algorithms feature, for an arbitrary integer L, a factor of L slowdown in exchange for reducing space requirements from O(n) to O(L square root of n) per processing element. Using this result, we describe an O(n log n) parallel time algorithm that requires O(log n) space per processing element on O(n) SIMD processing elements with only a mesh or linear interconnection network.
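To make the distinction between sequence comparison (score only) and sequence alignment concrete, the sketch below computes a global alignment score with affine gap costs in serial code while storing only one row per state, i.e. O(n) space. It illustrates the comparison problem only; it does not recover the alignment and does not reproduce the reduced-space alignment methods proposed in the paper. The scoring parameters and the convention that a gap of length k costs `gap_open + k * gap_extend` are assumptions made for this example.

```python
NEG = float("-inf")

def affine_global_score(a, b, match=1, mismatch=-1, gap_open=-2, gap_extend=-1):
    """Global alignment score with affine gaps, keeping one row per state."""
    n = len(b)
    # M: last column aligns two residues; X: residue of a against a gap;
    # Y: residue of b against a gap.
    M = [0] + [NEG] * n
    X = [NEG] * (n + 1)
    Y = [NEG] + [gap_open + gap_extend * j for j in range(1, n + 1)]
    for i in range(1, len(a) + 1):
        new_m = [NEG] * (n + 1)
        new_x = [gap_open + gap_extend * i] + [NEG] * n
        new_y = [NEG] * (n + 1)
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            new_m[j] = max(M[j - 1], X[j - 1], Y[j - 1]) + s
            new_x[j] = max(M[j] + gap_open, Y[j] + gap_open, X[j]) + gap_extend
            new_y[j] = max(new_m[j - 1] + gap_open, new_x[j - 1] + gap_open,
                           new_y[j - 1]) + gap_extend
        M, X, Y = new_m, new_x, new_y
    return max(M[n], X[n], Y[n])
```

Recovering the alignment itself in limited space requires extra machinery, such as recomputation from stored checkpoints, which is the trade-off between time and space that the abstract describes.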
Greene, Eric C.
Homologous recombination allows for the regulated exchange of genetic information between two different DNA molecules of identical or nearly identical sequence composition, and is a major pathway for the repair of double-stranded DNA breaks. A key facet of homologous recombination is the ability of recombination proteins to perfectly align the damaged DNA with homologous sequence located elsewhere in the genome. This reaction is referred to as the homology search and is akin to the target searches conducted by many different DNA-binding proteins. Here I briefly highlight early investigations into the homology search mechanism, and then describe more recent research. Based on these studies, I summarize a model that includes a combination of intersegmental transfer, short-distance one-dimensional sliding, and length-specific microhomology recognition to efficiently align DNA sequences during the homology search. I also suggest some future directions to help further our understanding of the homology search. Where appropriate, I direct the reader to other recent reviews describing various issues related to homologous recombination. PMID:27129270
Conservation of a molecular target across species can be used as a line-of-evidence to predict the likelihood of chemical susceptibility. The web-based Sequence Alignment to Predict Across Species Susceptibility (SeqAPASS) tool was developed to simplify, streamline, and quantitatively assess protein sequence/structural similarity across taxonomic groups as a means to predict relative intrinsic susceptibility. The intent of the tool is to allow for evaluation of any potential protein target, so it is amenable to variable degrees of protein characterization, depending on available information about the chemical/protein interaction and the molecular target itself. To allow for flexibility in the analysis, a layered strategy was adopted for the tool. The first level of the SeqAPASS analysis compares primary amino acid sequences to a query sequence, calculating a metric for sequence similarity (including detection of candidate orthologs), the second level evaluates sequence similarity within selected domains (e.g., ligand-binding domain, DNA binding domain), and the third level of analysis compares individual amino acid residue positions identified as being of importance for protein conformation and/or ligand binding upon chemical perturbation. Each level of the SeqAPASS analysis provides increasing evidence to apply toward rapid, screening-level assessments of probable cross species susceptibility. Such analyses can support prioritization of chemicals for further ev
Kumar, Sudhir; Filipski, Alan
DNA sequence alignment is a prerequisite to virtually all comparative genomic analyses, including the identification of conserved sequence motifs, estimation of evolutionary divergence between sequences, and inference of historical relationships among genes and species. While it is mere common sense that inaccuracies in multiple sequence alignments can have detrimental effects on downstream analyses, it is important to know the extent to which the inferences drawn from these alignments are robust to errors and biases inherent in all sequence alignments. A survey of investigations into strengths and weaknesses of sequence alignments reveals, as expected, that alignment quality is generally poor for two distantly related sequences and can often be improved by adding additional sequences as stepping stones between distantly related species. Errors in sequence alignment are also found to have a significant negative effect on subsequent inference of sequence divergence, phylogenetic trees, and conserved motifs. However, our understanding of alignment biases remains rudimentary, and sequence alignment procedures continue to be used somewhat like benign formatting operations to make sequences equal in length. Because of the central role these alignments now play in our endeavors to establish the tree of life and to identify important parts of genomes through evolutionary functional genomics, we see a need for increased community effort to investigate influences of alignment bias on the accuracy of large-scale comparative genomics.
Nguyen, Ken D; Pan, Yi
Aligning multiple DNA/RNA/protein sequences to identify common functionalities, structures, or relationships between species is a fundamental task in bioinformatics. In this study, we propose a new multiple sequence alignment strategy that extracts sequence information and global and local sequence similarities to assign a different weight to each input sequence. A weighted pair-wise distance matrix is calculated from these sequences to build a dynamic alignment guiding tree. The tree can reorder its higher-level branches based on corresponding alignment results from lower tree levels to guarantee the highest alignment scores at each level of the tree. This technique improves alignment accuracy by up to 10% on many benchmarks when tested against multiple sequence alignment tools such as CLUSTALW (Thompson et al., 1994), DIALIGN (Morgenstern, 1999), T-COFFEE (Notredame et al., 2000), MUSCLE (Edgar, 2004), and PROBCONS (Do et al., 2005).
Rosenberg, Michael S
Sequence alignment is a common tool in bioinformatics and comparative genomics. It is generally assumed that multiple sequence alignment yields better results than pairwise sequence alignment, but this assumption has rarely been tested, and never with the control provided by simulation analysis. This study used sequence simulation to examine the gain in accuracy of adding a third sequence to a pairwise alignment, particularly concentrating on how the phylogenetic position of the additional sequence relative to the first pair changes the accuracy of the initial pair's alignment as well as their estimated evolutionary distance. The maximal gain in alignment accuracy was found not when the third sequence is directly intermediate between the initial two sequences, but rather when it perfectly subdivides the branch leading from the root of the tree to one of the original sequences (making it half as close to one sequence as the other). Evolutionary distance estimation in the multiple alignment framework, however, is largely unrelated to alignment accuracy and rather is dependent on the position of the third sequence; the closer the branch leading to the third sequence is to the root of the tree, the larger the estimated distance between the first two sequences. The bias in distance estimation appears to be a direct result of the standard greedy progressive algorithm used by many multiple alignment methods. These results have implications for choosing new taxa and genomes to sequence when resources are limited.
Deng, Xin; Cheng, Jianlin
Multiple Sequence Alignment (MSA) is an essential tool in protein structure modeling, gene and protein function prediction, DNA motif recognition, phylogenetic analysis, and many other bioinformatics tasks. Therefore, improving the accuracy of multiple sequence alignment is an important long-term objective in bioinformatics. We designed and developed a new method MSACompro to incorporate predicted secondary structure, relative solvent accessibility, and residue-residue contact information into the currently most accurate posterior probability-based MSA methods to improve the accuracy of multiple sequence alignments. Different from the multiple sequence alignment methods that use the tertiary structure information of some sequences, our method uses the structural information purely predicted from sequences. In this chapter, we first introduce some background and related techniques in the field of multiple sequence alignment. Then, we describe the detailed algorithm of MSACompro. Finally, we show that integrating predicted protein structural information improved the multiple sequence alignment accuracy.
Nguyen, Ken D; Pan, Yi
A common and cost-effective mechanism to identify the functionalities, structures, or relationships between species is multiple-sequence alignment, in which DNA/RNA/protein sequences are arranged and aligned so that similarities between sequences are clustered together. Correctly identifying and aligning these biological sequence similarities helps in tasks ranging from unraveling the mystery of species evolution to drug design. We present our knowledge-based multiple sequence alignment (KB-MSA) technique that utilizes the existing knowledge databases such as SWISSPROT, GENBANK, or HOMSTRAD to provide a more realistic and reliable sequence alignment. We also provide a modified version of this algorithm (CB-MSA) that utilizes the sequence consistency information when sequence knowledge databases are not available. Our benchmark tests on BAliBASE, PREFAB, HOMSTRAD, and SABMARK references show accuracy improvements of up to 10 percent on twilight data sets against many leading alignment tools such as ISPALIGN, PADT, CLUSTALW, MAFFT, PROBCONS, and T-COFFEE.
Chakrabarti, Saikat; Lanczycki, Christopher J.; Panchenko, Anna R.; Przytycka, Teresa M.; Thiessen, Paul A.; Bryant, Stephen H.
Accurate multiple sequence alignments of proteins are very important to several areas of computational biology and provide an understanding of phylogenetic history of domain families, their identification and classification. This article presents a new algorithm, REFINER, that refines a multiple sequence alignment by iterative realignment of its individual sequences with the predetermined conserved core (block) model of a protein family. Realignment of each sequence can correct misalignments between a given sequence and the rest of the profile and at the same time preserves the family's overall block model. Large-scale benchmarking studies showed a noticeable improvement of alignment after refinement. This can be inferred from the increased alignment score and enhanced sensitivity for database searching using the sequence profiles derived from refined alignments compared with the original alignments. A standalone version of the program is available by ftp distribution () and will be incorporated into the next release of the Cn3D structure/alignment viewer. PMID:16707662
Nellore, Abhinav; Bobkov, Konstantin; Howe, Elizabeth; Pankov, Aleksandr; Diaz, Aaron; Song, Jun S.
We introduce NSeq, a fast and efficient Java application for finding positioned nucleosomes from the high-throughput sequencing of MNase-digested mononucleosomal DNA. NSeq includes a user-friendly graphical interface, computes false discovery rates (FDRs) for candidate nucleosomes from Monte Carlo simulations, plots nucleosome coverage and centers, and exploits the availability of multiple processor cores by parallelizing its computations. Java binaries and source code are freely available at https://github.com/songlab/NSeq. The software is supported on all major platforms equipped with Java Runtime Environment 6 or later. PMID:23335939
Penner, Orion; Grassberger, Peter; Paczuski, Maya
Alignment of biological sequences such as DNA, RNA or proteins is one of the most widely used tools in computational bioscience. While the accomplishments of sequence alignment algorithms are undeniable, the fact remains that these algorithms are based upon heuristic scoring schemes. Therefore, these algorithms do not provide model-independent and objective measures of how similar two (or more) sequences actually are. Although information theory provides such a similarity measure - the mutual information (MI) - numerous previous attempts to connect sequence alignment and information have not produced realistic estimates for the MI from a given alignment. We report on a simple and flexible approach to get robust estimates of MI from global alignments. The presented results may help establish MI as a reliable tool for evaluating the quality of global alignments, judging the relative merits of different alignment algorithms, and estimating the significance of specific alignments.
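For readers unfamiliar with the quantity involved, the sketch below shows a naive plug-in estimate of mutual information, in bits, between the two rows of a pairwise global alignment. It is an illustration only and an assumption of this example: it simply skips gapped columns and is known to be biased for short alignments, so it should not be read as the estimator developed by the authors.

```python
from collections import Counter
from math import log2

def plugin_mutual_information(row_a, row_b):
    """Naive MI estimate (bits) from the ungapped columns of an alignment."""
    pairs = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)                      # counts of aligned symbol pairs
    left = Counter(x for x, _ in pairs)         # marginal counts, first row
    right = Counter(y for _, y in pairs)        # marginal counts, second row
    # Sum of p(x, y) * log2(p(x, y) / (p(x) * p(y))) over observed pairs.
    return sum(c / n * log2(c * n / (left[x] * right[y]))
               for (x, y), c in joint.items())
```

Two identical rows over four equally frequent symbols give 2 bits, while a row aligned against a constant row gives 0 bits.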
Tress, Michael L; Graña, Osvaldo; Valencia, Alfonso
The Server for Quick Alignment Reliability Evaluation (SQUARE) is a Web-based version of the method we developed to predict regions of reliably aligned residues in sequence alignments. Given an alignment between a query sequence and a sequence of known structure, SQUARE is able to predict which residues are reliably aligned. The server accesses a database of profiles of sequences of known three-dimensional structures in order to calculate the scores for each residue in the alignment. SQUARE produces a graphical output of the residue profile-derived alignment scores along with an indication of the reliability of the alignment. In addition, the scores can be compared against template secondary structure, conserved residues and important sites.
Bhat, Basharat; Ganai, Nazir A; Andrabi, Syed Mudasir; Shah, Riaz A; Singh, Ashutosh
Membrane proteins play a significant role in living cells. Transmembrane proteins are estimated to constitute approximately 30% of proteins at the genomic scale. It has been difficult to develop specific alignment tools for transmembrane proteins because of the limited number of experimentally validated protein structures. Alignment tools based on homology modeling provide fairly good results, recapitulating 70-80% of the residues in the reference alignment, provided that all input sequences have known template structures. However, homology modeling tools take a substantial amount of time, so aligning large numbers of sequences becomes computationally demanding. Here we present TM-Aligner, a new tool for transmembrane protein sequence alignment. TM-Aligner is based on the Wu-Manber and dynamic string matching algorithms, which significantly improve the accuracy and speed of multiple sequence alignment. We compared TM-Aligner with other popular tools and performed benchmarking using three separate reference sets: BAliBASE 3.0 reference set 7 of alpha-helical transmembrane proteins, structure-based alignments of transmembrane proteins from the Pfam database, and structure alignments from GPCRDB. Benchmarking against the reference datasets indicated that TM-Aligner is a more advanced method with the shortest turnaround time and significant improvements over the most accurate methods such as PROMALS, MAFFT, TM-Coffee, Kalign, ClustalW, Muscle and PRALINE. TM-Aligner is freely available through http://lms.snu.edu.in/TM-Aligner/ .
Birzele, Fabian; Gewehr, Jan E; Zimmer, Ralf
Sequence-structure alignments are a common means for protein structure prediction in the fields of fold recognition and homology modeling, and there is a broad variety of programs that provide such alignments based on sequence similarity, secondary structure or contact potentials. Nevertheless, finding the best sequence-structure alignment in a pool of alignments remains a difficult problem. QUASAR (quality of sequence-structure alignments ranking) provides a unifying framework for scoring sequence-structure alignments that aids finding well-performing combinations of well-known and custom-made scoring schemes. Those scoring functions can be benchmarked against widely accepted quality scores like MaxSub, TMScore, Touch and APDB, thus enabling users to test their own alignment scores against 'standard-of-truth' structure-based scores. Furthermore, individual score combinations can be optimized with respect to benchmark sets based on known structural relationships using QUASAR's in-built optimization routines.
Löytynoja, Ari; Goldman, Nick
Phylogeny-aware progressive alignment has been found to perform well in phylogenetic alignment benchmarks and to produce superior alignments for the inference of selection on codon sequences. Its implementation in the PRANK alignment program package also allows modelling of complex evolutionary processes and inference of posterior probabilities for sequence sites evolving under each distinct scenario, either simultaneously with the alignment of sequences or as a post-processing step for an existing alignment. This has led to software with many advanced features, and users may find it difficult to generate optimal alignments, visualise the full information in their alignment results, or post-process these results, e.g. by objectively selecting subsets of alignment sites. We have created a web server called webPRANK that provides an easy-to-use interface to the PRANK phylogeny-aware alignment algorithm. The webPRANK server supports the alignment of DNA, protein and codon sequences as well as protein-translated alignment of cDNAs, and includes built-in structure models for the alignment of genomic sequences. The resulting alignments can be exported in various formats widely used in evolutionary sequence analyses. The webPRANK server also includes a powerful web-based alignment browser for the visualisation and post-processing of the results in the context of a cladogram relating the sequences, allowing (e.g.) removal of alignment columns with low posterior reliability. In addition to de novo alignments, webPRANK can be used for the inference of ancestral sequences with phylogenetically realistic gap patterns, and for the annotation and post-processing of existing alignments. The webPRANK server is freely available on the web at http://tinyurl.com/webprank . The webPRANK server incorporates phylogeny-aware multiple sequence alignment, visualisation and post-processing in an easy-to-use web interface. It widens the user base of phylogeny-aware multiple sequence
Background: Protein alignments are an essential tool for many bioinformatics analyses. While sequence alignments are accurate for proteins of high sequence similarity, they become unreliable as they approach the so-called 'twilight zone' where sequence similarity becomes indistinguishable from random. For such distant pairs, structure alignment is of much better quality. Nevertheless, sequence alignment is the only choice in the majority of cases where structural data is not available. This situation demands development of methods that extend the applicability of accurate sequence alignment to distantly related proteins. Results: We develop a sequence alignment method that combines the prediction of a structural profile based on the protein's sequence with the alignment of that profile using our recently published alignment tool SABERTOOTH. In particular, we predict the contact vector of protein structures using an artificial neural network based on position-specific scoring matrices generated by PSI-BLAST and align these predicted contact vectors. The resulting sequence alignments are assessed using two different tests: First, we assess the alignment quality by measuring the derived structural similarity for cases in which structures are available. In a second test, we quantify the ability of the significance score of the alignments to recognize structural and evolutionary relationships. As a benchmark we use a representative set of the SCOP (structural classification of proteins) database, with similarities ranging from closely related proteins at SCOP family level, to very distantly related proteins at SCOP fold level. Comparing these results with some prominent sequence alignment tools, we find that SABERTOOTH produces sequence alignments of better quality than those of Clustal W, T-Coffee, MUSCLE, and PSI-BLAST. HHpred, one of the most sophisticated and computationally expensive tools available, outperforms our alignment algorithm at family and superfamily levels
Nute, Michael; Warnow, Tandy
Multiple sequence alignment is an important task in bioinformatics, and alignments of large datasets containing hundreds or thousands of sequences are increasingly of interest. While many alignment methods exist, the most accurate alignments are likely to be based on stochastic models where sequences evolve down a tree with substitutions, insertions, and deletions. While some methods have been developed to estimate alignments under these stochastic models, only the Bayesian method BAli-Phy has been able to run on even moderately large datasets, containing 100 or so sequences. A technique to extend BAli-Phy to enable alignments of thousands of sequences could potentially improve alignment and phylogenetic tree accuracy on large-scale data beyond the best-known methods today. We use simulated data with up to 10,000 sequences representing a variety of model conditions, including some that are significantly divergent from the statistical models used in BAli-Phy and elsewhere. We give a method for incorporating BAli-Phy into PASTA and UPP, two strategies for enabling alignment methods to scale to large datasets, and give alignment and tree accuracy results measured against the ground truth from simulations. Comparable results are also given for other methods capable of aligning this many sequences. Extensions of BAli-Phy using PASTA and UPP produce significantly more accurate alignments and phylogenetic trees than the current leading methods.
Aruk, Taner; Ustek, Duran; Kursun, Olcay
Finding large deletions in genome sequences has become increasingly useful in bioinformatics, such as in clinical research and diagnosis. Although there are a number of publicly available next generation sequencing mapping and sequence alignment programs, these software packages do not correctly align fragments containing deletions larger than one kb. We present a fast alignment software package, BinaryPartialAlign, that can be used by wet lab scientists to find long structural variations in their experiments. For BinaryPartialAlign, we make use of the Smith-Waterman (SW) algorithm with a binary-search-based approach for alignment with large gaps, which we call partial alignment. The BinaryPartialAlign implementation is compared with other straightforward applications of SW. Simulation results on mtDNA fragments demonstrate the effectiveness (runtime and accuracy) of the proposed method.
Colbourn, Charles J; Kumar, Sudhir
Multiple sequence alignment is fundamental. Exponential growth in computation time appears to be inevitable when an optimal alignment is required for many sequences. Exact costs of optimum alignments are therefore rarely computed. Consequently much effort has been invested in algorithms for alignment that are heuristic, or explore a restricted class of solutions. These give an upper bound on the alignment cost, but it is equally important to determine the quality of the solution obtained. In the absence of an optimal alignment with which to compare, lower bounds may be calculated to assess the quality of the alignment. As more effort is invested in improving upper bounds (alignment algorithms), it is therefore important to improve lower bounds as well. Although numerous cost metrics can be used to determine the quality of an alignment, many are based on sum-of-pairs (SP) measures and their generalizations. Two standard and two new methods are considered for using exact 2-way and 3-way alignments to compute lower bounds on total SP alignment cost; one new method fares well with respect to accuracy, while the other reduces the computation time. The first employs exhaustive computation of exact 3-way alignments, while the second employs an efficient heuristic to compute a much smaller number of exact 3-way alignments. Calculating all 3-way alignments exactly and computing their average improves lower bounds on sum of SP cost in v-way alignments. However judicious selection of a subset of all 3-way alignments can yield a further improvement with minimal additional effort. On the other hand, a simple heuristic to select a random subset of 3-way alignments (a random packing) yields accuracy comparable to averaging all 3-way alignments with substantially less computational effort. Calculation of lower bounds on SP cost (and thus the quality of an alignment) can be improved by employing a mixture of 3-way and 2-way alignments.
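The relationship between an alignment's SP cost (an upper bound on the optimum) and the standard 2-way lower bound can be sketched as follows. The unit cost model (mismatch = 1, residue against gap = 1, gap against gap = 0) and the function names are assumptions for this example; the 3-way bounds studied in the paper are not reproduced here.

```python
from itertools import combinations

def edit_distance(a, b):
    """Optimal unit-cost pairwise alignment cost of a and b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]

def sp_cost(msa):
    """Sum-of-pairs cost of a multiple alignment given as equal-length rows."""
    total = 0
    for r, s in combinations(msa, 2):
        for x, y in zip(r, s):
            if x == "-" and y == "-":
                continue                 # column of two gaps costs nothing
            total += x != y              # mismatch or residue-vs-gap costs 1
    return total

def sp_lower_bound(seqs):
    """2-way lower bound: sum of optimal pairwise costs over all pairs."""
    return sum(edit_distance(a, b) for a, b in combinations(seqs, 2))
```

Because every pair of rows in a multiple alignment induces a valid pairwise alignment, `sp_cost(msa)` can never fall below `sp_lower_bound` of the ungapped sequences; the gap between the two numbers indicates how far the alignment might be from optimal.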
Background: In this study we consider DNA sequences as mathematical strings. Total and reduced alignments between two DNA sequences have been considered in the literature to measure their similarity. Results for explicit representations of some alignments have been already obtained. Results: We present exact, explicit and computable formulas for the number of different possible alignments between two DNA sequences and a new formula for a class of reduced alignments. Conclusions: A unified approach for a wide class of alignments between two DNA sequences has been provided. The formula is computable and, if complemented by software development, will provide a deeper insight into the theory of sequence alignment and give rise to new comparison methods. AMS Subject Classification: Primary 92B05, 33C20; secondary 39A14, 65Q30. PMID:24684679
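As a point of reference, one classical way to count alignments treats an alignment as a sequence of columns (a residue pair, a gap in the first sequence, or a gap in the second), which gives the Delannoy numbers. The sketch below implements that recurrence only; it is an assumption that this corresponds to the "total" alignments of the abstract, and the reduced classes are not covered.

```python
def count_alignments(m, n):
    """Number of column-wise alignments of sequences of lengths m and n.

    Uses D(i, j) = D(i-1, j) + D(i, j-1) + D(i-1, j-1) with D(0, j) = D(i, 0) = 1.
    """
    row = [1] * (n + 1)
    for _ in range(m):
        new = [1]
        for j in range(1, n + 1):
            new.append(row[j] + new[j - 1] + row[j - 1])
        row = new
    return row[n]


for length in (1, 2, 3, 10):
    print(length, count_alignments(length, length))
```

The rapid growth of these counts is one reason exhaustive enumeration of alignments is impractical even for short sequences.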
Brudno, Michael; Chapman, Michael; Göttgens, Berthold; Batzoglou, Serafim; Morgenstern, Burkhard
Background: Genomic sequence alignment is a powerful method for genome analysis and annotation, as alignments are routinely used to identify functional sites such as genes or regulatory elements. With a growing number of partially or completely sequenced genomes, multiple alignment is playing an increasingly important role in these studies. In recent years, various tools for pair-wise and multiple genomic alignment have been proposed. Some of them are extremely fast, but often efficiency is achieved at the expense of sensitivity. One way of combining speed and sensitivity is to use an anchored-alignment approach. In a first step, a fast search program identifies a chain of strong local sequence similarities. In a second step, regions between these anchor points are aligned using a slower but more accurate method. Results: Herein, we present CHAOS, a novel algorithm for rapid identification of chains of local pair-wise sequence similarities. Local alignments calculated by CHAOS are used as anchor points to improve the running time of DIALIGN, a slow but sensitive multiple-alignment tool. We show that this way, the running time of DIALIGN can be reduced by more than 95% for BAC-sized and longer sequences, without affecting the quality of the resulting alignments. We apply our approach to a set of five genomic sequences around the stem-cell-leukemia (SCL) gene and demonstrate that exons and small regulatory elements can be identified by our multiple-alignment procedure. Conclusion: We conclude that the novel CHAOS local alignment tool is an effective way to significantly speed up global alignment tools such as DIALIGN without reducing the alignment quality. We likewise demonstrate that the DIALIGN/CHAOS combination is able to accurately align short regulatory sequences in distant orthologues. PMID:14693042
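The first step of an anchored-alignment approach, selecting a consistent chain of local similarities, can be illustrated with a simple quadratic-time dynamic programme. This is a generic sketch, not the CHAOS algorithm; the tuple layout `(pos_a, pos_b, length, score)` is a hypothetical representation of a local hit.

```python
def best_chain(anchors):
    """Return (total_score, chain) for the highest-scoring collinear chain.

    anchors: list of (pos_a, pos_b, length, score) with length > 0.
    Two anchors may be chained if the first ends before the second
    starts in both sequences.
    """
    order = sorted(anchors)
    if not order:
        return 0, []
    best = [a[3] for a in order]
    back = [None] * len(order)
    for i, (ai, bi, _li, si) in enumerate(order):
        for j in range(i):
            aj, bj, lj, _sj = order[j]
            if aj + lj <= ai and bj + lj <= bi and best[j] + si > best[i]:
                best[i] = best[j] + si
                back[i] = j
    i = max(range(len(order)), key=best.__getitem__)
    total, chain = best[i], []
    while i is not None:
        chain.append(order[i])
        i = back[i]
    return total, chain[::-1]


hits = [(0, 0, 5, 5), (10, 12, 4, 4), (6, 30, 8, 8), (20, 25, 6, 6)]
print(best_chain(hits))
```

The regions between consecutive anchors of the chain are what a slower, more sensitive aligner would then fill in.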
Homer, Nils; Merriman, Barry; Nelson, Stanley F
Background: DNA sequence comparison is based on optimal local alignment of two sequences using a similarity score. However, some new DNA sequencing technologies do not directly measure the base sequence, but rather an encoded form, such as the two-base encoding considered here. In order to compare such data to a reference sequence, the data must be decoded into sequence. The decoding is deterministic, but the possibility of measurement errors requires searching among all possible error modes and resulting alignments to achieve an optimal balance of fewer errors versus greater sequence similarity. Results: We present an extension of the standard dynamic programming method for local alignment, which simultaneously decodes the data and performs the alignment, maximizing a similarity score based on a weighted combination of errors and edits, and allowing an affine gap penalty. We also present simulations that demonstrate the performance characteristics of our two-base encoded alignment method and contrast those with standard DNA sequence alignment under the same conditions. Conclusion: The new local alignment algorithm for two-base encoded data has substantial power to properly detect and correct measurement errors while identifying underlying sequence variants, and facilitating genome re-sequencing efforts based on this form of sequence data. PMID:19508732
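To see why decoding and alignment are coupled, it helps to look at a two-base encoding directly. The sketch below assumes the common dinucleotide colour code in which each colour is the XOR of the 2-bit codes of adjacent bases (A=0, C=1, G=2, T=3); whether this is exactly the encoding used in the paper is an assumption, and the joint decode-and-align algorithm itself is not implemented.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"


def encode(seq):
    """Translate a base sequence into colours, one per adjacent base pair."""
    return [CODE[x] ^ CODE[y] for x, y in zip(seq, seq[1:])]


def decode(first_base, colors):
    """Rebuild the base sequence from a known first base and the colours."""
    out = [first_base]
    for c in colors:
        out.append(BASE[CODE[out[-1]] ^ c])
    return "".join(out)


colors = encode("ATGGCA")
print(colors, decode("A", colors))

# A single colour measurement error changes every base after it.
corrupted = colors[:]
corrupted[1] = 0
print(decode("A", corrupted))
```

Since one miscalled colour propagates through the rest of a naively decoded read, treating colour errors inside the alignment recurrence, as the abstract describes, avoids aligning a read that is wrong from the error onward.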
Meyers, H. G., IV; Sautter, L.
The Ontong Java Plateau (OJP), north of the Solomon Islands, Indonesia, is a submerged seafloor platform, larger than Alaska and full of intricate systems of channels, atolls and seamounts. This area has remained relatively unstudied because of both the area's remote location and low number of ships carrying advanced sonar systems. The OJP is believed to have been formed by one of the largest volcanic eruptions in Earth's history. This study uses EM302 multibeam sonar data collected on the R/V Falkor in 2014 by the University of Tasmania's Institute for Marine and Antarctic Studies to better understand relationships between the seafloor geomorphology and tectonic processes that formed numerous unexplored seamounts. The area surveyed is situated along the OJP's central northeast margin, and includes a small chain of six seamounts that range from 300 to 700 m in vertical relief. These seamounts are situated within the axis of a major 14 km wide submarine channel that was likely formed by a sequence of turbidity currents. Using CARIS HIPS and SIPS 9.0 post-processing software, seamount and channel morphology were characterized with 2 dimensional profiles and 3 dimensional images. Backscatter intensity was used to identify relative substrate hardness of the seamounts and surrounding seafloor areas. Scour and depositional features from the turbidity flows are evident at the base of several seamounts, indicating that the submarine channel bifurcated when turbidity flows encountered the seamount chain.
Demkin, V V
The Excel platform was used to convert multiple alignments of nucleotide sequences, obtained from the BLAST network service, into a form suitable for visual analysis and editing. Two macros for MS Excel 2007 were constructed. The array of aligned sequences, once transformed into an Excel table and processed with the macros, is more convenient for analysis than the initial HTML data.
Lin, Luan; Khider, Deborah; Lisiecki, Lorraine E.; Lawrence, Charles E.
The assessment of age uncertainty in stratigraphically aligned records is a pressing need in paleoceanographic research. The alignment of ocean sediment cores is used to develop mutually consistent age models for climate proxies and is often based on the δ18O of calcite from benthic foraminifera, which records a global ice volume and deep water temperature signal. To date, δ18O alignment has been performed by manual, qualitative comparison or by deterministic algorithms. Here we present a hidden Markov model (HMM) probabilistic algorithm to find 95% confidence bands for δ18O alignment. This model considers the probability of every possible alignment based on its fit to the δ18O data and transition probabilities for sedimentation rate changes obtained from radiocarbon-based estimates for 37 cores. Uncertainty is assessed using a stochastic back trace recursion to sample alignments in exact proportion to their probability. We applied the algorithm to align 35 late Pleistocene records to a global benthic δ18O stack and found that the mean width of 95% confidence intervals varies between 3 and 23 kyr depending on the resolution and noisiness of the record's δ18O signal. Confidence bands within individual cores also vary greatly, ranging from ~0 to >40 kyr. These alignment uncertainty estimates will allow researchers to examine the robustness of their conclusions, including the statistical evaluation of lead-lag relationships between events observed in different cores.
Bawono, Punto; van der Velde, Arjan; Abeln, Sanne; Heringa, Jaap
Multiple Sequence Alignment (MSA) methods are typically benchmarked on sets of reference alignments. The quality of the alignment can then be represented by the sum-of-pairs (SP) or column (CS) scores, which measure the agreement between a reference and corresponding query alignment. Both the SP and CS scores treat mismatches between a query and reference alignment as equally bad, and do not take into account the separation, in the query alignment, between two amino acids that should have been matched according to the reference alignment. This is significant since the magnitude of alignment shifts is often of relevance in biological analyses, including homology modeling and MSA refinement/manual alignment editing. In this study we develop a new alignment benchmark scoring scheme, SPdist, that takes the degree of discordance of mismatches into account by measuring the sequence distance between mismatched residue pairs in the query alignment. Using this new score along with the standard SP score, we investigate the discriminatory behavior of the new score by assessing how well six different MSA methods perform with respect to BAliBASE reference alignments. The SP score and the SPdist score yield very similar outcomes when the reference and query alignments are close. However, for more divergent reference alignments the SPdist score is able to distinguish between methods that keep alignments approximately close to the reference and those exhibiting larger shifts. We observed that by using SPdist together with SP scoring we were able to better delineate the alignment quality difference between alternative MSA methods. With a case study we exemplify why it is important, from a biological perspective, to consider the separation of mismatches. The SPdist scoring scheme has been implemented in the VerAlign web server (http://www.ibi.vu.nl/programs/veralignwww/). The code for calculating the SPdist score is also available upon request. PMID:25993129
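The standard SP benchmark score mentioned above can be sketched as the fraction of residue pairs aligned in the reference that are also aligned in the query. This is a minimal illustration of SP only; the SPdist distance weighting is not implemented because its exact definition is not given in the abstract, and the function names are hypothetical.

```python
from itertools import combinations


def aligned_pairs(rows):
    """Set of residue pairs ((seq, pos), (seq, pos)) that share a column."""
    counters = [0] * len(rows)  # next residue index for each sequence
    pairs = set()
    for col in zip(*rows):
        present = []
        for s, ch in enumerate(col):
            if ch != "-":
                present.append((s, counters[s]))
                counters[s] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sp_score(reference, query):
    """Fraction of reference residue pairs that the query alignment reproduces."""
    ref = aligned_pairs(reference)
    return len(ref & aligned_pairs(query)) / len(ref) if ref else 0.0


reference = ["AC-GT", "ACAGT"]
query = ["ACG-T", "ACAGT"]
print(sp_score(reference, query))
```

In this example the misplaced G counts as one lost pair regardless of whether it is shifted by one column or by fifty, which is the insensitivity that a distance-aware score is meant to address.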
Afridi, Muhammad Ishaq
The family of evolutionary or genetic algorithms is used in various fields of bioinformatics. Genetic algorithms (GAs) can be used for simultaneous comparison of a large pool of DNA or protein sequences. This article explains how a GA is used in combination with other methods, such as the progressive multiple sequence alignment strategy, to obtain an optimal multiple sequence alignment (MSA). Optimal MSA is of great importance in bioinformatics and related disciplines. Evolutionary algorithms evolve and improve their performance. In this optimisation, the initial pair-wise alignment is achieved through a progressive method, and then a good objective function is used to select and align more alignments and profiles. Child and subpopulation initialisation is based upon changes in the probability of similarity or the distance matrix of the alignment population. In this genetic algorithm, optimisation of mutation, crossover and migration in the population of candidate solutions reflects events of natural organic evolution.
Taylor, William R
Current volumes of sequence data can lead to large numbers of hits identified on a search, typically in the range of 10s to 100s of thousands. It is often quite difficult to tell from these raw results whether the search has been a success or has picked up sequences with little or no relationship to the query. The best approach to this problem is to cluster and align the resulting families; however, existing methods concentrate on fast clustering and either do not align the sequences or only perform a limited alignment. A method (MULSEL) is presented that combines fast peptide-based pre-sorting with a following cascade of mini-alignments, each of which is generated with a robust profile/profile method. From these mini-alignments, a representative sequence is selected, based on a variety of intrinsic and user-specified criteria that are combined to produce the sequence collection for the next cycle of alignment. For moderate sized sequence collections (10s of thousands) the method executes on a laptop computer within seconds or minutes. MULSEL bridges a gap between fast clustering methods and slower multiple sequence alignment methods and provides a seamless transition from one to the other. Furthermore, it presents the resulting reduced family in a graphical manner that makes it clear if family members have been misaligned or if there are sequences present that appear inconsistent.
Bradley, Robert K; Pachter, Lior; Holmes, Ian
Whole-genome screens suggest that eukaryotic genomes are dense with non-coding RNAs (ncRNAs). We introduce a novel approach to RNA multiple alignment which couples a generative probabilistic model of sequence and structure with an efficient sequence annealing approach for exploring the space of multiple alignments. This leads to a new software program, Stemloc-AMA, that is both accurate and specific in the alignment of multiple related RNA sequences. When tested on the benchmark datasets BRalibase II and BRalibase 2.1, Stemloc-AMA has comparable sensitivity to and better specificity than the best competing methods. We use a large-scale random sequence experiment to show that while most alignment programs maximize sensitivity at the expense of specificity, even to the point of giving complete alignments of non-homologous sequences, Stemloc-AMA aligns only sequences with detectable homology and leaves unrelated sequences largely unaligned. Such accurate and specific alignments are crucial for comparative-genomics analysis, from inferring phylogeny to estimating substitution rates across different lineages. Stemloc-AMA is available from http://biowiki.org/StemLocAMA as part of the dart software package for sequence analysis.
One of the most fundamental operations in biological sequence analysis is multiple sequence alignment (MSA). The basic multiple sequence alignment problem is to determine the most biologically plausible alignments of protein or DNA sequences. In this paper, an alignment method using a genetic algorithm for multiple sequence alignment is proposed. Two genetic operators, crossover and mutation, were defined and implemented with the proposed method in order to follow the population evolution and the quality of the aligned sequences. The proposed method is assessed on a protein benchmark dataset, e.g., BALIBASE, by comparing the obtained results to those obtained with other alignment algorithms, e.g., SAGA, RBT-GA, PRRP, HMMT, SB-PIMA, CLUSTALX, CLUSTAL W, DIALIGN and PILEUP8. Experiments on a wide range of data have shown that the proposed algorithm is much better (in terms of score) than previously proposed algorithms in its ability to achieve high alignment quality. PMID:27065770
Ibarra, Ignacio L; Melo, Francisco
Dynamic programming (DP) is a general optimization strategy that is successfully used across various disciplines of science. In bioinformatics, it is widely applied in calculating the optimal alignment between pairs of protein or DNA sequences. These alignments form the basis of new, verifiable biological hypotheses. Despite its importance, there are no interactive tools available for training and education on understanding the DP algorithm. Here, we introduce an interactive computer application with a graphical interface, for the purpose of educating students about DP. The program displays the DP scoring matrix and the resulting optimal alignment(s), while allowing the user to modify key parameters such as the values in the similarity matrix, the sequence alignment algorithm version and the gap opening/extension penalties. We hope that this software will be useful to teachers and students of bioinformatics courses, as well as researchers who implement the DP algorithm for diverse applications. The software is freely available at: http://melolab.org/sat. The software is written in the Java computer language, thus it runs on all major platforms and operating systems including Windows, Mac OS X and LINUX. All inquiries or comments about this software should be directed to Francisco Melo at firstname.lastname@example.org.
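The kind of scoring matrix and traceback such a teaching tool displays can be sketched with a basic global (Needleman-Wunsch) alignment. This is an independent illustration, not code from the described software; it assumes a simple match/mismatch score and a linear gap penalty rather than a similarity matrix with separate gap opening and extension penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score_matrix, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right cell, preferring the diagonal move.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score, "".join(reversed(top)), "".join(reversed(bottom))


matrix, top, bottom = needleman_wunsch("GATTACA", "GATACA")
print(matrix[-1][-1])
print(top)
print(bottom)
```

Several alignments can share the optimal score; this sketch reports only one of them, whereas the abstract notes that the tool shows the optimal alignment(s).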
Katoh, Kazutaka; Standley, Daron M
We present a new feature of the MAFFT multiple alignment program for suppressing over-alignment (aligning unrelated segments). Conventional MAFFT is highly sensitive in aligning conserved regions in remote homologs, but the risk of over-alignment has recently been growing, as low-quality or noisy sequences are increasing in protein sequence databases due, for example, to sequencing errors and difficulty in gene prediction. The proposed method utilizes a variable scoring matrix for different pairs of sequences (or groups) in a single multiple sequence alignment, based on the global similarity of each pair. This method significantly increases the number of correctly gapped sites in real examples and in simulations under various conditions. Regarding sensitivity, the effect of the proposed method is slightly negative in real protein-based benchmarks, and mostly neutral in simulation-based benchmarks. This approach is based on natural biological reasoning and should be compatible with many methods based on dynamic programming for multiple sequence alignment. The new feature is available in MAFFT versions 7.263 and higher (http://mafft.cbrc.jp/alignment/software/). Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
Fazal, Mohammed-Abbas; Alexander, Sarah; Burnett, Edward; Deheer-Graham, Ana; Oliver, Karen; Holroyd, Nancy; Parkhill, Julian; Russell, Julie E
Salmonellae are a significant cause of morbidity and mortality globally. Here, we report the first complete genome sequence for Salmonella enterica subsp. enterica serovar Java strain NCTC5706. This strain is of historical significance, having been isolated in the pre-antibiotic era and deposited into the National Collection of Type Cultures in 1939. PMID:27811100
Landan, Giddy; Graur, Dan
We characterize pairwise and multiple sequence alignment (MSA) errors by comparing true alignments from simulations of sequence evolution with reconstructed alignments. The vast majority of reconstructed alignments contain many errors. Error rates rapidly increase with sequence divergence, thus, for even intermediate degrees of sequence divergence, more than half of the columns of a reconstructed alignment may be expected to be erroneous. In closely related sequences, most errors consist of the erroneous positioning of a single indel event and their effect is local. As sequences diverge, errors become more complex as a result of the simultaneous mis-reconstruction of many indel events, and the lengths of the affected MSA segments increase dramatically. We found a systematic bias towards underestimation of the number of gaps, which leads to the reconstructed MSA being on average shorter than the true one. Alignment errors are unavoidable even when the evolutionary parameters are known in advance. Correct reconstruction can only be guaranteed when the likelihood of true alignment is uniquely optimal. However, true alignment features are very frequently sub-optimal or co-optimal, with the result that optimal albeit erroneous features are incorporated into the reconstructed MSA. Progressive MSA utilizes a guide-tree in the reconstruction of MSAs. The quality of the guide-tree was found to affect MSA error levels only marginally.
Wright, Erik S
Alignment of large and diverse sequence sets is a common task in biological investigations, yet there remains considerable room for improvement in alignment quality. Multiple sequence alignment programs tend to reach maximal accuracy when aligning only a few sequences, and then diminish steadily as more sequences are added. This drop in accuracy can be partly attributed to a build-up of error and ambiguity as more sequences are aligned. Most high-throughput sequence alignment algorithms do not use contextual information under the assumption that sites are independent. This study examines the extent to which local sequence context can be exploited to improve the quality of large multiple sequence alignments. Two predictors based on local sequence context were assessed: (i) single sequence secondary structure predictions, and (ii) modulation of gap costs according to the surrounding residues. The results indicate that context-based predictors have appreciable information content that can be utilized to create more accurate alignments. Furthermore, local context becomes more informative as the number of sequences increases, enabling more accurate protein alignments of large empirical benchmarks. These discoveries became the basis for DECIPHER, a new context-aware program for sequence alignment, which outperformed other programs on large sequence sets. Predicting secondary structure based on local sequence context is an efficient means of breaking the independence assumption in alignment. Since secondary structure is more conserved than primary sequence, it can be leveraged to improve the alignment of distantly related proteins. Moreover, secondary structure predictions increase in accuracy as more sequences are used in the prediction. This enables the scalable generation of large sequence alignments that maintain high accuracy even on diverse sequence sets. The DECIPHER R package and source code are freely available for download at DECIPHER.cee.wisc.edu and from the
Hara, Toshihide; Sato, Keiko; Ohya, Masanori
In a previous paper [1] we proposed a new method for performing pairwise alignment of protein sequences. The method, called MTRAP, achieves the highest performance compared with other alignment methods such as ClustalW2 [2,3] on two benchmarks for alignment accuracy. In this paper, we introduce a new measure between two amino acids based on the formation of peptide bonds. The measure is implemented in the MTRAP software to further improve alignment accuracy. Our alignment software is available at
Schatz, Michael C; Trapnell, Cole; Delcher, Arthur L; Varshney, Amitabh
The recent availability of new, less expensive high-throughput DNA sequencing technologies has yielded a dramatic increase in the volume of sequence data that must be analyzed. These data are being generated for several purposes, including genotyping, genome resequencing, metagenomics, and de novo genome assembly projects. Sequence alignment programs such as MUMmer have proven essential for analysis of these data, but researchers will need ever faster, high-throughput alignment tools running on inexpensive hardware to keep up with new sequence technologies. This paper describes MUMmerGPU, an open-source high-throughput parallel pairwise local sequence alignment program that runs on commodity Graphics Processing Units (GPUs) in common workstations. MUMmerGPU uses the new Compute Unified Device Architecture (CUDA) from nVidia to align multiple query sequences against a single reference sequence stored as a suffix tree. By processing the queries in parallel on the highly parallel graphics card, MUMmerGPU achieves more than a 10-fold speedup over a serial CPU version of the sequence alignment kernel, and outperforms the exact alignment component of MUMmer on a high end CPU by 3.5-fold in total application time when aligning reads from recent sequencing projects using Solexa/Illumina, 454, and Sanger sequencing technologies. MUMmerGPU is a low cost, ultra-fast sequence alignment program designed to handle the increasing volume of data produced by new, high-throughput sequencing technologies. MUMmerGPU demonstrates that even memory-intensive applications can run significantly faster on the relatively low-cost GPU than on the CPU.
Zhang, Zefeng; Lin, Hao; Li, Ming
Multiple sequence alignment is a classical and challenging task for biological sequence analysis. The problem is NP-hard. The full dynamic programming takes too much time. The progressive alignment heuristics adopted by most state-of-the-art multiple sequence alignment programs suffer from the 'once a gap, always a gap' phenomenon. Is there a radically new way to do multiple sequence alignment? This paper introduces a novel and orthogonal multiple sequence alignment method, using multiple optimized spaced seeds and new algorithms to handle these seeds efficiently. Our new algorithm processes information of all sequences as a whole, avoiding problems caused by the popular progressive approaches. Because the optimized spaced seeds are provably significantly more sensitive than the consecutive k-mers, the new approach promises to be more accurate and reliable. To validate our new approach, we have implemented MANGO: Multiple Alignment with N Gapped Oligos. Experiments were carried out on large 16S RNA benchmarks showing that MANGO compares favorably, in both accuracy and speed, against state-of-the-art multiple sequence alignment methods, including ClustalW 1.83, MUSCLE 3.6, MAFFT 5.861, ProbConsRNA 1.11, Dialign 2.2.1, DIALIGN-T 0.2.1, T-Coffee 4.85, POA 2.0 and Kalign 2.0.
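The idea of a spaced seed can be shown with a small sketch: positions marked `1` in the pattern must match, while positions marked `0` are ignored. The pattern `1101` and the function names are illustrative assumptions; MANGO's optimized multiple seeds and its algorithms for handling them are not reproduced.

```python
from collections import defaultdict


def seed_keys(seq, pattern):
    """Yield (position, key), where key keeps only the bases under '1' in pattern."""
    care = [k for k, c in enumerate(pattern) if c == "1"]
    for pos in range(len(seq) - len(pattern) + 1):
        yield pos, "".join(seq[pos + k] for k in care)


def spaced_seed_hits(a, b, pattern="1101"):
    """Sorted list of (pos_a, pos_b) where a and b agree on all '1' positions."""
    index = defaultdict(list)
    for pos, key in seed_keys(a, pattern):
        index[key].append(pos)
    return sorted((pa, pb)
                  for pb, key in seed_keys(b, pattern)
                  for pa in index.get(key, []))


a, b = "ACGTAC", "ACTTAC"
print(spaced_seed_hits(a, b, "1101"))  # tolerates the mismatch at the '0' position
print(spaced_seed_hits(a, b, "1111"))  # contiguous 4-mer of the same span
```

Here the spaced pattern reports a hit across the single substitution, while the contiguous 4-mer of the same span reports none.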
Gondro, C; Kinghorn, B P
Multiple sequence alignment plays an important role in molecular sequence analysis. An alignment is the arrangement of two (pairwise alignment) or more (multiple alignment) sequences of 'residues' (nucleotides or amino acids) that maximizes the similarities between them. Algorithmically, the problem consists of opening and extending gaps in the sequences to maximize an objective function (measurement of similarity). A simple genetic algorithm was developed and implemented in the software MSA-GA. Genetic algorithms, a class of evolutionary algorithms, are well suited for problems of this nature since residues and gaps are discrete units. An evolutionary algorithm cannot compete in terms of speed with progressive alignment methods, but it has the advantage of being able to correct for initially misaligned sequences, which is not possible with the progressive method. This was shown using the BaliBase benchmark, where Clustal-W alignments were used to seed the initial population in MSA-GA, improving outcome. Alignment scoring functions still constitute an open field of research, and it is important to develop methods that simplify the testing of new functions. A general evolutionary framework for testing and implementing different scoring functions was developed. The results show that a simple genetic algorithm is capable of optimizing an alignment without the need for the excessively complex operators used in prior studies. The clear distinction between objective function and genetic algorithms used in MSA-GA makes extending and/or replacing objective functions a trivial task.
Harris, Simon R.; Otto, Thomas D.; Berriman, Matthew; Parkhill, Julian; McQuillan, Jacqueline A.
So-called next-generation sequencing (NGS) has provided the ability to sequence on a massive scale at low cost, enabling biologists to perform powerful experiments and gain insight into biological processes. BamView has been developed to visualize and analyse sequence reads from NGS platforms, which have been aligned to a reference sequence. It is a desktop application for browsing the aligned or mapped reads [Ruffalo, M, LaFramboise, T, Koyutürk, M. Comparative analysis of algorithms for next-generation sequencing read alignment. Bioinformatics 2011;27:2790–6] at different levels of magnification, from nucleotide level, where the base qualities can be seen, to genome or chromosome level where overall coverage is shown. To enable in-depth investigation of NGS data, various views are provided that can be configured to highlight interesting aspects of the data. Multiple read alignment files can be overlaid to compare results from different experiments, and filters can be applied to facilitate the interpretation of the aligned reads. As well as being a standalone application, it can be used as an integrated part of the Artemis genome browser. BamView allows the user to study NGS data in the context of the sequence and annotation of the reference genome. Single nucleotide polymorphism (SNP) density and candidate SNP sites can be highlighted and investigated, and read-pair information can be used to discover large structural insertions and deletions. The application will also calculate simple analyses of the read mapping, including reporting the read counts and reads per kilobase per million mapped reads (RPKM) for genes selected by the user. Availability: BamView and Artemis are freely available software. These can be downloaded from their home pages: http://bamview.sourceforge.net/; http://www.sanger.ac.uk/resources/software/artemis/. Requirements: Java 1.6 or higher. PMID:22253280
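The RPKM value mentioned above follows directly from its name: reads per kilobase of gene length per million mapped reads. The helper below is a hypothetical illustration of that arithmetic and is not taken from BamView.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    # (reads / (length / 1e3)) / (total / 1e6) simplifies to reads * 1e9 / (length * total)
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# 500 reads on a 2 kb gene in a library of 10 million mapped reads
print(rpkm(500, 2000, 10_000_000))
```

Normalising by both gene length and library size is what makes values comparable between genes and between experiments of different sequencing depth.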
Knudsen, Bjarne; Miyamoto, Michael M
This work presents a novel pairwise statistical alignment method based on an explicit evolutionary model of insertions and deletions (indels). Indel events of any length are possible according to a geometric distribution. The geometric distribution parameter, the indel rate, and the evolutionary time are all maximum likelihood estimated from the sequences being aligned. Probability calculations are done using a pair hidden Markov model (HMM) with transition probabilities calculated from the indel parameters. Equations for the transition probabilities make the pair HMM closely approximate the specified indel model. The method provides an optimal alignment, its likelihood, the likelihood of all possible alignments, and the reliability of individual alignment regions. Human alpha and beta-hemoglobin sequences are aligned, as an illustration of the potential utility of this pair HMM approach.
Nix, David A; Eisen, Michael B
Several problems exist with current methods used to align DNA sequences for comparative sequence analysis. Most dynamic programming algorithms assume that conserved sequence elements are collinear. This assumption appears valid when comparing orthologous protein coding sequences. Functional constraints on proteins provide strong selective pressure against sequence inversions, and minimize sequence duplications and feature shuffling. For non-coding sequences this collinearity assumption is often invalid. For example, enhancers contain clusters of transcription factor binding sites that change in number, orientation, and spacing during evolution yet the enhancer retains its activity. Dot plot analysis is often used to estimate non-coding sequence relatedness. Yet dot plots do not actually align sequences and thus cannot account well for base insertions or deletions. Moreover, they lack an adequate statistical framework for comparing sequence relatedness and are limited to pairwise comparisons. Lastly, dot plots and dynamic programming text outputs fail to provide an intuitive means for visualizing DNA alignments. To address some of these issues, we created a stand alone, platform independent, graphic alignment tool for comparative sequence analysis (GATA http://gata.sourceforge.net/). GATA uses the NCBI-BLASTN program and extensive post-processing to identify all small sub-alignments above a low cut-off score. These are graphed as two shaded boxes, one for each sequence, connected by a line using the coordinate system of their parent sequence. Shading and colour are used to indicate score and orientation. A variety of options exist for querying, modifying and retrieving conserved sequence elements. Extensive gene annotation can be added to both sequences using a standardized General Feature Format (GFF) file. GATA uses the NCBI-BLASTN program in conjunction with post-processing to exhaustively align two DNA sequences. It provides researchers with a fine
Oliveira, Francisco P M; Sousa, Andreia; Santos, Rubim; Tavares, João Manuel R S
This article presents a methodology to align plantar pressure image sequences simultaneously in time and space. The spatial position and orientation of a foot in a sequence are changed to match the foot represented in a second sequence. Simultaneously with the spatial alignment, the temporal scale of the first sequence is transformed with the aim of synchronizing the two input footsteps. Consequently, the spatial correspondence of the foot regions along the sequences as well as the temporal synchronizing is automatically attained, making the study easier and more straightforward. In terms of spatial alignment, the methodology can use one of four possible geometric transformation models: rigid, similarity, affine, or projective. In the temporal alignment, a polynomial transformation up to the 4th degree can be adopted in order to model linear and curved time behaviors. Suitable geometric and temporal transformations are found by minimizing the mean squared error (MSE) between the input sequences. The methodology was tested on a set of real image sequences acquired from a common pedobarographic device. When used in experimental cases generated by applying geometric and temporal control transformations, the methodology revealed high accuracy. In addition, the intra-subject alignment tests from real plantar pressure image sequences showed that the curved temporal models produced better MSE results (P < 0.001) than the linear temporal model. This article represents an important step forward in the alignment of pedobarographic image data, since previous methods can only be applied on static images.
Tong, Jing; Pei, Jimin; Otwinowski, Zbyszek; Grishin, Nick V
Constructing a model of a query protein based on its alignment to a homolog with experimentally determined spatial structure (the template) is still the most reliable approach to structure prediction. Alignment errors are the main bottleneck for homology modeling when the query is distantly related to the template. Alignment methods often misalign secondary structural elements by a few residues. Therefore, better alignment solutions can be found within a limited set of local shifts of secondary structures. We present a refinement method to improve pairwise sequence alignments by evaluating alignment variants generated by local shifts of template-defined secondary structures. Our method SFESA is based on a novel scoring function that combines the profile-based sequence score and the structure score derived from residue contacts in a template. Such a combined score frequently selects a better alignment variant among a set of candidate alignments generated by local shifts and leads to overall increase in alignment accuracy. Evaluation of several benchmarks shows that our refinement method significantly improves alignments made by automatic methods such as PROMALS, HHpred and CNFpred. The web server is available at http://prodata.swmed.edu/sfesa. © 2014 Wiley Periodicals, Inc.
Takeuchi, Toshiki; Yamada, Atsuo; Aoki, Takashi; Nishimura, Kunihiro
Next-generation sequencing can determine DNA bases and the results of sequence alignments are generally stored in files in the Sequence Alignment/Map (SAM) format and the compressed binary version (BAM) of it. SAMtools is a typical tool for dealing with files in the SAM/BAM format. SAMtools has various functions, including detection of variants, visualization of alignments, indexing, extraction of parts of the data and loci, and conversion of file formats. It is written in C and can execute fast. However, SAMtools requires an additional implementation to be used in parallel with, for example, OpenMP (Open Multi-Processing) libraries. For the accumulation of next-generation sequencing data, a simple parallelization program, which can support cloud and PC cluster environments, is required. We have developed cljam using the Clojure programming language, which simplifies parallel programming, to handle SAM/BAM data. Cljam can run in a Java runtime environment (e.g., Windows, Linux, Mac OS X) with Clojure. Cljam can process and analyze SAM/BAM files in parallel and at high speed. The execution time with cljam is almost the same as with SAMtools. The cljam code is written in Clojure and has fewer lines than other similar tools.
Pham, Tuan D; Zuegg, Johannes
Alignment-free sequence comparison methods are still in the early stages of development compared to those of alignment-based sequence analysis. In this paper, we introduce a probabilistic measure of similarity between two biological sequences without alignment. The method is based on the concept of comparing the similarity/dissimilarity between two constructed Markov models. The method was tested against six DNA sequences, which are the thrA, thrB and thrC genes of the threonine operons from Escherichia coli K-12 and from Shigella flexneri; and one random sequence having the same base composition as thrA from E.coli. These results were compared with those obtained from CLUSTAL W algorithm (alignment-based) and the chaos game representation (alignment-free). The method was further tested against a more complex set of 40 DNA sequences and compared with other existing sequence similarity measures (alignment-free). All datasets and computer codes written in MATLAB are available upon request from the first author.
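A minimal sketch of the general idea of comparing two sequences through Markov models built from them is given below. It is an illustration only: the first-order chain, the add-one pseudocount, and the symmetric log-likelihood difference are assumptions chosen for brevity and are not the measure defined by the authors (whose code is in MATLAB). Sequences are assumed to contain only A, C, G and T.

```python
import math

ALPHABET = "ACGT"


def fit_markov(seq: str, pseudocount: float = 1.0) -> dict:
    """Estimate first-order transition probabilities with additive smoothing."""
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for a, b in zip(seq, seq[1:]):
        counts[a][b] += 1
    model = {}
    for a in ALPHABET:
        total = sum(counts[a].values())
        model[a] = {b: counts[a][b] / total for b in ALPHABET}
    return model


def avg_log_likelihood(seq: str, model: dict) -> float:
    """Mean log-probability per transition of seq under the given model."""
    pairs = list(zip(seq, seq[1:]))
    return sum(math.log(model[a][b]) for a, b in pairs) / len(pairs)


def dissimilarity(s1: str, s2: str) -> float:
    """Symmetric score: how much worse each sequence is explained by the other's model."""
    m1, m2 = fit_markov(s1), fit_markov(s2)
    d12 = avg_log_likelihood(s1, m1) - avg_log_likelihood(s1, m2)
    d21 = avg_log_likelihood(s2, m2) - avg_log_likelihood(s2, m1)
    return 0.5 * (d12 + d21)


print(dissimilarity("ACGTACGTACGT", "ACGTACGTACGT"))
print(dissimilarity("ACGTACGTACGT", "AAAAAAAAAAAA"))
```

Identical sequences give a score of zero, while sequences with different transition patterns give a larger value. No alignment is computed at any point.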
Cutello, Vincenzo; Nicosia, Giuseppe; Pavone, Mario; Prizzi, Igor
This article presents an immune inspired algorithm to tackle the Multiple Sequence Alignment (MSA) problem. MSA is one of the most important tasks in biological sequence analysis. Although this paper focuses on protein alignments, most of the discussion and methodology may also be applied to DNA alignments. The problem of finding the multiple alignment was investigated in the study by Bonizzoni and Vedova and Wang and Jiang, and proved to be an NP-hard (non-deterministic polynomial-time hard) problem. The presented algorithm, called Immunological Multiple Sequence Alignment Algorithm (IMSA), incorporates two new strategies to create the initial population and specific ad hoc mutation operators. It uses the 'weighted sum of pairs' as the objective function to evaluate a given candidate alignment. IMSA was tested using classical benchmarks of BAliBASE (versions 1.0, 2.0 and 3.0), and experimental results indicate that it is comparable with state-of-the-art multiple alignment algorithms, in terms of quality of alignments, weighted Sums-of-Pairs (SP) and Column Score (CS) values. The main novelty of IMSA is its ability to generate more than a single suboptimal alignment for every MSA instance; this behaviour is due to the stochastic nature of the algorithm and of the populations evolved during the convergence process. This feature will help the decision maker to assess and select a biologically relevant multiple sequence alignment. Finally, the designed algorithm can be used as a local search procedure to properly explore promising alignments of the search space.
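The 'weighted sum of pairs' objective named in the abstract above can be sketched as follows. This is a simplified illustration, not IMSA's implementation: the match, mismatch and gap scores and the optional pair weights are assumptions. Practical protein scoring would normally use a substitution matrix and affine gap penalties.

```python
from itertools import combinations


def pair_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Score one aligned pair of symbols; a gap-gap pair is ignored."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def weighted_sum_of_pairs(alignment, weights=None) -> float:
    """Sum column scores over every pair of rows, each pair scaled by its weight.

    `weights` maps (i, j) row-index pairs with i < j to a float; missing pairs
    default to 1.0.
    """
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    weights = weights or {}
    total = 0.0
    for i, j in combinations(range(len(alignment)), 2):
        w = weights.get((i, j), 1.0)
        total += w * sum(pair_score(a, b) for a, b in zip(alignment[i], alignment[j]))
    return total


candidate = ["AC-T", "ACGT", "A-GT"]
print(weighted_sum_of_pairs(candidate))
print(weighted_sum_of_pairs(candidate, weights={(0, 2): 0.5}))
```

A search heuristic can call such a function on each candidate alignment and keep the candidates with the highest score.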
Van Walle, Ivo; Lasters, Ignace; Wyns, Lode
Multiple alignment of highly divergent sequences is a challenging problem for which available programs tend to show poor performance. Generally, this is due to a scoring function that does not describe biological reality accurately enough or a heuristic that cannot explore solution space efficiently enough. In this respect, we present a new program, Align-m, that uses a non-progressive local approach to guide a global alignment. Two large test sets were used that represent the entire SCOP classification and cover sequence similarities between 0 and 50% identity. Performance was compared with the publicly available algorithms ClustalW, T-Coffee and DiAlign. In general, Align-m has comparable or slightly higher accuracy in terms of correctly aligned residues, especially for distantly related sequences. Importantly, it aligns far fewer residues incorrectly, with average differences of over 15% compared with some of the other algorithms. Align-m and the test sets are available at http://bioinformatics.vub.ac.be.
Dowell, Robin D; Eddy, Sean R
Background We are interested in the problem of predicting secondary structure for small sets of homologous RNAs, by incorporating limited comparative sequence information into an RNA folding model. The Sankoff algorithm for simultaneous RNA folding and alignment is a basis for approaches to this problem. There are two open problems in applying a Sankoff algorithm: development of a good unified scoring system for alignment and folding and development of practical heuristics for dealing with the computational complexity of the algorithm. Results We use probabilistic models (pair stochastic context-free grammars, pairSCFGs) as a unifying framework for scoring pairwise alignment and folding. A constrained version of the pairSCFG structural alignment algorithm was developed which assumes knowledge of a few confidently aligned positions (pins). These pins are selected based on the posterior probabilities of a probabilistic pairwise sequence alignment. Conclusion Pairwise RNA structural alignment improves on structure prediction accuracy relative to single sequence folding. Constraining on alignment is a straightforward method of reducing the runtime and memory requirements of the algorithm. Five practical implementations of the pairwise Sankoff algorithm – this work (Consan), David Mathews' Dynalign, Ian Holmes' Stemloc, Ivo Hofacker's PMcomp, and Jan Gorodkin's FOLDALIGN – have comparable overall performance with different strengths and weaknesses. PMID:16952317
Bauer, Markus; Klau, Gunnar W; Reinert, Knut
The discovery of functional non-coding RNA sequences has led to an increasing interest in algorithms related to RNA analysis. Traditional sequence alignment algorithms, however, fail at computing reliable alignments of low-homology RNA sequences. The spatial conformation of RNA sequences largely determines their function, and therefore RNA alignment algorithms have to take structural information into account. We present a graph-based representation for sequence-structure alignments, which we model as an integer linear program (ILP). We sketch how we compute an optimal or near-optimal solution to the ILP using methods from combinatorial optimization, and present results on a recently published benchmark set for RNA alignments. The implementation of our algorithm yields better alignments in terms of two published scores than the other programs that we tested: This is especially the case with an increasing number of input sequences. Our program LARA is freely available for academic purposes from http://www.planet-lisa.net.
Simmons, Mark P; Müller, Kai F; Norton, Andrew P
We used random sequences to determine which alignment methods are most susceptible to aligning sequences so as to create artifactual resolution and branch support in phylogenetic trees derived from those alignments. We compared four alignment methods (progressive pairwise alignment, simultaneous multiple alignment of sequence fragments, local pairwise alignment, and direct optimization) to determine which methods are most susceptible to creating false positives in phylogenetic trees. Implied alignments created using direct optimization provided more artifactual support than progressive pairwise alignment methods, which in turn generally provided more artifactual support than simultaneous and local alignment methods. Artifactual support derived from base pairs was generally reinforced by the incorporation of gap characters for progressive pairwise alignment, local pairwise alignment, and implied alignments. The amount of artifactual resolution and support was generally greater for simulated nucleotide sequences than for simulated amino acid sequences. In the context of direct optimization, the differences between static and dynamic approaches to calculating support were extreme, ranging from maximal to nearly minimal support. When applied to highly divergent sequences, it is important that dynamic, rather than static, characters be used whenever calculating branch support using direct optimization. In contrast to the tree-based approaches to alignment, simultaneous alignment of sequences using the similarity criterion generally does not create alignments that are biased in favor of any particular tree topology. Copyright © 2010 Elsevier Inc. All rights reserved.
Rapidly evolving sequencing technologies produce data on an unparalleled scale. A central challenge to the analysis of this data is sequence alignment, whereby sequence reads must be compared to a reference. A wide variety of alignment algorithms and software have been subsequently developed over the past two years. In this article, we will systematically review the current development of these algorithms and introduce their practical applications on different types of experimental data. We come to the conclusion that short-read alignment is no longer the bottleneck of data analyses. We also consider future development of alignment algorithms with respect to emerging long sequence reads and the prospect of cloud computing. PMID:20460430
Zambrano-Vega, Cristian; Nebro, Antonio J; García-Nieto, José; Aldana-Montes, José F
Multiple sequence alignment (MSA) is an NP-complete optimization problem found in computational biology, where the time complexity of finding an optimal alignment rises exponentially with the number of sequences and their lengths. Additionally, to assess the quality of an MSA, a number of objectives can be taken into account, such as maximizing the sum-of-pairs, maximizing the totally conserved columns, minimizing the number of gaps, or maximizing structural information based scores such as STRIKE. An approach to deal with MSA problems is to use multi-objective metaheuristics, which are non-exact stochastic optimization methods that can produce high quality solutions to complex problems having two or more objectives to be optimized at the same time. Our motivation is to provide a multi-objective metaheuristic for MSA that can run in parallel, taking advantage of multi-core-based computers. The software tool we propose, called M2Align (Multi-objective Multiple Sequence Alignment), is a parallel and more efficient version of the three-objective optimizer for sequence alignments MO-SAStrE, capable of reducing the algorithm computing time by exploiting the computing capabilities of common multi-core CPU clusters. Our performance evaluation over datasets of the benchmark BAliBASE (v3.0) shows that significant time reductions can be achieved by using up to 20 cores. Even in sequential executions, M2Align is faster than MO-SAStrE, thanks to the encoding method used for the alignments. M2Align is an open source project hosted on GitHub, where the source code and sample datasets can be freely obtained: https://github.com/KhaosResearch/M2Align. Supplementary data are available at Bioinformatics online.
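Two of the objectives listed in the abstract above (totally conserved columns and number of gaps) and the notion of comparing candidates on several objectives at once can be sketched as below. The function names and the toy alignments are hypothetical. The Pareto-dominance test is the standard multi-objective comparison and is shown here as an assumption about how candidates might be ranked, not as M2Align's actual code.

```python
def totally_conserved_columns(alignment) -> int:
    """Count columns where every row holds the same residue and no gap."""
    return sum(
        1 for col in zip(*alignment)
        if "-" not in col and len(set(col)) == 1
    )


def gap_count(alignment) -> int:
    """Total number of gap characters in the alignment."""
    return sum(row.count("-") for row in alignment)


def objectives(alignment) -> tuple:
    """Objective vector where larger is better (gaps are negated)."""
    return (totally_conserved_columns(alignment), -gap_count(alignment))


def dominates(a: tuple, b: tuple) -> bool:
    """Pareto dominance: a is no worse on every objective and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


# Two candidate alignments of the same three toy sequences.
aln1 = ["ACGT-", "ACGTA", "AC-TA"]
aln2 = ["ACG-T", "ACGTA", "A-CTA"]
print(objectives(aln1), objectives(aln2))
print(dominates(objectives(aln1), objectives(aln2)))
```

A multi-objective metaheuristic keeps the set of candidates that no other candidate dominates, rather than collapsing all objectives into a single score.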
Rivas, Elena; Eddy, Sean R
Inference of sequence homology is inherently an evolutionary question, dependent upon evolutionary divergence. However, the insertion and deletion penalties in the most widely used methods for inferring homology by sequence alignment, including BLAST and profile hidden Markov models (profile HMMs), are not based on any explicitly time-dependent evolutionary model. Using one fixed score system (BLOSUM62 with some gap open/extend costs, for example) corresponds to making an unrealistic assumption that all sequence relationships have diverged by the same time. Adoption of explicit time-dependent evolutionary models for scoring insertions and deletions in sequence alignments has been hindered by algorithmic complexity and technical difficulty. We identify and implement several probabilistic evolutionary models compatible with the affine-cost insertion/deletion model used in standard pairwise sequence alignment. Assuming an affine gap cost imposes important restrictions on the realism of the evolutionary models compatible with it, as single insertion events with geometrically distributed lengths do not result in geometrically distributed insert lengths at finite times. Nevertheless, we identify one evolutionary model compatible with symmetric pair HMMs that are the basis for Smith-Waterman pairwise alignment, and two evolutionary models compatible with standard profile-based alignment. We test different aspects of the performance of these "optimized branch length" models, including alignment accuracy and homology coverage (discrimination of residues in a homologous region from nonhomologous flanking residues). We test on benchmarks of both global homologies (full length sequence homologs) and local homologies (homologous subsequences embedded in nonhomologous sequence). Contrary to our expectations, we find that for global homologies a single long branch parameterization suffices both for distant and close homologous relationships. In contrast, we do see an advantage in
Hossain, K S M Tozammel; Patnaik, Debprakash; Laxman, Srivatsan; Jain, Prateek; Bailey-Kellogg, Chris; Ramakrishnan, Naren
We present alignment refinement by mining coupled residues (ARMiCoRe), a novel approach to a classical bioinformatics problem, viz., multiple sequence alignment (MSA) of gene and protein sequences. Aligning multiple biological sequences is a key step in elucidating evolutionary relationships, annotating newly sequenced segments, and understanding the relationship between biological sequences and functions. Classical MSA algorithms are designed to primarily capture conservations in sequences whereas couplings, or correlated mutations, are well known as an additional important aspect of sequence evolution. (Two sequence positions are coupled when mutations in one are accompanied by compensatory mutations in another). As a result, better exposition of couplings is sometimes one of the reasons for hand-tweaking of MSAs by practitioners. ARMiCoRe introduces a distinctly pattern mining approach to improving MSAs: using frequent episode mining as a foundational basis, we define the notion of a coupled pattern and demonstrate how the discovery and tiling of coupled patterns using a max-flow approach can yield MSAs that are better than conservation-based alignments. Although we were motivated to improve MSAs for the sake of better exposing couplings, we demonstrate that our MSAs are also improvements in terms of traditional metrics of assessment. We demonstrate the effectiveness of ARMiCoRe on a large collection of data sets.
Mirarab, Siavash; Nguyen, Nam; Guo, Sheng; Wang, Li-San; Kim, Junhyong; Warnow, Tandy
We introduce PASTA, a new multiple sequence alignment algorithm. PASTA uses a new technique to produce an alignment given a guide tree that enables it to be both highly scalable and very accurate. We present a study on biological and simulated data with up to 200,000 sequences, showing that PASTA produces highly accurate alignments, improving on the accuracy and scalability of the leading alignment methods (including SATé). We also show that trees estimated on PASTA alignments are highly accurate--slightly better than SATé trees, but with substantial improvements relative to other methods. Finally, PASTA is faster than SATé, highly parallelizable, and requires relatively little memory.
Pervez, Muhammad Tariq; Babar, Masroor Ellahi; Nadeem, Asif; Aslam, Muhammad; Awan, Ali Raza; Aslam, Naeem; Hussain, Tanveer; Naveed, Nasir; Qadri, Salman; Waheed, Usman; Shoaib, Muhammad
A comparison of the 10 most popular Multiple Sequence Alignment (MSA) tools, namely, MUSCLE, MAFFT (L-INS-i), MAFFT (FFT-NS-2), T-Coffee, ProbCons, SATe, Clustal Omega, Kalign, Multalin, and Dialign-TX, is presented. We also focused on the significance of some implementations embedded in the algorithm of each tool. Based on 10 simulated trees of different numbers of taxa generated by R, 400 known alignments and sequence files were constructed using indel-Seq-Gen. A total of 4000 test alignments were generated to study the effect of sequence length, indel size, deletion rate, and insertion rate. Results showed that alignment quality was highly dependent on the number of deletions and insertions in the sequences and that the sequence length and indel size had a weaker effect. Overall, ProbCons was consistently at the top of the list of the evaluated MSA tools. SATe, while slightly less accurate, was 529.10% faster than ProbCons and 236.72% faster than MAFFT (L-INS-i). Among other tools, Kalign and MUSCLE achieved the highest sum of pairs. We also considered BAliBASE benchmark datasets, and the results relative to BAliBASE- and indel-Seq-Gen-generated alignments were consistent in most cases. PMID:25574120
Dessimoz, Christophe; Gil, Manuel
The estimation of a distance between two biological sequences is a fundamental process in molecular evolution. It is usually performed by maximum likelihood (ML) on characters aligned either pairwise or jointly in a multiple sequence alignment (MSA). Estimators for the covariance of pairs from an MSA are known, but we are not aware of any solution for cases of pairs aligned independently. In large-scale analyses, it may be too costly to compute MSAs every time distances must be compared, and therefore a covariance estimator for distances estimated from pairs aligned independently is desirable. Knowledge of covariances improves any process that compares or combines distances, such as in generalized least-squares phylogenetic tree building, orthology inference, or lateral gene transfer detection. In this paper, we introduce an estimator for the covariance of distances from sequences aligned pairwise. Its performance is analyzed through extensive Monte Carlo simulations, and compared to the well-known variance estimator of ML distances. Our covariance estimator can be used together with the ML variance estimator to form covariance matrices. The estimator performs similarly to the ML variance estimator. In particular, it shows no sign of bias when sequence divergence is below 150 PAM units (i.e. above ~29% expected sequence identity). Above that distance, the covariances tend to be underestimated, but then ML variances are also underestimated.
Thiele, R.; Zimmer, R.; Lengauer, T.
We propose a new alignment procedure that is capable of aligning protein sequences and structures in a unified manner. Recursive dynamic programming (RDP) is a hierarchical method which, on each level of the hierarchy, identifies locally optimal solutions and assembles them into partial alignments of sequences and/or structures. In contrast to classical dynamic programming, RDP can also handle alignment problems that use objective functions not obeying the principle of prefix optimality, e.g. scoring schemes derived from energy potentials of mean force. For such alignment problems, RDP aims at computing solutions that are near-optimal with respect to the involved cost function and biologically meaningful at the same time. Towards this goal, RDP maintains a dynamic balance between different factors governing alignment fitness such as evolutionary relationships and structural preferences. As in the RDP method gaps are not scored explicitly, the problematic assignment of gap cost parameters is circumvented. In order to evaluate the RDP approach we analyse whether known and accepted multiple alignments based on structural information can be reproduced with the RDP method.
There is a need for faster and more sensitive algorithms for sequence similarity searching in view of the rapidly increasing amounts of genomic sequence data available. Parallel processing capabilities in the form of the single instruction, multiple data (SIMD) technology are now available in common microprocessors and enable a single microprocessor to perform many operations in parallel. The ParAlign algorithm has been specifically designed to take advantage of this technology. The new algorithm initially exploits parallelism to perform a very rapid computation of the exact optimal ungapped alignment score for all diagonals in the alignment matrix. Then, a novel heuristic is employed to compute an approximate score of a gapped alignment by combining the scores of several diagonals. This approximate score is used to select the most interesting database sequences for a subsequent Smith-Waterman alignment, which is also parallelised. The resulting method represents a substantial improvement compared to existing heuristics. The sensitivity and specificity of ParAlign was found to be as good as Smith-Waterman implementations when the same method for computing the statistical significance of the matches was used. In terms of speed, only the significantly less sensitive NCBI BLAST 2 program was found to outperform the new approach. Online searches are available at http://dna.uio.no/search/
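The first stage described in the abstract above, computing the optimal ungapped alignment score on every diagonal of the alignment matrix, can be sketched in scalar form as follows. This is not the SIMD implementation used by ParAlign: the match and mismatch scores are assumptions standing in for a substitution matrix, and the loop processes one cell at a time where the real algorithm processes many in parallel.

```python
def diagonal_scores(query: str, target: str, match: int = 2, mismatch: int = -1) -> dict:
    """Best local ungapped score on each diagonal d = j - i.

    Along a diagonal the best ungapped segment is a maximum-sum run of
    per-cell scores, found with a running sum that resets at zero.
    """
    m, n = len(query), len(target)
    best = {}
    for d in range(-(m - 1), n):
        i = max(0, -d)   # first query index on this diagonal
        j = i + d        # matching target index
        run = top = 0
        while i < m and j < n:
            score = match if query[i] == target[j] else mismatch
            run = max(0, run + score)
            top = max(top, run)
            i += 1
            j += 1
        best[d] = top
    return best


scores = diagonal_scores("GATTACA", "TTAC")
best_diagonal = max(scores, key=scores.get)
print(best_diagonal, scores[best_diagonal])
```

Because each diagonal is independent of the others, the per-diagonal computation is a natural fit for data-parallel hardware.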
We develop a novel and general approach to estimating the accuracy of multiple sequence alignments without knowledge of a reference alignment, and use our approach to address a new task that we call parameter advising: the problem of choosing values for alignment scoring function parameters from a given set of choices to maximize the accuracy of a computed alignment. For protein alignments, we consider twelve independent features that contribute to a quality alignment. An accuracy estimator is learned that is a polynomial function of these features; its coefficients are determined by minimizing its error with respect to true accuracy using mathematical optimization. Compared to prior approaches for estimating accuracy, our new approach (a) introduces novel feature functions that measure nonlocal properties of an alignment yet are fast to evaluate, (b) considers more general classes of estimators beyond linear combinations of features, and (c) develops new regression formulations for learning an estimator from examples; in addition, for parameter advising, we (d) determine the optimal parameter set of a given cardinality, which specifies the best parameter values from which to choose. Our estimator, which we call Facet (for “feature-based accuracy estimator”), yields a parameter advisor that on the hardest benchmarks provides more than a 27% improvement in accuracy over the best default parameter choice, and for parameter advising significantly outperforms the best prior approaches to assessing alignment quality. PMID:23489379
Bodenhofer, Ulrich; Bonatesta, Enrico; Horejš-Kainrath, Christoph; Hochreiter, Sepp
Although the R platform and the add-on packages of the Bioconductor project are widely used in bioinformatics, the standard task of multiple sequence alignment has been neglected so far. The msa package, for the first time, provides a unified R interface to the popular multiple sequence alignment algorithms ClustalW, ClustalOmega and MUSCLE. The package requires no additional software and runs on all major platforms. Moreover, the msa package provides an R interface to the powerful LaTeX package TeXshade, which allows for flexible and customizable plotting of multiple sequence alignments. msa is available via the Bioconductor project: http://bioconductor.org/packages/release/bioc/html/msa.html. Further information and the R code of the example presented in this paper are available at http://www.bioinf.jku.at/software/msa/. © The Author 2015. Published by Oxford University Press. All rights reserved.
Roozgard, Aminmohammad; Barzigar, Nafise; Wang, Shuang; Jiang, Xiaoqian; Ohno-Machado, Lucila; Cheng, Samuel
Advances in DNA information extraction techniques have led to huge sequenced genomes from organisms spanning the tree of life. This increasing amount of genomic information requires tools for comparison of the nucleotide sequences. In this paper, we propose a novel nucleotide sequence alignment method based on sparse coding and belief propagation to compare the similarity of the nucleotide sequences. We used the neighbors of each nucleotide as features, and then we employed sparse coding to find a set of candidate nucleotides. To select optimum matches, belief propagation was subsequently applied to these candidate nucleotides. Experimental results show that the proposed approach is able to robustly align nucleotide sequences and is competitive with SOAPaligner and BWA.
Danudibroto, Adriyana; Bersvendsen, Jørn; Mirea, Oana; Gerard, Olivier; D'hooge, Jan; Samset, Eigil
Temporal alignment of echocardiographic sequences enables fair comparisons of multiple cardiac sequences by showing corresponding frames at given time points in the cardiac cycle. It is also essential for spatial registration of echo volumes where several acquisitions are combined for enhancement of image quality or forming a larger field of view. In this study, three different image-based temporal alignment methods were investigated: first, a method based on dynamic time warping (DTW); second, a spline-based method that optimized the similarity between temporal characteristic curves of the cardiac cycle using 1D cubic B-spline interpolation; and third, a variant of the spline-based method with piecewise modification. These methods were tested on in-vivo data sets of 19 echo sequences. For each sequence, the mitral valve opening (MVO) time was manually annotated. The results showed that the average MVO timing error for all methods is well under the time resolution of the sequences.
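Dynamic time warping, the first of the three methods named above, can be sketched with the classic dynamic-programming recurrence. The two input curves are hypothetical one-dimensional stand-ins for per-frame characteristic values. How the study derives such curves from echo images is not specified here, and the absolute-difference cost is an assumption.

```python
def dtw(a, b):
    """Return the DTW cost and the warping path between two 1D sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


# Hypothetical curves: the second holds its first value for one extra frame.
curve_a = [0, 1, 2, 1, 0]
curve_b = [0, 0, 1, 2, 1, 0]
total_cost, frame_pairs = dtw(curve_a, curve_b)
print(total_cost)
print(frame_pairs)
```

Each pair in the returned path matches a frame index of the first sequence to a frame index of the second, which is the frame correspondence a temporal alignment needs.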
Nguyen, Hung Dinh; Yoshihara, Ikuo; Yamamori, Kunihito; Yasunaga, Moritoshi
This paper presents a parallel hybrid genetic algorithm (GA) for solving the sum-of-pairs multiple protein sequence alignment. A new chromosome representation and its corresponding genetic operators are proposed. A multi-population GENITOR-type GA is combined with local search heuristics. It is then extended to run in parallel on a multiprocessor system for speeding up. Experimental results of benchmarks from the BAliBASE show that the proposed method is superior to MSA, OMA, and SAGA methods with regard to quality of solution and running time. It can be used for finding multiple sequence alignment as well as testing cost functions.
Vries, John K; Munshi, Rajan; Tobi, Dror; Klein-Seetharaman, Judith; Benos, Panayiotis V; Bahar, Ivet
Annotation of the rapidly accumulating body of sequence data relies heavily on the detection of remote homologues and functional motifs in protein families. The most popular methods rely on sequence alignment. These include programs that use a scoring matrix to compare the probability of a potential alignment with random chance and programs that use curated multiple alignments to train profile hidden Markov models (HMMs). Related approaches depend on bootstrapping multiple alignments from a single sequence. However, alignment-based programs have limitations. They make the assumption that contiguity is conserved between homologous segments, which may not be true in genetic recombination or horizontal transfer. Alignments also become ambiguous when sequence similarity drops below 40%. This has kindled interest in classification methods that do not rely on alignment. An approach to classification without alignment based on the distribution of contiguous sequences of four amino acids (4-grams) was developed. Interest in 4-grams stemmed from the observation that almost all theoretically possible 4-grams ($20^4$) occur in natural sequences and the majority of 4-grams are uniformly distributed. This implies that the probability of finding identical 4-grams by random chance in unrelated sequences is low. A Bayesian probabilistic model was developed to test this hypothesis. For each protein family in Pfam-A and PIR-PSD, a feature vector called a probe was constructed from the set of 4-grams that best characterised the family. In rigorous jackknife tests, unknown sequences from Pfam-A and PIR-PSD were compared with the probes for each family. A classification result was deemed a true positive if the probe match with the highest probability was in first place in a rank-ordered list. This was achieved in 70% of cases. Analysis of false positives suggested that the precision might approach 85% if selected families were clustered into subsets. Case studies indicated that the 4
Setyorini; Kuspriyanto; Widyantoro, D. H.; Pancoro, A.
Dynamic programming (DP) remains the central algorithm of biological sequence alignment. Matching score computation is its most time-consuming step. Bit-parallelism is an approximate string matching technique that transforms the cell-by-cell processing of the DP matrix into word-level processing (groups of cells). Bit-parallelism computes the scores column-wise. Adopted from the way computer systems process machine words, this technique promises to reduce the time spent computing scores in the DP matrix. In this paper, we implement the bit-parallelism technique for DNA sequence alignment. Our bit-parallel implementation takes less time for score computation but still needs improvement in the reconstruction process.
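To make the idea of word-level processing concrete, here is a small sketch of a well-known bit-parallel recurrence for the length of the longest common subsequence, which updates a whole DP column with a handful of integer operations. It is a stand-in technique chosen for illustration, not the scoring scheme of the paper; the function name is hypothetical, and Python's arbitrary-precision integers play the role of machine words.

```python
def lcs_length_bitparallel(a, b):
    """Length of the longest common subsequence of a and b.

    One bit per character of `a` encodes a DP column; each character of `b`
    updates the entire column with an addition, a subtraction and an OR.
    """
    m = len(a)
    if m == 0:
        return 0
    mask = (1 << m) - 1

    # match_bits[c] has bit i set when a[i] == c
    match_bits = {}
    for i, c in enumerate(a):
        match_bits[c] = match_bits.get(c, 0) | (1 << i)

    v = mask  # all ones: no matches found yet
    for c in b:
        u = v & match_bits.get(c, 0)
        # (v + u) propagates matches through carries; (v - u) keeps v & ~matches
        v = ((v + u) | (v - u)) & mask

    # Each zero bit in v marks one increment of the LCS length.
    return m - bin(v).count("1")
```

For example, `lcs_length_bitparallel("ABCBDAB", "BDCABA")` yields 4. Note that, like the implementation described above, this sketch computes only the score; recovering the alignment itself needs additional bookkeeping.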
Hong, Changjin; Tewfik, Ahmed H
Recomputation of the previously evaluated similarity results between biological sequences becomes inevitable when researchers realize errors in their sequenced data or when the researchers have to compare nearly similar sequences, e.g., in a family of proteins. We present an efficient scheme for updating local sequence alignments with an affine gap model. In principle, using the previous matching result between two amino acid sequences, we perform a forward-backward alignment to generate heuristic searching bands which are bounded by a set of suboptimal paths. Given a correctly updated sequence, we initially predict a new score of the alignment path for each contour to select the best candidates among them. Then, we run the Smith-Waterman algorithm in this confined space. Furthermore, our heuristic alignment for an updated sequence shows that it can be further accelerated by using reusable dynamic programming (rDP), our prior work. In this study, we successfully validate "relative node tolerance bound" (RNTB) in the pruned searching space. Furthermore, we improve the computational performance by quantifying the successful RNTB tolerance probability and switch to rDP on perturbation-resilient columns only. In our searching space derived by a threshold value of 90 percent of the optimal alignment score, we find that 98.3 percent of contours contain correctly updated paths. We also find that our method consumes only 25.36 percent of the runtime cost of the sparse dynamic programming (sDP) method, and only 2.55 percent of that of a normal dynamic programming with the Smith-Waterman algorithm.
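For reference, the baseline that the scheme above is compared against can be sketched as a plain Smith-Waterman local alignment score with affine gaps. This is the full, unbanded dynamic program and does not reproduce the authors' band generation, RNTB or rDP; the function name and the scoring parameters (match, mismatch, gap open, gap extend) are illustrative assumptions.

```python
def smith_waterman_affine(a, b, match=2, mismatch=-1, gap_open=3, gap_extend=1):
    """Best local alignment score of a and b under an affine gap model.

    A gap of length k costs gap_open + (k - 1) * gap_extend.
    H = best score ending at (i, j); E / F = best score ending in a gap
    that consumes a character of b / of a respectively.
    """
    neg = float("-inf")
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    E = [[neg] * (m + 1) for _ in range(n + 1)]
    F = [[neg] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Extend an existing gap or open a new one.
            E[i][j] = max(E[i][j - 1] - gap_extend, H[i][j - 1] - gap_open)
            F[i][j] = max(F[i - 1][j] - gap_extend, H[i - 1][j] - gap_open)
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets a local alignment restart anywhere.
            H[i][j] = max(0, H[i - 1][j - 1] + s, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best
```

Every cell of the three matrices is filled here, which is exactly the cost that restricting the computation to a confined search space is meant to avoid.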
Zhang, Xu; Kahveci, Tamer
We consider the problem of multiple alignment of protein sequences with the goal of achieving a large SP (Sum-of-Pairs) score. We introduce a new graph-based method. We name our method QOMA (Quasi-Optimal Multiple Alignment). QOMA starts with an initial alignment. It represents this alignment using a K-partite graph. It then improves the SP score of the initial alignment through local optimizations within a window that moves greedily on the alignment. QOMA uses two parameters to permit flexibility in time/accuracy trade off: (1) The size of the window for local optimization. (2) The sparsity of the K-partite graph. Unlike traditional progressive methods, QOMA is independent of the order of sequences. The experimental results on BAliBASE benchmarks show that QOMA produces higher SP score than the existing tools including ClustalW, Probcons, Muscle, T-Coffee and DCA. The difference is more significant for distant proteins. The software is available from the authors upon request.
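The sum-of-pairs objective that QOMA optimizes can be written down in a few lines. The sketch below scores an existing alignment only; it does not implement QOMA's K-partite graph or windowed optimization. The function name and the simple match/mismatch/gap scores are assumptions, whereas real evaluations typically use a substitution matrix.

```python
from itertools import combinations


def sp_score(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of a multiple alignment given as equal-length rows.

    Every pair of rows is scored column by column; a column where both rows
    have a gap ('-') contributes nothing.
    """
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all aligned rows must have the same length")
    total = 0
    for row_a, row_b in combinations(alignment, 2):
        for x, y in zip(row_a, row_b):
            if x == "-" and y == "-":
                continue
            if x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total
```

Because the score sums over all row pairs, its value does not depend on the order in which the sequences are listed.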
Lu, Yang Young; Tang, Kujin; Ren, Jie; Fuhrman, Jed A; Waterman, Michael S; Sun, Fengzhu
Alignment-free genome and metagenome comparisons are increasingly important with the development of next generation sequencing (NGS) technologies. Recently developed state-of-the-art k-mer based alignment-free dissimilarity measures including CVTree, $d_2^*$ and $d_2^S$ are more computationally expensive than measures based solely on the k-mer frequencies. Here, we report a standalone software, aCcelerated Alignment-FrEe sequence analysis (CAFE), for efficient calculation of 28 alignment-free dissimilarity measures. CAFE allows for both assembled genome sequences and unassembled NGS shotgun reads as input, and wraps the output in a standard PHYLIP format. In downstream analyses, CAFE can also be used to visualize the pairwise dissimilarity measures, including dendrograms, heatmap, principal coordinate analysis and network display. CAFE serves as a general k-mer based alignment-free analysis platform for studying the relationships among genomes and metagenomes, and is freely available at https://github.com/younglululu/CAFE. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
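A frequency-only k-mer measure of the kind mentioned above can be sketched as follows. The normalization used here, half of one minus the cosine of the two k-mer count vectors, is an assumption made for illustration and should not be read as CAFE's exact definition of any of its 28 measures; the function names are hypothetical and the code is unrelated to CAFE's own interface.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count all overlapping k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_kmer_dissimilarity(seq_a, seq_b, k=3):
    """Dissimilarity in [0, 0.5] from the k-mer count vectors of two sequences.

    0 means identical k-mer profiles; 0.5 means no k-mer is shared.
    """
    ca, cb = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    dot = sum(count * cb[word] for word, count in ca.items())
    norm_a = math.sqrt(sum(c * c for c in ca.values()))
    norm_b = math.sqrt(sum(c * c for c in cb.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("sequences must be at least k characters long")
    return 0.5 * (1.0 - dot / (norm_a * norm_b))
```

Measures such as $d_2^*$ and $d_2^S$ additionally correct the counts with a background model of expected k-mer frequencies, which is the source of their extra computational cost.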
Waldmann, Jost; Gerken, Jan; Hankeln, Wolfgang; Schweer, Timmy; Glöckner, Frank Oliver
Advances in sequencing technologies challenge the efficient importing and validation of FASTA formatted sequence data, which is still a prerequisite for most bioinformatic tools and pipelines. Comparative analysis of commonly used Bio*-frameworks (BioPerl, BioJava and Biopython) shows that their scalability and accuracy are hampered. FastaValidator represents a platform-independent, standardized, light-weight software library written in the Java programming language. It targets computer scientists and bioinformaticians writing software which needs to parse large amounts of sequence data quickly and accurately. For end-users FastaValidator includes an interactive out-of-the-box validation of FASTA formatted files, as well as a non-interactive mode designed for high-throughput validation in software pipelines. The accuracy and performance of the FastaValidator library qualify it for large data sets such as those commonly produced by massively parallel sequencing (NGS) technologies. It offers scientists a fast, accurate and standardized method for parsing and validating FASTA formatted sequence data.
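The kind of check such a library performs can be illustrated with a small line-by-line validator. This sketch is written in Python and is not the FastaValidator API, which is a Java library; the function name, the accepted alphabet (letters, `*` and `-`) and the specific rules are assumptions chosen to keep the example short.

```python
import re

# Assumed alphabet for this sketch: letters plus '*' (stop) and '-' (gap).
SEQUENCE_LINE = re.compile(r"^[A-Za-z*\-]+$")


def validate_fasta(lines):
    """Return a list of (line_number, message) problems found in FASTA text.

    Streams over the lines once, so memory use does not grow with file size.
    """
    errors = []
    header_line = None   # line number of the current record's header
    has_sequence = False
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines are tolerated in this sketch
        if line.startswith(">"):
            if header_line is not None and not has_sequence:
                errors.append((header_line, "header without sequence"))
            if not line[1:].strip():
                errors.append((number, "empty header"))
            header_line, has_sequence = number, False
        elif header_line is None:
            errors.append((number, "sequence data before first header"))
        else:
            has_sequence = True
            if not SEQUENCE_LINE.match(line):
                errors.append((number, "invalid character in sequence"))
    if header_line is not None and not has_sequence:
        errors.append((header_line, "header without sequence"))
    return errors
```

Passing an open file object works as well as a list of strings, for example `validate_fasta(open(path))`, where `path` is whatever file the caller wants to check.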
Sudha Sadasivam, G; Baktavatchalam, G
Multiple alignment of protein sequences helps to determine evolutionary linkage and to predict molecular structures. The factors to be considered while aligning multiple sequences are speed and accuracy of alignment. Although dynamic programming algorithms produce accurate alignments, they are computationally intensive. In this paper we propose a time-efficient approach to sequence alignment that also produces quality alignments. The dynamic nature of the algorithm coupled with data and computational parallelism of Hadoop data grids improves the accuracy and speed of sequence alignment. The principle of block splitting in Hadoop coupled with its scalability facilitates alignment of very large sequences.
Naznin, Farhana; Sarker, Ruhul; Essam, Daryl
Background Many Bioinformatics studies begin with a multiple sequence alignment as the foundation for their research. This is because multiple sequence alignment can be a useful technique for studying molecular evolution and analyzing sequence structure relationships. Results In this paper, we have proposed a Vertical Decomposition with Genetic Algorithm (VDGA) for Multiple Sequence Alignment (MSA). In VDGA, we divide the sequences vertically into two or more subsequences, and then solve them individually using a guide tree approach. Finally, we combine all the subsequences to generate a new multiple sequence alignment. This technique is applied on the solutions of the initial generation and of each child generation within VDGA. We have used two mechanisms to generate an initial population in this research: the first mechanism is to generate guide trees with randomly selected sequences and the second is shuffling the sequences inside such trees. Two different genetic operators have been implemented with VDGA. To test the performance of our algorithm, we have compared it with existing well-known methods, namely PRRP, CLUSTALX, DIALIGN, HMMT, SB_PIMA, ML_PIMA, MULTALIGN, and PILEUP8, and also other methods, based on Genetic Algorithms (GA), such as SAGA, MSA-GA and RBT-GA, by solving a number of benchmark datasets from BAliBase 2.0. Conclusions The experimental results showed that the VDGA with three vertical divisions was the most successful variant for most of the test cases in comparison to other divisions considered with VDGA. The experimental results also confirmed that VDGA outperformed the other methods considered in this research. PMID:21867510
Gupta, Kshitiz; Thomas, Dina; Vidya, SV; Venkatesh, KV; Ramakumar, S
Background The chemical properties and biological function of a protein are a direct consequence of its primary structure. Several algorithms have been developed which determine alignment and similarity of primary protein sequences. However, character based similarity cannot provide insight into the structural aspects of a protein. We present a method based on spectral similarity to compare subsequences of amino acids that behave similarly but are not aligned well by considering amino acids as mere characters. This approach finds a similarity score between sequences based on any given attribute, like hydrophobicity of amino acids, on the basis of spectral information after partial conversion to the frequency domain. Results Distance matrices of various branches of the human kinome, that is the full complement of human kinases, were developed that matched the phylogenetic tree of the human kinome establishing the efficacy of the global alignment of the algorithm. PKCd and PKCe kinases share close biological properties and structural similarities but do not give high scores with character based alignments. Detailed comparison established close similarities between subsequences that do not have any significant character identity. We compared their known 3D structures to establish that the algorithm is able to pick subsequences that are not considered similar by character based matching algorithms but share structural similarities. Similarly many subsequences with low character identity were picked between xyna-theau and xyna-clotm F/10 xylanases. Comparison of 3D structures of the subsequences confirmed the claim of similarity in structure. Conclusion An algorithm is developed which is inspired by successful application of spectral similarity applied to music sequences. The method captures subsequences that do not align by traditional character based alignment tools but give rise to similar secondary and tertiary structures. The Spectral Similarity Score (SSS) is an
Horton, Matthew; Bodenhausen, Natacha; Bergelson, Joy
We have created a suite of Java-based software to better provide taxonomic assignments to DNA sequences. We anticipate that the program will be useful for protistologists, virologists, mycologists and other microbial ecologists. The program relies on NCBI utilities including the BLAST software and Taxonomy database and is easily manipulated at the command-line to specify a BLAST candidate's query-coverage or percent identity requirements; other options include the ability to set minimal consensus requirements (%) for each of the eight major taxonomic ranks (Domain, Kingdom, Phylum, ...) and whether to consider lower scoring candidates when the top-hit lacks taxonomic classification.
Background Multiple sequence alignments are used to study gene or protein function, phylogenetic relations, genome evolution hypotheses and even gene polymorphisms. Virtually without exception, all available tools focus on conserved segments or residues. Small divergent regions, however, are biologically important for specific quantitative polymerase chain reaction, genotyping, molecular markers and preparation of specific antibodies, and yet have received little attention. As a consequence, they must be selected empirically by the researcher. AlignMiner has been developed to fill this gap in bioinformatic analyses. Results AlignMiner is a Web-based application for detection of conserved and divergent regions in alignments of conserved sequences, focusing particularly on divergence. It accepts alignments (protein or nucleic acid) obtained using any of a variety of algorithms, which does not appear to have a significant impact on the final results. AlignMiner uses different scoring methods for assessing conserved/divergent regions, Entropy being the method that provides the highest number of regions with the greatest length, and Weighted being the most restrictive. Conserved/divergent regions can be generated either with respect to the consensus sequence or to one master sequence. The resulting data are presented in a graphical interface developed in AJAX, which provides remarkable user interaction capabilities. Users do not need to wait until execution is complete and can even inspect their results on a different computer. Data can be downloaded onto a user disk, in standard formats. In silico and experimental proof-of-concept cases have shown that AlignMiner can be successfully used to design specific polymerase chain reaction primers as well as potential epitopes for antibodies. Primer design is assisted by a module that deploys several oligonucleotide parameters for designing primers "on the fly". Conclusions AlignMiner can be used to reliably detect
Marass, Francesco; Upton, Chris
The volume of viral genomic sequence data continues to increase rapidly. This is especially true for the smaller RNA viruses, which are relatively easy to sequence in large numbers. The data volumes cause a number of significant problems for research applications that require large multiple alignments of essentially complete genomes, which are of the order of 10 kb. We present a simple strategy to enable the creation of large quasi-multiple sequence alignments from pairwise alignment data. This process is suitable for large, closely related sequences such as the polyproteins of dengue viruses, which need the insertion of very few indels. The quasi-multiple sequence alignments generated by KISSa are sufficiently accurate to support tree-based genome selection for interactive bioinformatics analysis tools. The speed of this process is critical to providing an interactive experience for the user.
Wala, Jeremiah; Beroukhim, Rameen
We present SeqLib, a C++ API and command line tool that provides a rapid and user-friendly interface to BAM/SAM/CRAM files, global sequence alignment operations and sequence assembly. Four C libraries perform core operations in SeqLib: HTSlib for BAM access, BWA-MEM and BLAT for sequence alignment and Fermi for error correction and sequence assembly. Benchmarking indicates that SeqLib has lower CPU and memory requirements than leading C++ sequence analysis APIs. We demonstrate an example of how minimal SeqLib code can extract, error-correct and assemble reads from a CRAM file and then align with BWA-MEM. SeqLib also provides additional capabilities, including chromosome-aware interval queries and read plotting. Command line tools are available for performing integrated error correction, micro-assemblies and alignment.
Muth, Thilo; García-Martín, Juan A; Rausell, Antonio; Juan, David; Valencia, Alfonso; Pazos, Florencio
We have implemented in a single package all the features required for extracting, visualizing and manipulating fully conserved positions as well as those with a family-dependent conservation pattern in multiple sequence alignments. The program allows, among other things, to run different methods for extracting these positions, combine the results and visualize them in protein 3D structures and sequence spaces. JDet is a multiplatform application written in Java. It is freely available, including the source code, at http://csbg.cnb.csic.es/JDet. The package includes two of our recently developed programs for detecting functional positions in protein alignments (Xdet and S3Det), and support for other methods can be added as plug-ins. A help file and a guided tutorial for JDet are also available.
Song, Kai; Ren, Jie; Reinert, Gesine; Deng, Minghua
With the development of next-generation sequencing (NGS) technologies, a large amount of short read data has been generated. Assembly of these short reads can be challenging for genomes and metagenomes without template sequences, making alignment-based genome sequence comparison difficult. In addition, sequence reads from NGS can come from different regions of various genomes and they may not be alignable. Sequence signature-based methods for genome comparison based on the frequencies of word patterns in genomes and metagenomes can potentially be useful for the analysis of short reads data from NGS. Here we review the recent development of alignment-free genome and metagenome comparison based on the frequencies of word patterns with emphasis on the dissimilarity measures between sequences, the statistical power of these measures when two sequences are related and the applications of these measures to NGS data. PMID:24064230
Pellicer, S; Chen, G; Chan, K C C; Pan, Y
The public computer architecture shows promise as a platform for solving fundamental problems in bioinformatics such as global gene sequence alignment and data mining with tools such as the basic local alignment search tool (BLAST). Our implementation of these two problems on the Berkeley open infrastructure for network computing (BOINC) platform demonstrates a runtime reduction factor of 1.15 for sequence alignment and 16.76 for BLAST. While the runtime reduction factor of the global gene sequence alignment application is modest, this value is based on a theoretical sequential runtime extrapolated from the calculation of a smaller problem. Because this runtime is extrapolated from running the calculation in memory, the theoretical sequential runtime would require 37.3 GB of memory on a single system. With this in mind, the BOINC implementation not only offers the reduced runtime, but also the aggregation of the available memory of all participant nodes. If an actual sequential run of the problem were compared, a more drastic reduction in the runtime would be seen due to an additional secondary storage I/O overhead for a practical system. Despite the limitations of the public computer architecture, most notably in communication overhead, it represents a practical platform for grid- and cluster-scale bioinformatics computations today and shows great potential for future implementations.
Klaere, Steffen; Gesell, Tanja; von Haeseler, Arndt
We introduce another view of sequence evolution. Contrary to other approaches, we model the substitution process in two steps. First we assume (arbitrary) scaled branch lengths on a given phylogenetic tree. Second we allocate a Poisson distributed number of substitutions on the branches. The probability to place a mutation on a branch is proportional to its relative branch length. More importantly, the action of a single mutation on an alignment column is described by a doubly stochastic matrix, the so-called one-step mutation matrix. This matrix leads to analytical formulae for the posterior probability distribution of the number of substitutions for an alignment column.
Lauredo, Alexandre M.; Sena, Alexandre C.; de Castro, Maria Clicia S.; Leandro, Marzulo, A. J.
Finding the longest common subsequence (LCS) is an important part of DNA sequence alignment. Through dynamic programming it is possible to find the exact solution to the LCS, with space and time complexity of O(m × n), where m and n are the sequence lengths. Parallel algorithms are essential, since large sequences require too much time and memory to be processed sequentially. Thus, the aim of this work is to implement and evaluate different parallel solutions for distributed memory machines, so that the amount of memory is equally divided among the various processing nodes.
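The sequential dynamic program referred to above can be sketched as follows. This is the textbook single-machine version, shown only to make the O(m × n) table explicit; it contains none of the distributed-memory partitioning that is the subject of the work, and the function name is illustrative.

```python
def longest_common_subsequence(a, b):
    """Return one longest common subsequence of a and b.

    Fills the full (m + 1) x (n + 1) table, so both time and memory are
    O(m * n); the table is then walked backwards to recover the subsequence.
    """
    m, n = len(a), len(b)
    # L[i][j] = LCS length of a[:i] and b[:j]
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])

    # Traceback from the bottom-right corner.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```

Each cell depends only on its left, upper and upper-left neighbours, so cells on the same anti-diagonal are independent of one another; this dependency pattern is what parallel formulations usually exploit.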
Li, Andrew X; Marz, Manja; Qin, Jing; Reidys, Christian M
Many computerized methods for RNA-RNA interaction structure prediction have been developed. Recently, $O(N^6)$ time and $O(N^4)$ space dynamic programming algorithms have become available that compute the partition function of RNA-RNA interaction complexes. However, few of these methods incorporate the knowledge concerning related sequences, thus relevant evolutionary information is often neglected from the structure determination. Therefore, it is of considerable practical interest to introduce a method taking into consideration both: thermodynamic stability as well as sequence/structure covariation. We present the a priori folding algorithm ripalign, whose input consists of two (given) multiple sequence alignments (MSA). ripalign outputs (i) the partition function, (ii) base pairing probabilities, (iii) hybrid probabilities and (iv) a set of Boltzmann-sampled suboptimal structures consisting of canonical joint structures that are compatible to the alignments. Compared to the single sequence-pair folding algorithm rip, ripalign requires negligible additional memory resource but offers much better sensitivity and specificity, once alignments of suitable quality are given. ripalign additionally allows to incorporate structure constraints as input parameters. The algorithm described here is implemented in C as part of the rip package.
Sato, Tetsuya; Suyama, Mikita
Genome sequence alignments provide valuable information on many aspects of molecular biological processes. In this study, we developed a web server, GenomeCons, for manipulating multiple genome sequence alignments and their consensus sequences for high-throughput genome sequence analyses. This server facilitates the visual inspection of multiple genome sequence alignments for a set of genomic intervals at a time. This allows the user to examine how these sites are evolutionarily conserved over time for their functional importance. The server also reports consensus sequences for the input genomic intervals, which can be applied to downstream analyses such as the identification of common motifs in the regions determined by ChIP-seq experiments. GenomeCons is freely accessible at http://bioinfo.sls.kyushu-u.ac.jp/genomecons/. © The Author 2014. Published by Oxford University Press.
Jacobi, Ricardo P; Ayala-Rincón, Mauricio; Carvalho, Luis G A; Llanos, Carlos H; Hartenstein, Reiner W
Reconfigurable systolic arrays can be adapted to efficiently resolve a wide spectrum of computational problems; parallelism is naturally explored in systolic arrays and reconfigurability allows for redefinition of the interconnections and operations even during run time (dynamically). We present a reconfigurable systolic architecture that can be applied for the efficient treatment of several dynamic programming methods for resolving well-known problems, such as global and local sequence alignment, approximate string matching and longest common subsequence. The dynamicity of the reconfigurability was found to be useful for practical applications in the construction of sequence alignments. A VHDL (VHSIC hardware description language) version of this new architecture was implemented on an APEX FPGA (field programmable gate array). It would be several orders of magnitude faster than the software algorithm alternatives.
Trapnell, Cole; Schatz, Michael C.
MUMmerGPU uses highly-parallel commodity graphics processing units (GPU) to accelerate the data-intensive computation of aligning next generation DNA sequence data to a reference sequence for use in diverse applications such as disease genotyping and personal genomics. MUMmerGPU 2.0 features a new stackless depth-first-search print kernel and is 13× faster than the serial CPU version of the alignment code and nearly 4× faster in total computation time than MUMmerGPU 1.0. We exhaustively examined 128 GPU data layout configurations to improve register footprint and running time and conclude higher occupancy has greater impact than reduced latency. MUMmerGPU is available open-source at http://mummergpu.sourceforge.net. PMID:20161021
Incremental Window-based Protein Sequence Alignment Algorithms. Huzefa Rangwala and George Karypis, March 23, 2006. ... Then it performs a series of iterations in which it performs the following three steps: First, it extracts from ... the residue-pair with the highest
Hao, Bailin; Qi, Ji; Wang, Bin
We present a brief review of a series of on-going work on bacterial phylogeny. We propose a new method to infer relatedness of prokaryotes from their complete genome data without using sequence alignment, leading to results comparable with the bacteriologist's systematics as reflected in the latest 2001 edition of Bergey's Manual of Systematic Bacteriology. We only touch on the mathematical aspects of the method. The biological implications of our results will be published elsewhere.
Daily, Jeffrey A.; Kalyanaraman, Anantharaman; Krishnamoorthy, Sriram; Ren, Bin
Vector extensions, such as SSE, have been part of the x86 since the 1990s, with applications in graphics, signal processing, and scientific applications. Although many algorithms and applications can naturally benefit from automatic vectorization techniques, there are still many that are difficult to vectorize due to their dependence on irregular data structures, dense branch operations, or data dependencies. Sequence alignment, one of the most widely used operations in bioinformatics workflows, has a computational footprint that features complex data dependencies. In this paper, we demonstrate that the trend of widening vector registers adversely affects the state-of-the-art sequence alignment algorithm based on striped data layouts. We present a practically efficient SIMD implementation of a parallel scan based sequence alignment algorithm that can better exploit wider SIMD units. We conduct comprehensive workload and use case analyses to characterize the relative behavior of the striped and scan approaches and identify the best choice of algorithm based on input length and SIMD width.
Hartmann, Alexander K
A method to calculate probability distributions in regions where the events are very unlikely (e.g., $p \approx 10^{-40}$) is presented. The basic idea is to map the underlying model on a physical system. The system is simulated at a low temperature, such that preferably configurations with originally low probabilities are generated. Since the distribution of such a physical system is known, the original unbiased distribution can be obtained. As an application, local alignment of protein sequences is studied. The deviation of the distribution p(S) of optimum scores from the extreme-value distribution is quantified. This deviation decreases with growing sequence length.
Chavoshi, Seyed Hossein; De Baets, Bernard; Neutens, Tijs; De Tré, Guy; Van de Weghe, Nico
Despite the abundance of research on knowledge discovery from moving object databases, only a limited number of studies have examined the interaction between moving point objects in space over time. This paper describes a novel approach for measuring similarity in the interaction between moving objects. The proposed approach consists of three steps. First, we transform movement data into sequences of successive qualitative relations based on the Qualitative Trajectory Calculus (QTC). Second, sequence alignment methods are applied to measure the similarity between movement sequences. Finally, movement sequences are grouped based on similarity by means of an agglomerative hierarchical clustering method. The applicability of this approach is tested using movement data from samba and tango dancers. PMID:26181435
Thompson, Julie D; Muller, Arnaud; Waterhouse, Andrew; Procter, Jim; Barton, Geoffrey J; Plewniak, Frédéric; Poch, Olivier
Background In the post-genomic era, systems-level studies are being performed that seek to explain complex biological systems by integrating diverse resources from fields such as genomics, proteomics or transcriptomics. New information management systems are now needed for the collection, validation and analysis of the vast amount of heterogeneous data available. Multiple alignments of complete sequences provide an ideal environment for the integration of this information in the context of the protein family. Results MACSIMS is a multiple alignment-based information management program that combines the advantages of both knowledge-based and ab initio sequence analysis methods. Structural and functional information is retrieved automatically from the public databases. In the multiple alignment, homologous regions are identified and the retrieved data is evaluated and propagated from known to unknown sequences with these reliable regions. In a large-scale evaluation, the specificity of the propagated sequence features is estimated to be >99%, i.e. very few false positive predictions are made. MACSIMS is then used to characterise mutations in a test set of 100 proteins that are known to be involved in human genetic diseases. The number of sequence features associated with these proteins was increased by 60%, compared to the features available in the public databases. An XML format output file allows automatic parsing of the MACSIM results, while a graphical display using the JalView program allows manual analysis. Conclusion MACSIMS is a new information management system that incorporates detailed analyses of protein families at the structural, functional and evolutionary levels. MACSIMS thus provides a unique environment that facilitates knowledge extraction and the presentation of the most pertinent information to the biologist. A web server and the source code are available at . PMID:16792820
Guan, X; Uberbacher, E C
Molecular sequences, like all experimental data, are subject to error. Many current DNA sequencing protocols have very significant error rates and often generate artefactual insertions and deletions of bases (indels) which corrupt the translation of sequences and compromise the detection of protein homologies. The impact of these errors on the utility of molecular sequence data is dependent on the analytic technique used to interpret the data. In the presence of frameshift errors, standard algorithms using six-frame translation can miss important homologies because only subfragments of the correct translation are available in any given frame. We present a new algorithm which can detect and correct frameshift errors in DNA sequences during comparison of translated sequences with protein sequences in the databases. This algorithm can recognize homologous proteins sharing 30% identity even in the presence of a 7% frameshift error rate. Our algorithm uses dynamic programming, producing a guaranteed optimal alignment in the presence of frameshifts, and has a sensitivity equivalent to Smith-Waterman. The computational efficiency of the algorithm is O(nm) where n and m are the sizes of two sequences being compared. The algorithm does not rely on prior knowledge or heuristic rules and performs significantly better than any previously reported method.
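For orientation, the O(nm) dynamic programming baseline that the abstract compares against can be sketched as follows. This is a plain Smith-Waterman local alignment score with a linear gap penalty, not the frameshift-aware algorithm of the paper; the function name and the scoring values (match, mismatch, gap) are illustrative assumptions.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b.

    Plain Smith-Waterman with a linear gap penalty (hypothetical scoring
    values). Runs in O(n*m) time and keeps only two rows in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1].
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Every cell of the n-by-m table is filled once from three neighbours, which is where the O(nm) cost comes from.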
Neuwald, A F; Liu, J S; Lipman, D J; Lawrence, C E
Biologists often gain structural and functional insights into a protein sequence by constructing a multiple alignment model of the family. Here a program called Probe fully automates this process of model construction starting from a single sequence. Central to this program is a powerful new method to locate and align only those, often subtly, conserved patterns essential to the family as a whole. When applied to randomly chosen proteins, Probe found on average about four times as many relationships as a pairwise search and yielded many new discoveries. These include: an obscure subfamily of globins in the roundworm Caenorhabditis elegans ; two new superfamilies of metallohydrolases; a lipoyl/biotin swinging arm domain in bacterial membrane fusion proteins; and a DH domain in the yeast Bud3 and Fus2 proteins. By identifying distant relationships and merging families into superfamilies in this way, this analysis further confirms the notion that proteins evolved from relatively few ancient sequences. Moreover, this method automatically generates models of these ancient conserved regions for rapid and sensitive screening of sequences. PMID:9108146
Ben Othman, Mohamed Tahar; Abdel-Azim, Gamil
Multiple sequence alignment (MSA) is one of the most intensively researched topics in bioinformatics. It is known to be an NP-complete problem and is considered one of the most important and daunting tasks in computational biology. Consequently, a large number of heuristic algorithms have been proposed to find good alignments. Among these heuristics are genetic algorithms (GAs). GAs have two major weaknesses: they are time consuming and can become trapped in local minima. A significant aspect of the GA process in MSA is maximizing the similarity between sequences by adding and shuffling the gaps of the Solution Coding (SC). Several SC schemes have been introduced; one of them is Permutation Coding (PC). We propose a hybrid algorithm based on a GA with PC and the 2-opt algorithm. The PC helps to code the MSA solution in a way that maximizes the gain of resources, reliability and diversity of the GA. Using the PC also makes all functions defined over permutations applicable to MSA. We therefore suggest an algorithm, based on PC, to calculate the scoring function for multiple alignments, which is used as the fitness function. This algorithm reduces the time complexity of the GA. Our GA is implemented with different selection strategies and different crossovers. The probability of crossover and mutation is set as one strategy. Relevant patents on the topic have also been surveyed.
Grabherr, Manfred G.; Russell, Pamela; Meyer, Miriah; Mauceli, Evan; Alföldi, Jessica; Di Palma, Federica; Lindblad-Toh, Kerstin
Motivation: Comparative genomics heavily relies on alignments of large and often complex DNA sequences. From an engineering perspective, the problem here is to provide maximum sensitivity (to find all there is to find), specificity (to only find real homology) and speed (to accommodate the billions of base pairs of vertebrate genomes). Results: Satsuma addresses all three issues through novel strategies: (i) cross-correlation, implemented via fast Fourier transform; (ii) a match scoring scheme that eliminates almost all false hits; and (iii) an asynchronous ‘battleship’-like search that allows for aligning two entire fish genomes (470 and 217 Mb) in 120 CPU hours using 15 processors on a single machine. Availability: Satsuma is part of the Spines software package, implemented in C++ on Linux. The latest version of Spines can be freely downloaded under the LGPL license from http://www.broadinstitute.org/science/programs/genome-biology/spines/ PMID:20208069
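The quantity that a cross-correlation of two DNA sequences measures can be illustrated with a direct, quadratic-time sketch. It counts matching bases at every relative shift of a query against a target; a fast Fourier transform yields the same counts more quickly, but is omitted here for brevity. The function names are hypothetical and this is not Satsuma's implementation.

```python
def match_counts_by_shift(query, target):
    """Map each shift s to the number of i with query[i] == target[i + s].

    Direct O(len(query) * len(target)) evaluation of the cross-correlation
    of base-identity indicators; an FFT computes the same counts faster.
    """
    counts = {}
    for shift in range(-(len(query) - 1), len(target)):
        n = 0
        for i, base in enumerate(query):
            j = i + shift
            if 0 <= j < len(target) and target[j] == base:
                n += 1
        counts[shift] = n
    return counts


def best_shift(query, target):
    """Return the shift with the most matching bases."""
    counts = match_counts_by_shift(query, target)
    return max(counts, key=counts.get)
```

A pronounced peak at one shift suggests a candidate ungapped match at that offset, which a later scoring step can accept or reject.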
Kam, Alfred; Kwak, Daniel; Leung, Clarence; Wu, Chu; Zarour, Eleyine; Sarmenta, Luis; Blanchette, Mathieu; Waldispühl, Jérôme
Background Comparative genomics, or the study of the relationships of genome structure and function across different species, offers a powerful tool for studying evolution, annotating genomes, and understanding the causes of various genetic disorders. However, aligning multiple sequences of DNA, an essential intermediate step for most types of analyses, is a difficult computational task. In parallel, citizen science, an approach that takes advantage of the fact that the human brain is exquisitely tuned to solving specific types of problems, is becoming increasingly popular. There, instances of hard computational problems are dispatched to a crowd of non-expert human game players and solutions are sent back to a central server. Methodology/Principal Findings We introduce Phylo, a human-based computing framework applying “crowd sourcing” techniques to solve the Multiple Sequence Alignment (MSA) problem. The key idea of Phylo is to convert the MSA problem into a casual game that can be played by ordinary web users with a minimal prior knowledge of the biological context. We applied this strategy to improve the alignment of the promoters of disease-related genes from up to 44 vertebrate species. Since the launch in November 2010, we received more than 350,000 solutions submitted from more than 12,000 registered users. Our results show that solutions submitted contributed to improving the accuracy of up to 70% of the alignment blocks considered. Conclusions/Significance We demonstrate that, combined with classical algorithms, crowd computing techniques can be successfully used to help improving the accuracy of MSA. More importantly, we show that an NP-hard computational problem can be embedded in casual game that can be easily played by people without significant scientific training. This suggests that citizen science approaches can be used to exploit the billions of “human-brain peta-flops” of computation that are spent every day playing games. Phylo is
Kawrykow, Alexander; Roumanis, Gary; Kam, Alfred; Kwak, Daniel; Leung, Clarence; Wu, Chu; Zarour, Eleyine; Sarmenta, Luis; Blanchette, Mathieu; Waldispühl, Jérôme
Comparative genomics, or the study of the relationships of genome structure and function across different species, offers a powerful tool for studying evolution, annotating genomes, and understanding the causes of various genetic disorders. However, aligning multiple sequences of DNA, an essential intermediate step for most types of analyses, is a difficult computational task. In parallel, citizen science, an approach that takes advantage of the fact that the human brain is exquisitely tuned to solving specific types of problems, is becoming increasingly popular. There, instances of hard computational problems are dispatched to a crowd of non-expert human game players and solutions are sent back to a central server. We introduce Phylo, a human-based computing framework applying "crowd sourcing" techniques to solve the Multiple Sequence Alignment (MSA) problem. The key idea of Phylo is to convert the MSA problem into a casual game that can be played by ordinary web users with a minimal prior knowledge of the biological context. We applied this strategy to improve the alignment of the promoters of disease-related genes from up to 44 vertebrate species. Since the launch in November 2010, we received more than 350,000 solutions submitted from more than 12,000 registered users. Our results show that solutions submitted contributed to improving the accuracy of up to 70% of the alignment blocks considered. We demonstrate that, combined with classical algorithms, crowd computing techniques can be successfully used to help improving the accuracy of MSA. More importantly, we show that an NP-hard computational problem can be embedded in casual game that can be easily played by people without significant scientific training. This suggests that citizen science approaches can be used to exploit the billions of "human-brain peta-flops" of computation that are spent every day playing games. Phylo is available at: http://phylo.cs.mcgill.ca.
Background The dramatic fall in the cost of genomic sequencing, and the increasing convenience of distributed cloud computing resources, positions the MapReduce coding pattern as a cornerstone of scalable bioinformatics algorithm development. In some cases an algorithm will find a natural distribution via use of map functions to process vectorized components, followed by a reduce of aggregate intermediate results. However, for some data analysis procedures such as sequence analysis, a more fundamental reformulation may be required. Results In this report we describe a solution to sequence comparison that can be thoroughly decomposed into multiple rounds of map and reduce operations. The route taken makes use of iterated maps, a fractal analysis technique that has been found to provide an "alignment-free" solution to sequence analysis and comparison, that is, a solution that does not require dynamic programming and relies instead on a numeric Chaos Game Representation (CGR) data structure. This claim is demonstrated in this report by calculating the length of the longest similar segment by inspecting only the USM coordinates of two analogous units, with no resort to dynamic programming. Conclusions The procedure described is an attempt at extreme decomposition and parallelization of sequence alignment in anticipation of a volume of genomic sequence data that cannot be met by current algorithmic frameworks. The solution found is delivered with a browser-based application (webApp), highlighting the browser's emergence as an environment for high performance distributed computing. Availability Public distribution of accompanying software library with open source and version control at http://usm.github.com. Also available as a webApp through Google Chrome's WebStore http://chrome.google.com/webstore: search with "usm". PMID:22551205
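The iterated map behind a Chaos Game Representation can be sketched in a few lines. This is the textbook CGR construction, assuming the common corner assignment A=(0,0), C=(0,1), G=(1,1), T=(1,0) and a start at the centre of the unit square; it is not the USM library's code, and the function name is hypothetical.

```python
# Assumed corner assignment for the four nucleotides (a common convention).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_coordinates(seq):
    """Return the CGR point reached after each base of seq.

    Each step moves the current point halfway towards the corner of the
    next base, so a point encodes the recent history of the sequence.
    """
    x, y = 0.5, 0.5  # start at the centre of the unit square
    points = []
    for base in seq:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points
```

Because every step halves the distance to a corner, two sequences that end in the same k bases end at points that differ by at most 2^-k in each coordinate. This is why coordinate distance alone can indicate the length of a shared segment without dynamic programming.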
Hao, Bailin; Qi, Ji; Wang, Bin
This is a brief review of an ongoing series of studies on bacterial phylogeny. We have proposed a new method to infer the relatedness of prokaryotes from their complete genome data without using sequence alignment. It has led to results comparable with the bacteriologists' systematics as reflected in the latest 2001 edition of Bergey's Manual of Systematic Bacteriology. In what follows we only touch on the mathematical aspects of the method. The biological implications of our results will be published elsewhere.
Ono, Yukiteru; Asai, Kiyoshi
Summary: LAST-TRAIN improves sequence alignment accuracy by inferring substitution and gap scores that fit the frequencies of substitutions, insertions, and deletions in a given dataset. We have applied it to mapping DNA reads from IonTorrent and PacBio RS, and we show that it reduces reference bias for Oxford Nanopore reads. Availability and Implementation: the source code is freely available at http://last.cbrc.jp/ Supplementary information: Supplementary data are available at Bioinformatics online. PMID:28039163
Wheeler, Ward C.
A method to align sequence data based on parsimonious synapomorphy schemes generated by direct optimization (DO; earlier termed optimization alignment) is proposed. DO directly diagnoses sequence data on cladograms without an intervening multiple-alignment step, thereby creating topology-specific, dynamic homology statements. Hence, no multiple-alignment is required to generate cladograms. Unlike general and globally optimal multiple-alignment procedures, the method described here, implied alignment (IA), takes these dynamic homologies and traces them back through a single cladogram, linking the unaligned sequence positions in the terminal taxa via DO transformation series. These "lines of correspondence" link ancestor-descendent states and, when displayed as linearly arrayed columns without hypothetical ancestors, are largely indistinguishable from standard multiple alignment. Since this method is based on synapomorphy, the treatment of certain classes of insertion-deletion (indel) events may be different from that of other alignment procedures. As with all alignment methods, results are dependent on parameter assumptions such as indel cost and transversion:transition ratios. Such an IA could be used as a basis for phylogenetic search, but this would be questionable since the homologies derived from the implied alignment depend on its natal cladogram and any variance, between DO and IA + Search, due to heuristic approach. The utility of this procedure in heuristic cladogram searches using DO and the improvement of heuristic cladogram cost calculations are discussed. © 2003 The Willi Hennig Society. Published by Elsevier Science (USA). All rights reserved.
Sievers, Fabian; Wilm, Andreas; Dineen, David; Gibson, Toby J; Karplus, Kevin; Li, Weizhong; Lopez, Rodrigo; McWilliam, Hamish; Remmert, Michael; Söding, Johannes; Thompson, Julie D; Higgins, Desmond G
Multiple sequence alignments are fundamental to many sequence analysis methods. Most alignments are computed using the progressive alignment heuristic. These methods are starting to become a bottleneck in some analysis pipelines when faced with data sets of the size of many thousands of sequences. Some methods allow computation of larger data sets while sacrificing quality, and others produce high-quality alignments, but scale badly with the number of sequences. In this paper, we describe a new program called Clustal Omega, which can align virtually any number of protein sequences quickly and that delivers accurate alignments. The accuracy of the package on smaller test cases is similar to that of the high-quality aligners. On larger data sets, Clustal Omega outperforms other packages in terms of execution time and quality. Clustal Omega also has powerful features for adding sequences to and exploiting information in existing alignments, making use of the vast amount of precomputed information in public databases like Pfam. PMID:21988835
Costantini, Susan; Colonna, Giovanni; Facchiano, Angelo M
Multiple sequence alignments are successfully applied in many studies for understanding the structural and functional relations among single nucleic acids and protein sequences as well as whole families. Because of the rapid growth of sequence databases, multiple sequence alignments can often be very large and difficult to visualize and analyze. We offer a new service that visualizes and analyzes multiple alignments obtained with different external algorithms, with new features useful for comparing the aligned sequences and for creating a final image of the alignment. The service is named FASMA and is available at http://bioinformatica.isa.cnr.it/FASMA/.
Shih, Arthur Chun-Chieh; Lee, D T; Lin, Laurent; Peng, Chin-Lin; Chen, Shiang-Heng; Wu, Yu-Wei; Wong, Chun-Yi; Chou, Meng-Yuan; Shiao, Tze-Chang; Hsieh, Mu-Fen
Deluged by the rate and complexity of completed genomic sequences, the need to align longer sequences becomes more urgent, and many more tools have thus been developed. In the initial stage of genomic sequence analysis, a biologist is usually faced with the questions of how to choose the best tool to align sequences of interest and how to analyze and visualize the alignment results, and then with the question of whether poorly aligned regions produced by the tool are indeed not homologous or are just results due to inappropriate alignment tools or scoring systems used. Although several systematic evaluations of multiple sequence alignment (MSA) programs have been proposed, they may not provide a standard-bearer for most biologists because those poorly aligned regions in these evaluations are never discussed. Thus, a tool that allows cross comparison of the alignment results obtained by different tools simultaneously could help a biologist evaluate their correctness and accuracy. In this paper, we present a versatile alignment visualization system, called SinicView, (for Sequence-aligning INnovative and Interactive Comparison VIEWer), which allows the user to efficiently compare and evaluate assorted nucleotide alignment results obtained by different tools. SinicView calculates similarity of the alignment outputs under a fixed window using the sum-of-pairs method and provides scoring profiles of each set of aligned sequences. The user can visually compare alignment results either in graphic scoring profiles or in plain text format of the aligned nucleotides along with the annotations information. We illustrate the capabilities of our visualization system by comparing alignment results obtained by MLAGAN, MAVID, and MULTIZ, respectively. With SinicView, users can use their own data sequences to compare various alignment tools or scoring systems and select the most suitable one to perform alignment in the initial stage of sequence analysis.
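A windowed sum-of-pairs profile of the kind the abstract describes can be sketched as below. The scoring values (1 for an identical pair, 0 for a mismatch or any pair involving a gap) and the function names are assumptions chosen for illustration; the abstract does not state SinicView's actual scoring parameters.

```python
from itertools import combinations


def column_sp(column, match=1, mismatch=0, gap=0):
    """Sum-of-pairs score of one alignment column (hypothetical scoring)."""
    score = 0
    for x, y in combinations(column, 2):
        if x == "-" or y == "-":
            score += gap
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


def windowed_sp_profile(alignment, window):
    """Sum-of-pairs score of every fixed-width window, sliding by one column.

    alignment is a list of equal-length gapped strings, one per sequence.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned rows must have the same length")
    col_scores = [column_sp(col) for col in zip(*alignment)]
    return [sum(col_scores[i:i + window]) for i in range(length - window + 1)]
```

Computing one such profile per alignment tool over the same sequences gives curves that can be compared window by window, where a dip in one profile marks a region that tool aligned poorly.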
Qi, Zhi; Redding, Sy; Lee, Ja Yil; Gibb, Bryan; Kwon, YoungHo; Niu, Hengyao; Gaines, William A; Sung, Patrick; Greene, Eric C
Homologous recombination (HR) mediates the exchange of genetic information between sister or homologous chromatids. During HR, members of the RecA/Rad51 family of recombinases must somehow search through vast quantities of DNA sequence to align and pair single-strand DNA (ssDNA) with a homologous double-strand DNA (dsDNA) template. Here, we use single-molecule imaging to visualize Rad51 as it aligns and pairs homologous DNA sequences in real time. We show that Rad51 uses a length-based recognition mechanism while interrogating dsDNA, enabling robust kinetic selection of 8-nucleotide (nt) tracts of microhomology, which kinetically confines the search to sites with a high probability of being a homologous target. Successful pairing with a ninth nucleotide coincides with an additional reduction in binding free energy, and subsequent strand exchange occurs in precise 3-nt steps, reflecting the base triplet organization of the presynaptic complex. These findings provide crucial new insights into the physical and evolutionary underpinnings of DNA recombination. Copyright © 2015 Elsevier Inc. All rights reserved.
Barson, Gemma; Griffiths, Ed
Manual annotation is essential to create high-quality reference alignments and annotation. Annotators need to be able to view sequence alignments in detail. The SeqTools package provides three tools for viewing different types of sequence alignment: Blixem is a many-to-one browser of pairwise alignments, displaying multiple match sequences aligned against a single reference sequence; Dotter provides a graphical dot-plot view of a single pairwise alignment; and Belvu is a multiple sequence alignment viewer, editor, and phylogenetic tool. These tools were originally part of the AceDB genome database system but have been completely rewritten to make them generally available as a standalone package of greatly improved function. Blixem is used by annotators to give a detailed view of the evidence for particular gene models. Blixem displays the gene model positions and the match sequences aligned against the genomic reference sequence. Annotators use this for many reasons, including to check the quality of an alignment, to find missing/misaligned sequence and to identify splice sites and polyA sites and signals. Dotter is used to give a dot-plot representation of a particular pairwise alignment. This is used to identify sequence that is not represented (or is misrepresented) and to quickly compare annotated gene models with transcriptional and protein evidence that putatively supports them. Belvu is used to analyse conservation patterns in multiple sequence alignments and to perform a combination of manual and automatic processing of the alignment. High-quality reference alignments are essential if they are to be used as a starting point for further automatic alignment generation. While there are many different alignment tools available, the SeqTools package provides unique functionality that annotators have found to be essential for analysing sequence alignments as part of the manual annotation process.
Frenkel, Zakharia M
To establish possible function of a newly discovered protein, alignment of its sequence with other known sequences is required. When the similarity is marginal, the function remains uncertain. A principally new approach is suggested: to use networks in the protein sequence space. The functionality of the protein is firmly established via networks forming chains of consecutive pair-wise matching fragments. The distant relatives are, thus, considered as relatives, though in some cases, there is even no sequence match between the ends of the chain, while the entire chain belongs to the same functional and structural network.
Dong, Min; Graham, Mitchell; Yadav, Nehul
Kim, Changhoon; Tai, Chin-Hsien; Lee, Byungkook
Background Accurate sequence alignment is required in many bioinformatics applications but, when sequence similarity is low, it is difficult to obtain accurate alignments based on sequence similarity alone. The accuracy improves when the structures are available, but current structure-based sequence alignment procedures still mis-align substantial numbers of residues. In order to correct such errors, we previously explored the possibility of replacing the residue-based dynamic programming algorithm in structure alignment procedures with the Seed Extension algorithm, which does not use a gap penalty. Here, we describe a new procedure called RSE (Refinement with Seed Extension) that iteratively refines a structure-based sequence alignment. Results RSE uses SE (Seed Extension) in its core, which is an algorithm that we reported recently for obtaining a sequence alignment from two superimposed structures. The RSE procedure was evaluated by comparing the correctly aligned fractions of residues before and after the refinement of the structure-based sequence alignments produced by popular programs. CE, DaliLite, FAST, LOCK2, MATRAS, MATT, TM-align, SHEBA and VAST were included in this analysis and the NCBI's CDD root node set was used as the reference alignments. RSE improved the average accuracy of sequence alignments for all programs tested when no shift error was allowed. The amount of improvement varied depending on the program. The average improvements were small for DaliLite and MATRAS but about 5% for CE and VAST. More substantial improvements have been seen in many individual cases. The additional computation times required for the refinements were negligible compared to the times taken by the structure alignment programs. Conclusion RSE is a computationally inexpensive way of improving the accuracy of a structure-based sequence alignment. It can be used as a standalone procedure following a regular structure-based sequence alignment or to replace the traditional
Balech, Bachir; Vicario, Saverio; Donvito, Giacinto; Monaco, Alfonso; Notarangelo, Pasquale; Pesole, Graziano
Here we present the MSA-PAD application, a DNA multiple sequence alignment framework that uses PFAM protein domain information to align DNA sequences encoding either single or multiple protein domains. MSA-PAD has two alignment options: gene and genome mode.
Zheng, Qi; Grice, Elizabeth A.
Accurate mapping of next-generation sequencing (NGS) reads to reference genomes is crucial for almost all NGS applications and downstream analyses. Various repetitive elements in human and other higher eukaryotic genomes contribute in large part to ambiguously (non-uniquely) mapped reads. Most available NGS aligners attempt to address this by either removing all non-uniquely mapping reads, or reporting one random or "best" hit based on simple heuristics. Accurate estimation of the mapping quality of NGS reads is therefore critical albeit completely lacking at present. Here we developed a generalized software toolkit "AlignerBoost", which utilizes a Bayesian-based framework to accurately estimate mapping quality of ambiguously mapped NGS reads. We tested AlignerBoost with both simulated and real DNA-seq and RNA-seq datasets at various thresholds. In most cases, but especially for reads falling within repetitive regions, AlignerBoost dramatically increases the mapping precision of modern NGS aligners without significantly compromising the sensitivity even without mapping quality filters. When using higher mapping quality cutoffs, AlignerBoost achieves a much lower false mapping rate while exhibiting comparable or higher sensitivity compared to the aligner default modes, therefore significantly boosting the detection power of NGS aligners even using extreme thresholds. AlignerBoost is also SNP-aware, and higher quality alignments can be achieved if provided with known SNPs. AlignerBoost’s algorithm is computationally efficient, and can process one million alignments within 30 seconds on a typical desktop computer. AlignerBoost is implemented as a uniform Java application and is freely available at https://github.com/Grice-Lab/AlignerBoost. PMID:27706155
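The general idea of a Bayesian mapping quality can be sketched with a generic Phred-scaled posterior. With a uniform prior over the candidate hits of a read, the posterior probability that a hit is correct is its likelihood divided by the sum over all hits, and the quality is -10 log10 of the probability that the hit is wrong. This is a textbook formulation under stated assumptions, not AlignerBoost's actual model; the function name, the log10-likelihood inputs, and the cap of 60 are hypothetical.

```python
import math


def mapping_qualities(log10_likelihoods, max_q=60):
    """Phred-scaled mapping quality for each candidate hit of one read.

    Assumes a uniform prior over hits, so each posterior is the hit's
    likelihood divided by the total. Qualities are capped at max_q.
    """
    top = max(log10_likelihoods)
    # Rescale by the best hit so the powers of ten stay in a safe range.
    weights = [10 ** (value - top) for value in log10_likelihoods]
    total = sum(weights)
    qualities = []
    for w in weights:
        p_wrong = 1 - w / total
        if p_wrong <= 0:
            qualities.append(max_q)
        else:
            qualities.append(min(max_q, -10 * math.log10(p_wrong)))
    return qualities
```

A read with a single hit receives the capped maximum, while two equally likely hits each receive a quality of about 3, so reads in repeats are down-weighted instead of being discarded outright.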
Chowdhury, Biswanath; Garai, Gautam
Sequence alignment is an active research area in bioinformatics. It is also a crucial task, as it guides many others, such as phylogenetic analysis and function and/or structure prediction of biological macromolecules like DNA, RNA, and protein. Proteins are the building blocks of every living organism. Although the protein alignment problem has been studied for several decades, the available methods still produce different results for the same alignment problem. Multiple sequence alignment is a problem of very high computational complexity. Many stochastic methods have therefore been considered for improving alignment accuracy. Among them, the genetic algorithm is frequently used. In this study, we survey the different types of methods applied to alignment and the recent trends in multiobjective genetic algorithms for solving multiple sequence alignment. Many recent studies have demonstrated considerable progress in alignment accuracy. Copyright © 2017 Elsevier Inc. All rights reserved.
Background Accurate and efficient structural alignment of non-coding RNAs (ncRNAs) has attracted increasing attention as recent studies have unveiled the significance of ncRNAs in living organisms. Because Sankoff-style structural alignment algorithms do not scale efficiently to multiple sequences, progressive schemes are mostly used to reduce the complexity. However, this approach tends to propagate early-stage errors throughout the entire process, thereby degrading the quality of the final alignment. For multiple protein sequence alignment, we have recently proposed PicXAA, which constructs an accurate alignment in a non-progressive fashion. Results Here, we propose PicXAA-R as an extension to PicXAA for greedy structural alignment of ncRNAs. PicXAA-R efficiently captures both folding information within each sequence and local similarities between sequences. It uses a set of probabilistic consistency transformations to improve the posterior base-pairing and base alignment probabilities using the information of all sequences in the alignment. Using a graph-based scheme, we greedily build up the structural alignment from sequence regions with high base-pairing and base alignment probabilities. Conclusions Several experiments on datasets with different characteristics confirm that PicXAA-R is one of the fastest algorithms for structural alignment of multiple RNAs and that it consistently yields accurate alignment results, especially for datasets with locally similar sequences. PicXAA-R source code is freely available at: http://www.ece.tamu.edu/~bjyoon/picxaa/. PMID:21342569
Efforts to shift the toxicity testing paradigm from whole organism studies to those focused on the initiation of toxicity and relevant pathways have led to increased utilization of in vitro and in silico methods. Hence the emergence of high-throughput screening (HTS) programs, such as U.S. EPA ToxCast, and application of the adverse outcome pathway (AOP) framework for identifying and defining biological key events triggered upon perturbation of molecular initiating events and leading to adverse outcomes occurring at a level of organization relevant for risk assessment. With these recent initiatives to harness the power of “the pathway” in describing and evaluating toxicity comes the need to extrapolate data beyond the model species. Sequence alignment to predict across-species susceptibility (SeqAPASS) is a web-based tool that allows the user to begin to understand how broadly HTS data or AOP constructs may plausibly be extrapolated across species, while describing the relative intrinsic susceptibility of different taxa to chemicals with known modes of action (e.g., pharmaceuticals and pesticides). The tool rapidly and strategically assesses available molecular target information to describe protein sequence similarity at the primary amino acid sequence, conserved domain, and individual amino acid residue levels. This in silico approach to species extrapolation was designed to automate and streamline the relatively complex and time-consuming process of co
Kraken is an ultrafast and highly accurate program for assigning taxonomic labels to metagenomic DNA sequences. Previous programs designed for this task have been relatively slow and computationally expensive, forcing researchers to use faster abundance estimation programs, which only classify small subsets of metagenomic data. Using exact alignment of k-mers, Kraken achieves classification accuracy comparable to the fastest BLAST program. In its fastest mode, Kraken classifies 100 base pair reads at a rate of over 4.1 million reads per minute, 909 times faster than Megablast and 11 times faster than the abundance estimation program MetaPhlAn. Kraken is available at http://ccb.jhu.edu/software/kraken/. PMID:24580807
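Classification by exact k-mer matching can be sketched as follows. This is a toy illustration only: the helper names `build_index` and `classify` are hypothetical, and the plain majority vote over k-mers unique to one reference is a simplifying assumption, not a description of how Kraken itself resolves k-mers shared between taxa.

```python
from collections import Counter

def build_index(references, k):
    """Map every k-mer to the set of labels whose reference sequence contains it."""
    index = {}
    for label, seq in references.items():
        for i in range(len(seq) - k + 1):
            index.setdefault(seq[i:i + k], set()).add(label)
    return index

def classify(read, index, k):
    """Label a read by majority vote over k-mers found in exactly one reference."""
    votes = Counter()
    for i in range(len(read) - k + 1):
        labels = index.get(read[i:i + k], ())
        if len(labels) == 1:  # ignore k-mers that are absent or ambiguous
            votes[next(iter(labels))] += 1
    return votes.most_common(1)[0][0] if votes else None
```

Because each lookup is a dictionary access on an exact k-mer, no base-by-base alignment is computed, which is where the speed of this family of methods comes from.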
Gao, C; Wang, B; Zhou, C J; Zhang, Q
In bioinformatics, sequence alignment is one of the most common problems. Multiple sequence alignment is an NP (nondeterministic polynomial time) problem, which requires further study and exploration. The chaos optimization algorithm draws on chaos theory and can be combined with the genetic algorithm (GA) to exploit the ergodicity and inherent randomness of chaotic iteration. It is an efficient method for countering premature convergence in the GA. Applying the Logistic map to the GA and using chaotic sequences to carry out the chaotic perturbation can improve the convergence of the basic GA. In addition, the random tournament selection and optimal preservation strategy are used in the GA. Experimental evidence indicates good results for this process.
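The chaotic perturbation step can be sketched with the Logistic map x -> mu * x * (1 - x). The sketch below assumes, purely for illustration, that an individual is a list of real-valued genes in [0, 1]; how the authors encode an alignment is not stated in the abstract, and the names `logistic_sequence`, `chaotic_perturb` and the `strength` parameter are hypothetical.

```python
def logistic_sequence(x0, length, mu=4.0):
    """Iterate x -> mu * x * (1 - x); mu = 4 gives chaotic behaviour on (0, 1)."""
    values = []
    x = x0
    for _ in range(length):
        x = mu * x * (1.0 - x)
        values.append(x)
    return values

def chaotic_perturb(individual, x0=0.3141, strength=0.1):
    """Shift each gene by a chaotic offset, clamping the result to [0, 1]."""
    chaos = logistic_sequence(x0, len(individual))
    perturbed = []
    for gene, c in zip(individual, chaos):
        shifted = gene + strength * (2.0 * c - 1.0)  # offset in [-strength, strength]
        perturbed.append(min(1.0, max(0.0, shifted)))
    return perturbed
```

The seed `x0` should avoid the fixed points of the map (for example 0, 0.5 and 0.75 when mu = 4), since the sequence collapses to a constant from those values.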
Nair, Pradeep S; John, Eugene B
Aligning specific sequences against a very large number of other sequences is a central aspect of bioinformatics. With the widespread availability of personal computers in biology laboratories, sequence alignment is now often performed locally. This makes it necessary to analyse the performance of personal computers for sequence aligning bioinformatics benchmarks. In this paper, we analyse the performance of a personal computer for the popular BLAST and FASTA sequence alignment suites. Results indicate that these benchmarks have a large number of recurring operations and use memory operations extensively. It seems that the performance can be improved with a bigger L1-cache.
Song, Kai; Ren, Jie; Zhai, Zhiyuan; Liu, Xuemei; Deng, Minghua; Sun, Fengzhu
Next-generation sequencing (NGS) technologies have generated enormous amounts of shotgun read data, and assembly of the reads can be challenging, especially for organisms without template sequences. We study the power of genome comparison based on shotgun read data without assembly using three alignment-free sequence comparison statistics, D2, D2* and D2S, both theoretically and by simulations. Theoretical formulas for the power of detecting the relationship between two sequences related through a common motif model are derived. It is shown that both D2* and D2S outperform D2 for detecting the relationship between two sequences based on NGS data. We then study the effects of length of the tuple, read length, coverage, and sequencing error on the power of D2* and D2S. Finally, variations of these statistics, d2, d2* and d2S, respectively, are used to first cluster five mammalian species with known phylogenetic relationships, and then cluster 13 tree species whose complete genome sequences are not available using NGS shotgun reads. The clustering results using d2S are consistent with biological knowledge for the 5 mammalian and 13 tree species, respectively. Thus, the statistic d2S provides a powerful alignment-free comparison tool to study the relationships among different organisms based on NGS read data without assembly.
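The word-count statistics can be sketched for two plain sequences. The sketch uses the usual definitions, in which the uncentred statistic sums the products of k-mer counts and the self-standardised variant centres each count by its expectation; it assumes an i.i.d. background model with letter probabilities `base_probs` and operates on whole sequences rather than on sets of NGS reads, so it simplifies the setting studied in the abstract. Function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product

def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def d2(seq_a, seq_b, k):
    """Uncentred statistic: sum over k-mers of the product of the two counts."""
    a, b = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    return sum(count * b[word] for word, count in a.items())

def d2s(seq_a, seq_b, k, base_probs):
    """Self-standardised variant with counts centred under an i.i.d. background."""
    a, b = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    n_a, n_b = len(seq_a) - k + 1, len(seq_b) - k + 1
    total = 0.0
    for letters in product(sorted(base_probs), repeat=k):
        word = "".join(letters)
        p_word = math.prod(base_probs[c] for c in letters)
        xa = a[word] - n_a * p_word  # centred count in the first sequence
        xb = b[word] - n_b * p_word  # centred count in the second sequence
        norm = math.sqrt(xa * xa + xb * xb)
        if norm > 0:
            total += xa * xb / norm
    return total
```

Enumerating all 4^k words keeps the code short but limits it to small k; a practical implementation would iterate only over observed words and handle the unobserved ones in closed form.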
Sequencing research requires efficient computation. Few programs use already known information about DNA variants when aligning sequence data to the reference map. New program findmap.f90 reads the previous variant list before aligning sequence, calling variant alleles, and summing the allele counts...
Brandt, Bernd W.; Feenstra, K. Anton; Heringa, Jaap
Many protein families contain sub-families with functional specialization, such as binding different ligands or being involved in different protein–protein interactions. A small number of amino acids generally determine functional specificity. The identification of these residues can aid the understanding of protein function and help find targets for experimental analysis. Here, we present multi-Harmony, an interactive web server for detecting sub-type-specific sites in proteins starting from a multiple sequence alignment. Combining our Sequence Harmony (SH) and multi-Relief (mR) methods in one web server allows simultaneous analysis and comparison of specificity residues; furthermore, both methods have been significantly improved and extended. SH has been extended to cope with more than two sub-groups. mR has been changed from a sampling implementation to a deterministic one, making it more consistent and user friendly. For both methods Z-scores are reported. The multi-Harmony web server produces a dynamic output page, which includes interactive connections to the Jalview and Jmol applets, thereby allowing interactive analysis of the results. Multi-Harmony is available at http://www.ibi.vu.nl/programs/shmrwww. PMID:20525785
Hunt, Fern Y; Kearsley, Anthony J; O'Gallagher, Agnes
Current methods for aligning biological sequences are based on dynamic programming algorithms. If large numbers of sequences or a number of long sequences are to be aligned, the required computations are expensive in memory and central processing unit (CPU) time. In an attempt to bring the tools of large-scale linear programming (LP) methods to bear on this problem, we formulate the alignment process as a controlled Markov chain and construct a suggested alignment based on policies that minimise the expected total cost of the alignment. We discuss the LP associated with the total expected discounted cost and show the results of a solution of the problem based on a primal-dual interior point method. Model parameters, estimated from aligned sequences, along with cost function parameters are used to construct the objective and constraint conditions of the LP problem. This article concludes with a discussion of some alignments obtained from the LP solutions of problems with various cost function parameter values.
Fieth, Pascal; Hartmann, Alexander K.
Assessing the significance of alignment scores of optimally aligned DNA or amino acid sequences can be achieved via the knowledge of the score distribution of random sequences. But this requires obtaining the distribution in the biologically relevant high-scoring region, where the probabilities are exponentially small. For gapless local alignments of infinitely long sequences this distribution is known analytically to follow a Gumbel distribution. Distributions for gapped local alignments and global alignments of finite lengths can only be obtained numerically. To obtain results for the small-probability region, specific statistical mechanics-based rare-event algorithms can be applied. In previous studies, this was achieved for pairwise alignments. They showed that, contrary to results from previous simple sampling studies, strong deviations from the Gumbel distribution occur in the case of finite sequence lengths. Here we extend the studies to multiple sequence alignments with gaps, which are much more relevant for practical applications in molecular biology. We study the distributions of scores over a large range of the support, reaching probabilities as small as 10^-160, for global and local (sum-of-pair scores) multiple alignments. We find that even after suitable rescaling, eliminating the sequence-length dependence, the distributions for multiple alignment differ from the pairwise alignment case. Furthermore, we also show that the previously discussed Gaussian correction to the Gumbel distribution needs to be refined, also for the case of pairwise alignments.
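For reference, the Gumbel baseline against which such deviations are measured can be written in its common form P(S >= s) = 1 - exp(-K m n exp(-lambda s)). The sketch below evaluates that tail probability; the parameters `lam` and `K` must be supplied by the caller and are assumptions of the example, and the function does not include the finite-length or Gaussian corrections discussed in the abstract.

```python
import math

def gumbel_pvalue(score, lam, K, m, n):
    """Tail probability P(S >= score) under the Gumbel form for lengths m and n."""
    expected_hits = K * m * n * math.exp(-lam * score)
    # -expm1(-x) equals 1 - exp(-x) but keeps precision when x is tiny.
    return -math.expm1(-expected_hits)
```

Using `expm1` matters in exactly the regime of interest here: for very small tail probabilities a naive `1 - exp(-x)` rounds to zero long before the true value does.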
Existing tools for multiple-sequence alignment focus on aligning protein sequence or protein-coding DNA sequence, and are often based on extensions to Needleman-Wunsch-like pairwise alignment methods. We introduce a new tool, Sigma, with a new algorithm and scoring scheme designed specifically for non-coding DNA sequence. This problem acquires importance with the increasing number of published sequences of closely-related species. In particular, studies of gene regulation seek to take advantage of comparative genomics, and recent algorithms for finding regulatory sites in phylogenetically-related intergenic sequence require alignment as a preprocessing step. Much can also be learned about evolution from intergenic DNA, which tends to evolve faster than coding DNA. Sigma uses a strategy of seeking the best possible gapless local alignments (a strategy earlier used by DiAlign), at each step making the best possible alignment consistent with existing alignments, and scores the significance of the alignment based on the lengths of the aligned fragments and a background model which may be supplied or estimated from an auxiliary file of intergenic DNA. Comparative tests of Sigma with five earlier algorithms on synthetic data generated to mimic real data show excellent performance, with Sigma balancing high "sensitivity" (more bases aligned) with effective filtering of "incorrect" alignments. With real data, while "correctness" can't be directly quantified for the alignment, running the PhyloGibbs motif finder on pre-aligned sequence suggests that Sigma's alignments are superior. By taking into account the peculiarities of non-coding DNA, Sigma fills a gap in the toolbox of bioinformatics.
Taly, Jean-Francois; Magis, Cedrik; Bussotti, Giovanni; Chang, Jia-Ming; Di Tommaso, Paolo; Erb, Ionas; Espinosa-Carrasco, Jose; Kemena, Carsten; Notredame, Cedric
T-Coffee (Tree-based consistency objective function for alignment evaluation) is a versatile multiple sequence alignment (MSA) method suitable for aligning most types of biological sequences. The main strength of T-Coffee is its ability to combine third party aligners and to integrate structural (or homology) information when building MSAs. The series of protocols presented here show how the package can be used to multiply align proteins, RNA and DNA sequences. The protein section shows how users can select the most suitable T-Coffee mode for their data set. Detailed protocols include T-Coffee, the default mode, M-Coffee, a meta version able to combine several third party aligners into one, PSI (position-specific iterated)-Coffee, the homology extended mode suitable for remote homologs and Expresso, the structure-based multiple aligner. We then also show how the T-RMSD (tree based on root mean square deviation) option can be used to produce a functionally informative structure-based clustering. RNA alignment procedures are described for using R-Coffee, a mode able to use predicted RNA secondary structures when aligning RNA sequences. DNA alignments are illustrated with Pro-Coffee, a multiple aligner specific to promoter regions. We also present some of the many reformatting utilities bundled with T-Coffee. The package is an open-source freeware available from http://www.tcoffee.org/.
Tabei, Yasuo; Asai, Kiyoshi
Non-coding RNAs (ncRNAs) show a unique evolutionary process in which the substitutions of distant bases are correlated in order to conserve the secondary structure of the ncRNA molecule. Therefore, the multiple alignment method for the detection of ncRNAs should take into account both the primary sequence and the secondary structure. Recently, there has been intense focus on multiple alignment investigations for the detection of ncRNAs; however, most of the proposed methods are designed for global multiple alignments. For this reason, these methods are not appropriate to identify locally conserved ncRNAs among genomic sequences. A more efficient local multiple alignment method for the detection of ncRNAs is required. We propose a new local multiple alignment method for the detection of ncRNAs. This method uses a local multiple alignment construction procedure inspired by ProDA, which is a local multiple aligner program for protein sequences with repeated and shuffled elements. To align sequences based on secondary structure information, we propose a new alignment model which incorporates secondary structure features. We define the conditional probability of an alignment via a conditional random field and use a gamma-centroid estimator to align sequences. The locally aligned subsequences are clustered into blocks of approximately globally alignable subsequences between pairwise alignments. Finally, these blocks are multiply aligned via MXSCARNA. In benchmark experiments, we demonstrate the high ability of the implemented software, SCARNA_LM, for local multiple alignment for the detection of ncRNAs. The C++ source code for SCARNA_LM and its experimental datasets are available at http://www.ncrna.org/software/scarna_lm/download. Supplementary data are available at Bioinformatics online.
Quinn, Terrance; Sinkala, Zachariah
We develop a general method for computing extreme value distribution (Gumbel, 1958) parameters for gapped alignments. Our approach uses mixture distribution theory to obtain associated BLOSUM matrices for gapped alignments, which in turn are used for determining significance of gapped alignment scores for pairs of biological sequences. We compare our results with parameters already obtained in the literature.
Lu, Chin Lung; Huang, Yen Pin
Recently, the concept of the constrained sequence alignment was proposed to incorporate the knowledge of biologists about structures/functionalities/consensuses of their datasets into sequence alignment such that the user-specified residues/nucleotides are aligned together in the computed alignment. The currently developed programs use the so-called progressive approach to efficiently obtain a constrained alignment of several sequences. However, the kernels of these programs, the dynamic programming algorithms for computing an optimal constrained alignment between two sequences, run in O(gamma n^2) memory, where gamma is the number of the constraints and n is the maximum of the lengths of sequences. As a result, such a high memory requirement limits the overall programs to align short sequences only. We adopt the divide-and-conquer approach to design a memory-efficient algorithm for computing an optimal constrained alignment between two sequences, which greatly reduces the memory requirement of the dynamic programming approaches at the expense of a small constant factor in CPU time. This new algorithm consumes only O(alpha n) space, where alpha is the sum of the lengths of constraints and usually alpha < n in practical applications. Based on this algorithm, we have developed a memory-efficient tool for multiple sequence alignment with constraints. http://genome.life.nctu.edu.tw/MUSICME.
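The divide-and-conquer idea for saving memory can be illustrated with the classic Hirschberg scheme for unconstrained global alignment. This is a sketch of the general technique only: it does not handle constraints and is not the algorithm of the tool described above, and the scoring values (match 1, mismatch -1, linear gap -1) are assumptions chosen for the example.

```python
def nw_last_row(a, b, match=1, mismatch=-1, gap=-1):
    """Last row of the global alignment score table, keeping only two rows."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev

def hirschberg(a, b, match=1, mismatch=-1, gap=-1):
    """Return an optimal global alignment of a and b as two gapped strings."""
    if not a:
        return "-" * len(b), b
    if not b:
        return a, "-" * len(a)
    if len(a) == 1:
        # Base case: pair the single character with its best partner in b,
        # unless leaving it unpaired (two extra gaps) scores higher.
        best_j = max(range(len(b)), key=lambda j: match if a == b[j] else mismatch)
        pair = match if a == b[best_j] else mismatch
        if pair >= 2 * gap:
            return "-" * best_j + a + "-" * (len(b) - best_j - 1), b
        return a + "-" * len(b), "-" + b
    mid = len(a) // 2
    # Scores of the top half against prefixes of b, and of the bottom half
    # against suffixes of b (computed on reversed strings).
    left = nw_last_row(a[:mid], b, match, mismatch, gap)
    right = nw_last_row(a[mid:][::-1], b[::-1], match, mismatch, gap)
    split = max(range(len(b) + 1), key=lambda j: left[j] + right[len(b) - j])
    a1, b1 = hirschberg(a[:mid], b[:split], match, mismatch, gap)
    a2, b2 = hirschberg(a[mid:], b[split:], match, mismatch, gap)
    return a1 + a2, b1 + b2
```

Only two score rows are held at any time, so working memory grows linearly with sequence length, while the repeated score computations roughly double the running time, which matches the trade-off of "a small constant factor in CPU time" described above.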
Liu, Guozhen; Uddin, Monica; Islam, Munirul; Goodman, Morris; Grossman, Lawrence I; Romero, Roberto; Wildman, Derek E
Background Rapidly accumulating genome sequence data from multiple species offer powerful opportunities for the detection of DNA sequence evolution. Phylogenetic tree construction and codon-based tests for natural selection are the prevailing tools used to detect functionally important evolutionary change in protein coding sequences. These analyses often require multiple DNA sequence alignments that maintain the correct reading frame for each collection of putative orthologous sequences. Since this feature is not available in most alignment tools, codon reading frames often must be checked manually before evolutionary analyses can commence. Results Here we report an online codon-preserved alignment tool (OCPAT) that generates multiple sequence alignments automatically from the coding sequences of any list of human gene IDs and their putative orthologs from genomes of other vertebrate tetrapods. OCPAT is programmed to extract putative orthologous genes from genomes and to align the orthologs with the reading frame maintained in all species. OCPAT also optimizes the alignment by trimming the most variable alignment regions at the 5' and 3' ends of each gene. The resulting output of alignments is returned in several formats, which facilitates further molecular evolutionary analyses by appropriate available software. Alignments are generally robust and reliable, retaining the correct reading frame. The tool can serve as the first step for comparative genomic analyses of protein-coding gene sequences including phylogenetic tree reconstruction and detection of natural selection. We aligned 20,658 human RefSeq mRNAs using OCPAT. Most alignments are missing sequence(s) from at least one species; however, functional annotation clustering of the ~1700 transcripts that were alignable to all species shows that genes involved in multi-subunit protein complexes are highly conserved. Conclusion The OCPAT program facilitates large-scale evolutionary and phylogenetic analyses of
Altschul, Stephen F.; Wootton, John C.; Zaslavsky, Elena; Yu, Yi-Kuo
Most pairwise and multiple sequence alignment programs seek alignments with optimal scores. Central to defining such scores is selecting a set of substitution scores for aligned amino acids or nucleotides. For local pairwise alignment, substitution scores are implicitly of log-odds form. We now extend the log-odds formalism to multiple alignments, using Bayesian methods to construct “BILD” (“Bayesian Integral Log-odds”) substitution scores from prior distributions describing columns of related letters. This approach has been used previously only to define scores for aligning individual sequences to sequence profiles, but it has much broader applicability. We describe how to calculate BILD scores efficiently, and illustrate their uses in Gibbs sampling optimization procedures, gapped alignment, and the construction of hidden Markov model profiles. BILD scores enable automated selection of optimal motif and domain model widths, and can inform the decision of whether to include a sequence in a multiple alignment, and the selection of insertion and deletion locations. Other applications include the classification of related sequences into subfamilies, and the definition of profile-profile alignment scores. Although a fully realized multiple alignment program must rely upon more than substitution scores, many existing multiple alignment programs can be modified to employ BILD scores. We illustrate how simple BILD score based strategies can enhance the recognition of DNA binding domains, including the Api-AP2 domain in Toxoplasma gondii and Plasmodium falciparum. PMID:20657661
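A log-odds score for a whole alignment column can be sketched by comparing the probability of the column under a prior over letter frequencies with its probability under an independent background. The sketch below assumes a single Dirichlet prior with pseudocounts `alphas`, which gives a closed-form integral; whether the published BILD scores use a single Dirichlet or a richer prior is not stated in the abstract, so this is an illustration of the log-odds idea rather than the authors' exact formula. The function name is hypothetical.

```python
import math

def dirichlet_log_odds(column, alphas, background):
    """log2 of P(column | Dirichlet prior) / P(column | independent background)."""
    counts = {letter: 0 for letter in alphas}
    for letter in column:
        counts[letter] += 1
    alpha_sum = sum(alphas.values())
    n = len(column)
    # Log probability of the ordered column with letter frequencies integrated out.
    log_q = math.lgamma(alpha_sum) - math.lgamma(alpha_sum + n)
    for letter, alpha in alphas.items():
        log_q += math.lgamma(alpha + counts[letter]) - math.lgamma(alpha)
    # Log probability of the same letters drawn independently from the background.
    log_p = sum(math.log(background[letter]) for letter in column)
    return (log_q - log_p) / math.log(2)
```

With a uniform prior and background over DNA, a column of two identical letters scores positively and a column of two different letters scores negatively, which is the behaviour expected of a substitution score generalised from pairs to columns.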
Katoh, Kazutaka; Standley, Daron M
We report a major update of the MAFFT multiple sequence alignment program. This version has several new features, including options for adding unaligned sequences into an existing alignment, adjustment of direction in nucleotide alignment, constrained alignment and parallel processing, which were implemented after the previous major update. This report shows actual examples to explain how these features work, alone and in combination. Some examples incorrectly aligned by MAFFT are also shown to clarify its limitations. We discuss how to avoid misalignments, and our ongoing efforts to overcome such limitations.
Background Alignment of amino acid sequences by means of dynamic programming is a cornerstone sequence comparison method. The quality of alignments produced by dynamic programming critically depends on the choice of the alignment scoring function. Therefore, for a specific alignment problem one needs a way of selecting the best performing scoring function. This work is focused on the issue of finding optimized protein family- and fold-specific scoring functions for global similarity matrix-based sequence alignment. Findings I utilize a comprehensive set of reference alignments obtained from structural superposition of homologous and analogous proteins to design a quantitative statistical framework for evaluating the performance of alignment scoring functions in global pairwise sequence alignment. This framework is applied to study how existing general-purpose amino acid similarity matrices perform on individual protein families and structural folds, and to compare them to family-specific and fold-specific matrices derived in this work. I describe an adaptive alignment procedure that automatically selects an appropriate similarity matrix and optimized gap penalties based on the properties of the sequences being aligned. Conclusions The results of this work indicate that using family-specific similarity matrices significantly improves the quality of the alignment of homologous sequences over the traditional sequence alignment based on a single general-purpose similarity matrix. However, using fold-specific similarity matrices can only marginally improve sequence alignment of proteins that share the same structural fold but do not share a common evolutionary origin. The family-specific matrices derived in this work and the optimized gap penalties are available at http://taurus.crc.albany.edu/fsm. PMID:21846354
Roca, Alberto I; Almada, Albert E; Abajian, Aaron C
Background Multiple sequence alignments are a fundamental tool for the comparative analysis of proteins and nucleic acids. However, large data sets are no longer manageable for visualization and investigation using the traditional stacked sequence alignment representation. Results We introduce ProfileGrids that represent a multiple sequence alignment as a matrix color-coded according to the residue frequency occurring at each column position. JProfileGrid is a Java application for computing and analyzing ProfileGrids. A dynamic interaction with the alignment information is achieved by changing the ProfileGrid color scheme, by extracting sequence subsets at selected residues of interest, and by relating alignment information to residue physical properties. Conserved family motifs can be identified by the overlay of similarity plot calculations on a ProfileGrid. Figures suitable for publication can be generated from the saved spreadsheet output of the colored matrices as well as by the export of conservation information for use in the PyMOL molecular visualization program. We demonstrate the utility of ProfileGrids on 300 bacterial homologs of the RecA family – a universally conserved protein involved in DNA recombination and repair. Careful attention was paid to curating the collected RecA sequences since ProfileGrids allow the easy identification of rare residues in an alignment. We relate the RecA alignment sequence conservation to the following three topics: the recently identified DNA binding residues, the unexplored MAW motif, and a unique Bacillus subtilis RecA homolog sequence feature. Conclusion ProfileGrids allow large protein families to be visualized more effectively than the traditional stacked sequence alignment form. This new graphical representation facilitates the determination of the sequence conservation at residue positions of interest, enables the examination of structural patterns by using residue physical properties, and permits the display
Shang, Lei; Gardner, David P; Xu, Weijia; Cannone, Jamie J; Miranker, Daniel P; Ozer, Stuart; Gutell, Robin R
The analysis of RNA sequences, once a small niche field for a small collection of scientists whose primary emphasis was the structure and function of a few RNA molecules, has grown most significantly with the realizations that 1) RNA is implicated in many more functions within the cell, and 2) the analysis of ribosomal RNA sequences is revealing more about the microbial ecology within all biological and environmental systems. The accurate and rapid alignment of these RNA sequences is essential to decipher the maximum amount of information from this data. Two computer systems that utilize the Gutell lab's RNA Comparative Analysis Database (rCAD) were developed to align sequences to an existing template alignment available at the Gutell lab's Comparative RNA Web (CRW) Site. Multiple dimensions of cross-indexed information are contained within the relational database--rCAD, including sequence alignments, the NCBI phylogenetic tree, and comparative secondary structure information for each aligned sequence. The first program, CRWAlign-1 creates a phylogenetic-based sequence profile for each column in the alignment. The second program, CRWAlign-2 creates a profile based on phylogenetic, secondary structure, and sequence information. Both programs utilize their profiles to align new sequences into the template alignment. The accuracies of the two CRWAlign programs were compared with the best template-based rRNA alignment programs and the best de-novo alignment programs. We have compared our programs with a total of eight alternative alignment methods on different sets of 16S rRNA alignments with sequence percent identities ranging from 50% to 100%. Both CRWAlign programs were superior to these other programs in accuracy and speed. Both CRWAlign programs can be used to align the very extensive amount of RNA sequencing that is generated due to the rapid next-generation sequencing technology. This latter technology is augmenting the new paradigm that RNA is intimately
Background The analysis of RNA sequences, once a small niche field for a small collection of scientists whose primary emphasis was the structure and function of a few RNA molecules, has grown most significantly with the realizations that 1) RNA is implicated in many more functions within the cell, and 2) the analysis of ribosomal RNA sequences is revealing more about the microbial ecology within all biological and environmental systems. The accurate and rapid alignment of these RNA sequences is essential to decipher the maximum amount of information from this data. Methods Two computer systems that utilize the Gutell lab's RNA Comparative Analysis Database (rCAD) were developed to align sequences to an existing template alignment available at the Gutell lab's Comparative RNA Web (CRW) Site. Multiple dimensions of cross-indexed information are contained within the relational database - rCAD, including sequence alignments, the NCBI phylogenetic tree, and comparative secondary structure information for each aligned sequence. The first program, CRWAlign-1 creates a phylogenetic-based sequence profile for each column in the alignment. The second program, CRWAlign-2 creates a profile based on phylogenetic, secondary structure, and sequence information. Both programs utilize their profiles to align new sequences into the template alignment. Results The accuracies of the two CRWAlign programs were compared with the best template-based rRNA alignment programs and the best de-novo alignment programs. We have compared our programs with a total of eight alternative alignment methods on different sets of 16S rRNA alignments with sequence percent identities ranging from 50% to 100%. Both CRWAlign programs were superior to these other programs in accuracy and speed. Conclusions Both CRWAlign programs can be used to align the very extensive amount of RNA sequencing that is generated due to the rapid next-generation sequencing technology. This latter technology is augmenting the
Barlowe, Scott; Coan, Heather B; Youker, Robert T
Coan, Heather B.; Youker, Robert T.
Deng, Xin; Cheng, Jianlin
Multiple Sequence Alignment (MSA) is a basic tool for bioinformatics research and analysis. It has been used in almost all bioinformatics tasks, such as protein structure modeling, gene and protein function prediction, DNA motif recognition, and phylogenetic analysis. Therefore, improving the accuracy of multiple sequence alignment is important for advancing many bioinformatics fields. We designed and developed a new method, MSACompro, to synergistically incorporate predicted secondary structure, relative solvent accessibility, and residue-residue contact information into the currently most accurate posterior probability-based MSA methods to improve the accuracy of multiple sequence alignments. The method differs from multiple sequence alignment methods (e.g. 3D-Coffee) that use the tertiary structure information of some sequences, since the structural information in our method is predicted entirely from sequences. To the best of our knowledge, applying predicted relative solvent accessibility and contact maps to multiple sequence alignment is novel. Rigorous benchmarking of our method on the standard benchmarks (i.e. BAliBASE, SABmark and OXBENCH) clearly demonstrated that incorporating predicted protein structural information improves multiple sequence alignment accuracy over the leading multiple protein sequence alignment tools that do not use this information, such as MSAProbs, ProbCons, Probalign, T-coffee, MAFFT and MUSCLE. The performance of the method is comparable to that of the state-of-the-art method PROMALS, which uses structural features and additional homologous sequences, with only slightly lower scores. MSACompro is an efficient and reliable multiple protein sequence alignment tool that can effectively incorporate predicted protein structural information into multiple sequence alignment. The software is available at http://sysbio.rnet.missouri.edu/multicom_toolbox/.
Profiti, Giuseppe; Fariselli, Piero; Casadio, Rita
The next-generation sequencing era requires reliable, fast and efficient approaches for the accurate annotation of the ever-increasing number of biological sequences and their variations. Transfer of annotation upon similarity search is a standard approach. All-against-all protein comparison is a preliminary step of several available methods that annotate sequences based on information already present in databases. Given the current volume of sequences, methods are needed to pre-process data and reduce the time of sequence comparison. We present an algorithm that optimizes the partition of a large volume of sequences (the whole database) into sets where sequence length values (in residues) are constrained depending on a bounded minimal and expected alignment coverage. The idea is to optimally group protein sequences according to their length, and then compute the all-against-all sequence alignments among sequences that fall in a selected length range. We describe a mathematically optimal solution and we show that our method leads to a 5-fold speed-up in real world cases. The software is available for downloading at http://www.biocomp.unibo.it/∼giuseppe/partitioning.html. Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press.
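The length-based pruning idea can be sketched as follows. This is not the authors' optimal partitioning algorithm; it is a simple illustration under the assumption that coverage is measured relative to the longer sequence, so two sequences of lengths l1 <= l2 can reach coverage c only if l1 / l2 >= c. The function name `candidate_pairs` is hypothetical.

```python
def candidate_pairs(lengths, min_coverage):
    """Return index pairs whose length ratio (shorter / longer) is >= min_coverage.

    Illustrative filter only: pairs outside this set cannot reach the required
    coverage under the assumption stated above, so they need not be aligned.
    """
    if not 0 < min_coverage <= 1:
        raise ValueError("min_coverage must be in (0, 1]")

    # Process sequences in order of increasing length with a sliding window.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    pairs = []
    start = 0
    for pos, j in enumerate(order):
        # Drop sequences that are too short to pair with sequence j.
        while lengths[order[start]] < min_coverage * lengths[j]:
            start += 1
        for i in order[start:pos]:
            pairs.append((min(i, j), max(i, j)))
    return pairs


# Five toy sequence lengths: only 2 of the 10 possible pairs survive at 90% coverage.
toy_lengths = [100, 95, 50, 48, 200]
kept = sorted(candidate_pairs(toy_lengths, 0.9))
```

Because the window start only moves forward, the filter itself costs little beyond the initial sort; the savings come from the alignments that are never attempted.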
Malhis, Nawar; Butterfield, Yaron S N; Ester, Martin; Jones, Steven J M
A plethora of alignment tools have been created that are designed to best fit different types of alignment conditions. While some of these are made for aligning Illumina Sequence Analyzer reads, none of these are fully utilizing its probability (prb) output. In this article, we will introduce a new alignment approach (Slider) that reduces the alignment problem space by utilizing each read base's probabilities given in the prb files. Compared with other aligners, Slider has higher alignment accuracy and efficiency. In addition, given that Slider matches bases with probabilities other than the most probable, it significantly reduces the percentage of base mismatches. The result is that its SNP predictions are more accurate than other SNP prediction approaches used today that start from the most probable sequence, including those using base quality.
Abascal, Federico; Zardoya, Rafael; Telford, Maximilian J
We present TranslatorX, a web server designed to align protein-coding nucleotide sequences based on their corresponding amino acid translations. Many comparisons between biological sequences (nucleic acids and proteins) involve the construction of multiple alignments. Alignments represent a statement regarding the homology between individual nucleotides or amino acids within homologous genes. As protein-coding DNA sequences evolve as triplets of nucleotides (codons) and it is known that sequence similarity degrades more rapidly at the DNA than at the amino acid level, alignments are generally more accurate when based on amino acids than on their corresponding nucleotides. TranslatorX novelties include: (i) use of all documented genetic codes and the possibility of assigning different genetic codes for each sequence; (ii) a battery of different multiple alignment programs; (iii) translation of ambiguous codons when possible; (iv) an innovative criterion to clean nucleotide alignments with GBlocks based on protein information; and (v) a rich output, including Jalview-powered graphical visualization of the alignments, codon-based alignments coloured according to the corresponding amino acids, measures of compositional bias and first, second and third codon position specific alignments. The TranslatorX server is freely available at http://translatorx.co.uk.
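The step of mapping nucleotides back onto a protein alignment can be sketched briefly. This is an illustration of the general technique, not TranslatorX's code; the function name `thread_codons` is hypothetical, and it assumes the protein row has already been aligned and that the coding sequence contains exactly one codon per residue (no trailing stop codon).

```python
def thread_codons(aligned_protein, nucleotides, gap="-"):
    """Map an ungapped coding sequence onto one gapped row of a protein alignment.

    Every residue is replaced by its codon and every protein gap by three
    nucleotide gaps, so codons stay intact in the nucleotide alignment.
    """
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(nucleotides) != 3 * n_residues:
        raise ValueError("nucleotide length must be 3x the number of residues")

    out = []
    pos = 0
    for aa in aligned_protein:
        if aa == gap:
            out.append(gap * 3)
        else:
            out.append(nucleotides[pos:pos + 3])
            pos += 3
    return "".join(out)


codon_row = thread_codons("M-K", "ATGAAA")  # "ATG---AAA"
```

Applying the function to every row of the protein alignment gives a nucleotide alignment whose gaps always fall on codon boundaries.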
Remmert, Michael; Biegert, Andreas; Hauser, Andreas; Söding, Johannes
Sequence-based protein function and structure prediction depends crucially on sequence-search sensitivity and accuracy of the resulting sequence alignments. We present an open-source, general-purpose tool that represents both query and database sequences by profile hidden Markov models (HMMs): 'HMM-HMM-based lightning-fast iterative sequence search' (HHblits; http://toolkit.genzentrum.lmu.de/hhblits/). Compared to the sequence-search tool PSI-BLAST, HHblits is faster owing to its discretized-profile prefilter, has 50-100% higher sensitivity and generates more accurate alignments.
Aniba, Mohamed Radhouene; Poch, Olivier; Marchler-Bauer, Aron; Thompson, Julie Dawn
Multiple sequence alignment (MSA) is a cornerstone of modern molecular biology and represents a unique means of investigating the patterns of conservation and diversity in complex biological systems. Many different algorithms have been developed to construct MSAs, but previous studies have shown that no single aligner consistently outperforms the rest. This has led to the development of a number of 'meta-methods' that systematically run several aligners and merge the output into one single solution. Although these methods generally produce more accurate alignments, they are inefficient because all the aligners need to be run first and the choice of the best solution is made a posteriori. Here, we describe the development of a new expert system, AlexSys, for the multiple alignment of protein sequences. AlexSys incorporates an intelligent inference engine to automatically select an appropriate aligner a priori, depending only on the nature of the input sequences. The inference engine was trained on a large set of reference multiple alignments, using a novel machine learning approach. Applying AlexSys to a test set of 178 alignments, we show that the expert system represents a good compromise between alignment quality and running time, making it suitable for high throughput projects. AlexSys is freely available from http://alnitak.u-strasbg.fr/∼aniba/alexsys.
Background With the maturation of next-generation DNA sequencing (NGS) technologies, the throughput of DNA sequencing reads has soared to over 600 gigabases from a single instrument run. General purpose computing on graphics processing units (GPGPU) extracts the computing power from hundreds of parallel stream processors within graphics processing cores and provides a cost-effective and energy-efficient alternative to traditional high-performance computing (HPC) clusters. In this article, we describe the implementation of BarraCUDA, a GPGPU sequence alignment software that is based on BWA, to accelerate the alignment of sequencing reads generated by these instruments to a reference DNA sequence. Findings Using the NVIDIA Compute Unified Device Architecture (CUDA) software development environment, we ported the most computationally intensive alignment component of BWA to GPU to take advantage of the massive parallelism. As a result, BarraCUDA offers an order-of-magnitude performance boost in alignment throughput when compared to a CPU core while delivering the same level of alignment fidelity. The software is also capable of supporting multiple CUDA devices in parallel to further accelerate the alignment throughput. Conclusions BarraCUDA is designed to take advantage of the parallelism of GPU to accelerate the alignment of millions of sequencing reads generated by NGS instruments. By doing this, we could, at least in part, streamline the current bioinformatics pipeline such that the wider scientific community could benefit from the sequencing technology. BarraCUDA is currently available from http://seqbarracuda.sf.net PMID:22244497
Pruesse, Elmar; Peplies, Jörg; Glöckner, Frank Oliver
In the analysis of homologous sequences, computation of multiple sequence alignments (MSAs) has become a bottleneck. This is especially troublesome for marker genes like the ribosomal RNA (rRNA) where already millions of sequences are publicly available and individual studies can easily produce hundreds of thousands of new sequences. Methods have been developed to cope with such numbers, but further improvements are needed to meet accuracy requirements. In this study, we present the SILVA Incremental Aligner (SINA) used to align the rRNA gene databases provided by the SILVA ribosomal RNA project. SINA uses a combination of k-mer searching and partial order alignment (POA) to maintain very high alignment accuracy while satisfying high throughput performance demands. SINA was evaluated in comparison with the commonly used high throughput MSA programs PyNAST and mothur. The three BRAliBase III benchmark MSAs could be reproduced with 99.3, 97.6 and 96.1% accuracy. A larger benchmark MSA comprising 38 772 sequences could be reproduced with 98.9 and 99.3% accuracy using reference MSAs comprising 1000 and 5000 sequences, respectively. SINA was able to achieve higher accuracy than PyNAST and mothur in all performed benchmarks. Alignment of up to 500 sequences using the latest SILVA SSU/LSU Ref datasets as reference MSA is offered at http://www.arb-silva.de/aligner. This page also links to Linux binaries, user manual and tutorial. SINA is made available under a personal use license.
Daily, Jeffrey A.
Sequence alignment algorithms are a key component of many bioinformatics applications. Though various fast Smith-Waterman local sequence alignment implementations have been developed for x86 CPUs, most are embedded into larger database search tools. In addition, fast implementations of Needleman-Wunsch global sequence alignment and its semi-global variants are not as widespread. This article presents the first software library for local, global, and semi-global pairwise intra-sequence alignments and improves the performance of previous intra-sequence implementations. As a result, a faster intra-sequence pairwise alignment implementation is described and benchmarked. Using a 375 residue query sequence, a speed of 136 billion cell updates per second (GCUPS) was achieved on a dual Intel Xeon E5-2670 12-core processor system, the highest reported for an implementation based on Farrar's 'striped' approach. When using only a single thread, parasail was 1.7 times faster than Rognes's SWIPE. For many score matrices, parasail is faster than BLAST. The software library is designed for 64-bit Linux, OS X, or Windows on processors with SSE2, SSE4.1, or AVX2. Source code is available from https://github.com/jeffdaily/parasail under the Battelle BSD-style license. In conclusion, applications that require optimal alignment scores could benefit from the improved performance. For the first time, SIMD global, semi-global, and local alignments are available in a stand-alone C library.
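For readers unfamiliar with the "cell updates" being counted, a plain reference version of the Smith-Waterman recurrence may help. The sketch below is a scalar Python illustration with a linear gap penalty, not parasail's vectorized striped implementation or its API; the function name and the default scores are assumptions chosen for the example.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Optimal local alignment score with a linear gap penalty.

    Each inner-loop iteration is one 'cell update' of the dynamic-programming
    table; only the previous row is kept, so memory is O(len(b)).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment
                prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


score = smith_waterman_score("ACACACTA", "AGCACACA")  # 12
```

The table has len(a) x len(b) cells, so a throughput figure in GCUPS is that product summed over all comparisons, divided by the run time in seconds and by 10^9.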
Background Global positioning systems (GPS) are increasingly being used in health research to determine the location of study participants. Combining GPS data with data collected via travel/activity diaries allows researchers to assess where people travel in conjunction with data about trip purpose and accompaniment. However, linking GPS and diary data is problematic and to date the only method has been to match the two datasets manually, which is time consuming and unlikely to be practical for larger data sets. This paper assesses the feasibility of a new sequence alignment method of linking GPS and travel diary data in comparison with the manual matching method. Methods GPS and travel diary data obtained from a study of children's independent mobility were linked using sequence alignment algorithms to test the proof of concept. Travel diaries were assessed for quality by counting the number of errors and inconsistencies in each participant's set of diaries. The success of the sequence alignment method was compared for higher versus lower quality travel diaries, and for accompanied versus unaccompanied trips. Time taken and percentage of trips matched were compared for the sequence alignment method and the manual method. Results The sequence alignment method matched 61.9% of all trips. Higher quality travel diaries were associated with higher match rates in both the sequence alignment and manual matching methods. The sequence alignment method performed almost as well as the manual method and was an order of magnitude faster. However, the sequence alignment method was less successful at fully matching trips and at matching unaccompanied trips. Conclusions Sequence alignment is a promising method of linking GPS and travel diary data in large population datasets, especially if limitations in the trip detection algorithm are addressed. PMID:22142322
Feng, Weixing; Sang, Peichao; Lian, Deyuan; Dong, Yansheng; Song, Fengfei; Li, Meng; He, Bo; Cao, Fenglin; Liu, Yunlong
Next-generation short-read sequencing is widely utilized in genomic studies. Biological applications require an alignment step to map sequencing reads to the reference genome, before acquiring expected genomic information. This requirement makes alignment accuracy a key factor for effective biological interpretation. Normally, when accounting for measurement errors and single nucleotide polymorphisms, short read mappings with a few mismatches are generally considered acceptable. However, to further improve the efficiency of short-read sequencing alignment, we propose a method to retrieve additional reliably aligned reads (reads with more than a pre-defined number of mismatches), using a Bayesian-based approach. In this method, we first retrieve the sequence context around the mismatched nucleotides within the already aligned reads; these loci contain the genomic features where sequencing errors occur. Then, using the derived pattern, we evaluate the remaining (typically discarded) reads with more than the allowed number of mismatches, and calculate a score that represents the probability that a specific alignment is correct. This strategy allows the extraction of more reliably aligned reads, therefore improving alignment sensitivity. The source code of our tool, ResSeq, can be downloaded from: https://github.com/hrbeubiocenter/Resseq.
Domingues, F S; Lackner, P; Andreeva, A; Sippl, M J
The biological role, biochemical function, and structure of uncharacterized protein sequences are often inferred from their similarity to known proteins. A constant goal is to increase the reliability, sensitivity, and accuracy of alignment techniques to enable the detection of increasingly distant relationships. Development, tuning, and testing of these methods benefit from appropriate benchmarks for the assessment of alignment accuracy. Here, we describe a benchmark protocol to estimate sequence-to-sequence and sequence-to-structure alignment accuracy. The protocol consists of structurally related pairs of proteins and procedures to evaluate alignment accuracy over the whole set. The set of protein pairs covers all the currently known fold types. The benchmark is challenging in the sense that it consists of proteins lacking clear sequence similarity. Correct target alignments are derived from the three-dimensional structures of these pairs by rigid body superposition. An evaluation engine computes the accuracy of alignments obtained from a particular algorithm in terms of alignment shifts with respect to the structure-derived alignments. Using this benchmark, we estimate that the best results can be obtained from a combination of amino acid residue substitution matrices and knowledge-based potentials.
Prlić, Andreas; Yates, Andrew; Bliven, Spencer E.; Rose, Peter W.; Jacobsen, Julius; Troshin, Peter V.; Chapman, Mark; Gao, Jianjiong; Koh, Chuan Hock; Foisy, Sylvain; Holland, Richard; Rimša, Gediminas; Heuer, Michael L.; Brandstätter–Müller, H.; Bourne, Philip E.; Willis, Scooter
Motivation: BioJava is an open-source project for processing of biological data in the Java programming language. We have recently released a new version (3.0.5), which is a major update to the code base that greatly extends its functionality. Results: BioJava now consists of several independent modules that provide state-of-the-art tools for protein structure comparison, pairwise and multiple sequence alignments, working with DNA and protein sequences, analysis of amino acid properties, detection of protein modifications and prediction of disordered regions in proteins as well as parsers for common file formats using a biologically meaningful data model. Availability: BioJava is an open-source project distributed under the Lesser GPL (LGPL). BioJava can be downloaded from the BioJava website (http://www.biojava.org). BioJava requires Java 1.6 or higher. All inquiries should be directed to the BioJava mailing lists. Details are available at http://biojava.org/wiki/BioJava:MailingLists PMID:22877863
Kück, Patrick; Longo, Gary C
Phylogenetic and population genetic studies often deal with multiple sequence alignments that require manipulation or processing steps such as sequence concatenation, sequence renaming, sequence translation or consensus sequence generation. In recent years phylogenetic data sets have expanded from single genes to genome-wide markers comprising hundreds to thousands of loci. Processing of these large phylogenomic data sets is impracticable without using automated process pipelines. Currently no stand-alone or pipeline-compatible program exists that offers a broad range of manipulation and processing steps for multiple sequence alignments in a single process run. Here we present FASconCAT-G, a system-independent editor, which offers various processing options for multiple sequence alignments. The software provides a wide range of possibilities to edit and concatenate multiple nucleotide, amino acid, and structure sequence alignment files for phylogenetic and population genetic purposes. The main options include sequence renaming, file format conversion, sequence translation between nucleotide and amino acid states, consensus generation of specific sequence blocks, sequence concatenation, model selection of amino acid replacement with ProtTest, two types of RY coding as well as site exclusions and extraction of parsimony informative sites. Conveniently, most options can be invoked in combination and performed during a single process run. Additionally, FASconCAT-G prints useful information regarding alignment characteristics and editing processes such as base compositions of single in- and outfiles, sequence areas in a concatenated supermatrix, as well as paired stem and loop regions in secondary structure sequence strings. FASconCAT-G is a command-line driven Perl program that delivers computationally fast and user-friendly processing of multiple sequence alignments for phylogenetic and population genetic applications and is well suited for incorporation into
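Two of the operations listed above, RY coding and concatenation into a supermatrix, are simple enough to sketch. The Python below is an illustration only and does not reproduce FASconCAT-G (a Perl program) or its options; the function names, the basic purine/pyrimidine mapping, and the use of "N" to fill taxa missing from a locus are all assumptions made for this example.

```python
# Purines (A, G) become R, pyrimidines (C, T) become Y; other symbols are kept.
RY_MAP = str.maketrans({"A": "R", "G": "R", "C": "Y", "T": "Y"})


def ry_code(seq):
    """Recode a nucleotide sequence into purine/pyrimidine (R/Y) states."""
    return seq.upper().translate(RY_MAP)


def concatenate(alignments, missing="N"):
    """Concatenate per-locus alignments (dicts: taxon -> aligned row).

    Taxa absent from a locus are padded with `missing` (an assumed filler
    character) so every row of the supermatrix has the same length.
    """
    taxa = sorted({taxon for aln in alignments for taxon in aln})
    parts = {taxon: [] for taxon in taxa}
    for aln in alignments:
        widths = {len(row) for row in aln.values()}
        if len(widths) != 1:
            raise ValueError("rows within one alignment must have equal length")
        width = widths.pop()
        for taxon in taxa:
            parts[taxon].append(aln.get(taxon, missing * width))
    return {taxon: "".join(rows) for taxon, rows in parts.items()}


supermatrix = concatenate([{"a": "AC", "b": "AG"}, {"a": "TT"}])
recoded = {taxon: ry_code(row) for taxon, row in supermatrix.items()}
```

Running both steps in one pass, as here, mirrors the abstract's point that several processing options can be combined in a single run.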
Nagar, Anurag; Hahsler, Michael
Next Generation Sequencing techniques are producing enormous amounts of biological sequence data and analysis becomes a major computational problem. Currently, most analysis, especially the identification of conserved regions, relies heavily on Multiple Sequence Alignment and its various heuristics such as progressive alignment, whose run time grows with the square of the number and the length of the aligned sequences and requires significant computational resources. In this work, we present a method to efficiently discover regions of high similarity across multiple sequences without performing expensive sequence alignment. The method is based on approximating edit distance between segments of sequences using p-mer frequency counts. Then, efficient high-throughput data stream clustering is used to group highly similar segments into so called quasi-alignments. Quasi-alignments have numerous applications such as identifying species and their taxonomic class from sequences, comparing sequences for similarities, and, as in this paper, discovering conserved regions across related sequences. In this paper, we show that quasi-alignments can be used to discover highly similar segments across multiple sequences from related or different genomes efficiently and accurately. Experiments on a large number of unaligned 16S rRNA sequences obtained from the Greengenes database show that the method is able to identify conserved regions which agree with known hypervariable regions in 16S rRNA. Furthermore, the experiments show that the proposed method scales well for large data sets with a run time that grows only linearly with the number and length of sequences, whereas for existing multiple sequence alignment heuristics the run time grows super-linearly. Quasi-alignment-based algorithms can detect highly similar regions and conserved areas across multiple sequences. Since the run time is linear and the sequences are converted into a compact clustering model, we are able to
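The alignment-free comparison described above rests on p-mer frequency profiles. The sketch below shows one way such a profile distance might be computed; the choice of Manhattan distance, the value p = 3, and the function names are assumptions for illustration, not details taken from the abstract.

```python
from collections import Counter


def pmer_counts(segment, p=3):
    """Count all overlapping p-mers in a sequence segment."""
    return Counter(segment[i:i + p] for i in range(len(segment) - p + 1))


def manhattan_distance(counts_a, counts_b):
    """Sum of absolute count differences over all p-mers seen in either profile."""
    keys = set(counts_a) | set(counts_b)
    return sum(abs(counts_a[k] - counts_b[k]) for k in keys)


# Two toy segments that differ by a single substitution (A -> T at position 4).
seg1 = "ACGTACGT"
seg2 = "ACGTTCGT"
dist = manhattan_distance(pmer_counts(seg1), pmer_counts(seg2))  # 6
```

A single substitution changes at most p overlapping p-mers, each removing one count and adding another, so the distance grows by at most 2p per edit. That bound is what makes the profile distance a cheap proxy for edit distance, and computing a profile takes time linear in the segment length.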
Rocha, Jairo; Segura, Joan; Wilson, Richard C.; Dasgupta, Swagata
Motivation: Throughout evolution, homologous proteins have common regions that stay semi-rigid relative to each other and other parts that vary in a more noticeable way. In order to compare the increasing number of structures in the PDB, flexible geometrical alignments that are reliable and easy to use are needed. Results: We present a protein structure alignment method whose main feature is the ability to consider different rigid transformations at different sites, allowing for deformations beyond a global rigid transformation. The performance of the method is comparable with that of the best ones from 10 aligners tested, regarding both the quality of the alignments with respect to hand curated ones, and the classification ability. An analysis of some structure pairs from the literature that need to be matched in a flexible fashion is shown. The use of a series of local transformations can be exported to other classifiers, and a future golden protein similarity measure could benefit from it. Availability: A public server for the program is available at http://dmi.uib.es/ProtDeform/. Supplementary information: All data used, results and examples are available at http://dmi.uib.es/people/jairo/bio/ProtDeform. Supplementary data are available at Bioinformatics online. PMID:19417057
The island of Java (8.0S, 112.0E), perhaps better than any other, illustrates the volcanic origin of Pacific Island groups. Seen in this single view are at least a dozen once active volcano craters. Alignment of the craters even defines the linear fault line of Java as well as the other some 1500 islands of the Indonesian Archipelago. Deep blue water of the Indian Ocean to the south contrasts to the sediment laden waters of the Java Sea to the north.
Ye, Yongtao; Lam, Tak-Wah; Ting, Hing-Fung
This paper describes a new MSA tool called PnpProbs, which constructs better multiple sequence alignments by better handling of guide trees. It classifies sequences into two types: normally related and distantly related. For normally related sequences, it uses an adaptive approach to construct the guide tree needed for progressive alignment; it first estimates the input's discrepancy by computing the standard deviation of their percent identities, and based on this estimate, it chooses the better method to construct the guide tree. For distantly related sequences, PnpProbs abandons the guide tree and uses instead some non-progressive alignment method to generate the alignment. To evaluate PnpProbs, we have compared it with thirteen other popular MSA tools, and PnpProbs has the best alignment scores in all but one test. We have also used it for phylogenetic analysis, and found that the phylogenetic trees constructed from PnpProbs' alignments are closest to the model trees. By combining the strength of the progressive and non-progressive alignment methods, we have developed an MSA tool called PnpProbs. We have compared PnpProbs with thirteen other popular MSA tools and our results showed that our tool usually constructed the best alignments.
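The discrepancy estimate mentioned above, the standard deviation of pairwise percent identities, can be sketched as follows. This is a hedged illustration, not PnpProbs' implementation: it assumes the input rows are already aligned to equal length, defines identity over columns where neither row has a gap, and uses the population standard deviation. The abstract does not give the threshold or the guide-tree methods chosen from, so none are modeled here.

```python
from itertools import combinations
from statistics import pstdev


def percent_identity(row_a, row_b, gap="-"):
    """Percent identity over columns where neither aligned row has a gap."""
    pairs = [(x, y) for x, y in zip(row_a, row_b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def identity_spread(aligned_rows):
    """Population standard deviation of all pairwise percent identities."""
    if len(aligned_rows) < 2:
        raise ValueError("need at least two sequences")
    identities = [percent_identity(a, b) for a, b in combinations(aligned_rows, 2)]
    return pstdev(identities)


# Toy rows (made up): pairwise identities are 75%, 50% and 75%.
spread = identity_spread(["AAAA", "AAAT", "AATT"])
```

A small spread means the input sequences are about equally similar to one another, while a large spread indicates a mix of close and distant relatives; a tool could branch on this value when deciding how to build the guide tree.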
Kosugi, Shunichi; Natsume, Satoshi; Yoshida, Kentaro; MacLean, Daniel; Cano, Liliana; Kamoun, Sophien; Terauchi, Ryohei
Accurate identification of DNA polymorphisms using next-generation sequencing technology is challenging because of a high rate of sequencing error and incorrect mapping of reads to reference genomes. Currently available short read aligners and DNA variant callers suffer from these problems. We developed the Coval software to improve the quality of short read alignments. Coval is designed to minimize the incidence of spurious alignment of short reads, by filtering mismatched reads that remained in alignments after local realignment and error correction of mismatched reads. The error correction is executed based on the base quality and allele frequency at the non-reference positions for an individual or pooled sample. We demonstrated the utility of Coval by applying it to simulated genomes and experimentally obtained short-read data of rice, nematode, and mouse. Moreover, we found an unexpectedly large number of incorrectly mapped reads in ‘targeted’ alignments, where the whole genome sequencing reads had been aligned to a local genomic segment, and showed that Coval effectively eliminated such spurious alignments. We conclude that Coval significantly improves the quality of short-read sequence alignments, thereby increasing the calling accuracy of currently available tools for SNP and indel identification. Coval is available at http://sourceforge.net/projects/coval105/. PMID:24116042
Boraik, Aziz Nasser; Abdullah, Rosni; Venkat, Ibrahim
Multiple sequence alignment (MSA) is an essential process for many biological sequence analyses. Many algorithms have been developed to solve MSA, but an efficient computation method with very high accuracy is still a challenge. Progressive alignment is the most widely used approach to compute the final MSA. In this paper, we present a simple and effective progressive approach. The method builds on the sequence-order-independent progressive alignment proposed in QOMA, modified to align whole sequences so as to maximize the MSA score. Moreover, to further improve the accuracy of the method, we estimate the similarity of any pair of input sequences by using their percent identity, and based on this measure, we choose different substitution matrices during the progressive alignment. In addition, we incorporate horizontal information into the alignment by adjusting the weights of amino acid residues based on their neighboring residues. The approach was tested on two popular benchmarks: BAliBASE 3.0 for global protein sequences and IRMBASE 2.0 for local protein sequences. The proposed approach outperforms the original QOMA method in sum-of-pairs score and column score by up to 14% and 7%, respectively.
Kryukov, Kirill; Saitou, Naruya
Large nucleotide sequence datasets are becoming increasingly common objects of comparison. Complete bacterial genomes are reported almost every day. This creates challenges for developing new multiple sequence alignment methods. Conventional multiple alignment methods are based on pairwise alignment and/or progressive alignment techniques. These approaches have performance problems when the number of sequences is large and when dealing with genome-scale sequences. We present a new method of multiple sequence alignment, called MISHIMA (Method for Inferring Sequence History In terms of Multiple Alignment), that does not depend on pairwise sequence comparison. A new algorithm is used to quickly find rare oligonucleotide sequences shared by all sequences. A divide-and-conquer approach is then applied to break the sequences into fragments that can be aligned independently by an external alignment program. These partial alignments are assembled together to form a complete alignment of the original sequences. MISHIMA provides improved performance compared to the commonly used multiple alignment methods. As an example, six complete genome sequences of the bacterial species Helicobacter pylori (about 1.7 Mb each) were successfully aligned in about 6 hours using a single PC.
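The idea of anchoring on rare oligonucleotides shared by all sequences can be illustrated with a small sketch. It assumes that "rare" means a k-mer occurring exactly once in every input sequence; the function names are hypothetical and this is not MISHIMA's actual algorithm, only an illustration of how such anchors could split sequences into independently alignable fragments.

```python
from collections import Counter


def unique_kmers(seq, k):
    """Return the set of k-mers that occur exactly once in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {kmer for kmer, c in counts.items() if c == 1}


def shared_anchors(seqs, k):
    """Map each k-mer that is unique in every sequence to its position in each sequence.

    Assumption: an anchor is a k-mer present exactly once in all inputs.
    """
    shared = set.intersection(*(unique_kmers(s, k) for s in seqs))
    # str.index is safe here because each anchor occurs exactly once per sequence
    return {kmer: [s.index(kmer) for s in seqs] for kmer in sorted(shared)}
```

The regions between consecutive anchors could then be passed to an external aligner one fragment at a time, which is the divide-and-conquer step the abstract describes.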
Leimeister, Chris-Andre; Boden, Marcus; Horwege, Sebastian; Lindner, Sebastian; Morgenstern, Burkhard
Motivation: Alignment-free methods for sequence comparison are increasingly used for genome analysis and phylogeny reconstruction; they circumvent various difficulties of traditional alignment-based approaches. In particular, alignment-free methods are much faster than pairwise or multiple alignments. They are, however, less accurate than methods based on sequence alignment. Most alignment-free approaches work by comparing the word composition of sequences. A well-known problem with these methods is that neighbouring word matches are far from independent. Results: To reduce the statistical dependency between adjacent word matches, we propose to use ‘spaced words’, defined by patterns of ‘match’ and ‘don’t care’ positions, for alignment-free sequence comparison. We describe a fast implementation of this approach using recursive hashing and bit operations, and we show that further improvements can be achieved by using multiple patterns instead of single patterns. To evaluate our approach, we use spaced-word frequencies as a basis for fast phylogeny reconstruction. Using real-world and simulated sequence data, we demonstrate that our multiple-pattern approach produces better phylogenies than approaches relying on contiguous words. Availability and implementation: Our program is freely available at http://spaced.gobics.de/. Contact: email@example.com Supplementary information: Supplementary data are available at Bioinformatics online. PMID:24700317
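A spaced word, as defined above, keeps the characters at the 'match' positions of a pattern and ignores the 'don't care' positions. The following minimal sketch counts spaced-word occurrences for one pattern; encoding the pattern as a string of '1' (match) and '0' (don't care) is an assumption made for illustration, and the plain string slicing here does not reproduce the recursive hashing and bit operations of the published implementation.

```python
from collections import Counter


def spaced_word_counts(seq, pattern):
    """Count spaced words of seq under a pattern of '1' (match) and '0' (don't care)."""
    keep = [i for i, c in enumerate(pattern) if c == "1"]
    span = len(pattern)
    return Counter(
        "".join(seq[start + i] for i in keep)
        for start in range(len(seq) - span + 1)
    )


def multi_pattern_counts(seq, patterns):
    """Collect counts separately for each pattern (multiple-pattern variant)."""
    return {p: spaced_word_counts(seq, p) for p in patterns}
```

With the pattern `"101"`, the window `ACG` contributes the spaced word `AG`, so mismatches at the middle position do not affect the count.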
Kaya, Mehmet; Sarhan, Abdullah; Alhajj, Reda
Multiple sequence alignment is of central importance to bioinformatics and computational biology. Although a large number of algorithms for computing a multiple sequence alignment have been designed, the efficient computation of highly accurate and statistically significant multiple alignments is still a challenge. In this paper, we propose an efficient method using a multi-objective genetic algorithm (MSAGMOGA) to discover optimal alignments with affine gap in multiple sequence data. The main advantage of our approach is that a large number of tradeoff (i.e., non-dominated) alignments can be obtained by a single run with respect to conflicting objectives: affine gap penalty minimization and similarity and support maximization. To the best of our knowledge, this is the first effort with three objectives in this direction. The proposed method can be applied to any data set with a sequential character. Furthermore, it allows any choice of similarity measures for finding alignments. By analyzing the obtained optimal alignments, the decision maker can understand the tradeoff between the objectives. We compared our method with three well-known multiple sequence alignment methods: MUSCLE, SAGA and MSA-GA. The first is a progressive method, and the other two are based on evolutionary algorithms. Experiments on the BAliBASE 2.0 database were conducted, and the results confirm that MSAGMOGA obtains results with better accuracy and statistical significance than the three well-known methods when computing multiple sequence alignments with affine gaps. The proposed method also finds solutions faster than the other evolutionary approaches mentioned above. Copyright © 2014 Elsevier Ireland Ltd. All rights reserved.
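The notion of non-dominated (tradeoff) alignments can be made concrete with a small sketch. It uses only two of the three objectives named above (affine gap penalty to minimize, similarity to maximize) for brevity. The penalty convention (opening cost plus an extension cost for each additional gap character), the default parameter values, and the function names are assumptions for illustration, not the published method.

```python
import re


def affine_gap_penalty(aligned_rows, gap_open=10.0, gap_extend=0.5):
    """Sum affine penalties over every run of '-' in every aligned row.

    Assumed convention: a run of length n costs gap_open + gap_extend * (n - 1).
    """
    total = 0.0
    for row in aligned_rows:
        for run in re.findall(r"-+", row):
            total += gap_open + gap_extend * (len(run) - 1)
    return total


def non_dominated(candidates):
    """Keep (penalty, similarity) pairs not dominated by any other candidate.

    A candidate is dominated if another one has penalty <= and similarity >=,
    with at least one of the two strictly better.
    """
    front = []
    for i, (p, s) in enumerate(candidates):
        dominated = any(
            q <= p and t >= s and (q < p or t > s)
            for j, (q, t) in enumerate(candidates)
            if j != i
        )
        if not dominated:
            front.append((p, s))
    return front
```

A genetic algorithm would score each candidate alignment on all objectives and retain the non-dominated set, leaving the final choice among tradeoffs to the decision maker.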
Al-Shatnawi, Mufleh; Ahmad, M Omair; Swamy, M N S
The alignment of multiple protein sequences is one of the most commonly performed tasks in bioinformatics. In spite of considerable research and efforts that have been recently deployed for improving the performance of multiple sequence alignment (MSA) algorithms, finding a highly accurate alignment between multiple protein sequences is still a challenging problem. We propose a novel and efficient algorithm called MSAIndelFR for multiple sequence alignment using the information on the predicted locations of IndelFRs and the computed average log-loss values obtained from IndelFR predictors, each of which is designed for a different protein fold. We demonstrate that the introduction of a new variable gap penalty function based on the predicted locations of the IndelFRs and the computed average log-loss values into the proposed algorithm substantially improves the protein alignment accuracy. This is illustrated by evaluating the performance of the algorithm in aligning sequences belonging to the protein folds for which the IndelFR predictors already exist and by using the reference alignments of the four popular benchmarks, BAliBASE 3.0, OXBENCH, PREFAB 4.0, and SABRE (SABmark 1.65). We have proposed a novel and efficient algorithm, the MSAIndelFR algorithm, for multiple protein sequence alignment incorporating a new variable gap penalty function. It is shown that the performance of the proposed algorithm is superior to that of the most widely used alignment algorithms, Clustal W2, Clustal Omega, Kalign2, MSAProbs, MAFFT, MUSCLE, ProbCons and Probalign, in terms of both the sum-of-pairs and total column metrics.
While pairwise sequence alignment (PSA) by dynamic programming is guaranteed to generate one of the optimal alignments, multiple sequence alignment (MSA) of highly divergent sequences often results in poorly aligned sequences, plaguing all subsequent phylogenetic analysis. One way to avoid this problem is to use only PSA to reconstruct phylogenetic trees, which can only be done with distance-based methods. I compared the accuracy of this new computational approach (named PhyPA for phylogenetics by pairwise alignment) against the maximum likelihood method using MSA (the ML+MSA approach), based on nucleotide, amino acid and codon sequences simulated with different topologies and tree lengths. I present a surprising discovery that the fast PhyPA method consistently outperforms the slow ML+MSA approach for highly diverged sequences even when all optimization options were turned on for the ML+MSA approach. Only when sequences are not highly diverged (i.e., when a reliable MSA can be obtained) does the ML+MSA approach outperform PhyPA. The true topologies are always recovered by ML with the true alignment from the simulation. However, with MSA derived from alignment programs such as MAFFT or MUSCLE, the recovered topology consistently has higher likelihood than that for the true topology. Thus, the failure of ML+MSA to recover the true topology is not because of insufficient search of tree space, but because of the distortion of phylogenetic signal by MSA methods. I have implemented in DAMBE PhyPA and two approaches making use of multi-gene data sets to derive phylogenetic support for subtrees equivalent to resampling techniques such as bootstrapping and jackknifing.
Agrawal, Ankit; Huang, Xiaoqiu
Pairwise sequence alignment is a central problem in bioinformatics, which forms the basis of various other applications. Two related sequences are expected to have a high alignment score, but relatedness is usually judged by statistical significance rather than by alignment score. Recently, it was shown that pairwise statistical significance gives promising results as an alternative to database statistical significance for getting individual significance estimates of pairwise alignment scores. The improvement was mainly attributed to making the statistical significance estimation process more sequence-specific and database-independent. In this paper, we use sequence-specific and position-specific substitution matrices to derive the estimates of pairwise statistical significance, which is expected to use more sequence-specific information in estimating pairwise statistical significance. Experiments on a benchmark database with sequence-specific substitution matrices at different levels of sequence-specific contribution were conducted, and results confirm that using sequence-specific substitution matrices for estimating pairwise statistical significance is significantly better than using a standard matrix like BLOSUM62, and than database statistical significance estimates reported by popular database search programs like BLAST, PSI-BLAST (without pretrained PSSMs), and SSEARCH on a benchmark database, but with pretrained PSSMs, PSI-BLAST results are significantly better. Further, using position-specific substitution matrices for estimating pairwise statistical significance gives significantly better results even than PSI-BLAST using pretrained PSSMs.
Chattopadhyay, Amit K; Nasiev, Diar; Flower, Darren R
Within bioinformatics, the textual alignment of amino acid sequences has long dominated the determination of similarity between proteins, with all that implies for shared structure, function and evolutionary descent. Despite the relative success of modern-day sequence alignment algorithms, so-called alignment-free approaches offer a complementary means of determining and expressing similarity, with potential benefits in certain key applications, such as regression analysis of protein structure-function studies, where alignment-based similarity has performed poorly. Here, we offer a fresh, statistical physics-based perspective focusing on the question of alignment-free comparison, in the process adapting results from 'first passage probability distribution' to summarize statistics of ensemble averaged amino acid propensity values. In this article, we introduce and elaborate this approach. © The Author 2015. Published by Oxford University Press.
Iantorno, Stefano; Gori, Kevin; Goldman, Nick; Gil, Manuel; Dessimoz, Christophe
Multiple sequence alignment (MSA) is a fundamental and ubiquitous technique in bioinformatics used to infer related residues among biological sequences. Thus alignment accuracy is crucial to a vast range of analyses, often in ways difficult to assess in those analyses. To compare the performance of different aligners and help detect systematic errors in alignments, a number of benchmarking strategies have been pursued. Here we present an overview of the main strategies (based on simulation, consistency, protein structure, and phylogeny) and discuss their different advantages and associated risks. We outline a set of desirable characteristics for effective benchmarking, and evaluate each strategy in light of them. We conclude that there is currently no universally applicable means of benchmarking MSA, and that developers and users of alignment tools should choose their benchmark according to the context of application, with a keen awareness of the assumptions underlying each benchmarking strategy.
Will, Sebastian; Otto, Christina; Miladi, Milad; Möhl, Mathias; Backofen, Rolf
Motivation: RNA-Seq experiments have revealed a multitude of novel ncRNAs. The gold standard for their analysis based on simultaneous alignment and folding suffers from extreme time complexity of O(n^6). Subsequently, numerous faster ‘Sankoff-style’ approaches have been suggested. Commonly, the performance of such methods relies on sequence-based heuristics that restrict the search space to optimal or near-optimal sequence alignments; however, the accuracy of sequence-based methods breaks down for RNAs with sequence identities below 60%. Alignment approaches like LocARNA that do not require sequence-based heuristics have been limited to high complexity (≥ quartic time). Results: Breaking this barrier, we introduce the novel Sankoff-style algorithm ‘sparsified prediction and alignment of RNAs based on their structure ensembles (SPARSE)’, which runs in quadratic time without sequence-based heuristics. To achieve this low complexity, on par with sequence alignment algorithms, SPARSE features strong sparsification based on structural properties of the RNA ensembles. Following PMcomp, SPARSE gains further speed-up from lightweight energy computation. Although all existing lightweight Sankoff-style methods restrict Sankoff’s original model by disallowing loop deletions and insertions, SPARSE transfers the Sankoff algorithm to the lightweight energy model completely for the first time. Compared with LocARNA, SPARSE achieves similar alignment and better folding quality in significantly less time (speedup: 3.7). At similar run-time, it aligns low sequence identity instances substantially more accurately than RAF, which uses sequence-based heuristics. Availability and implementation: SPARSE is freely available at http://www.bioinf.uni-freiburg.de/Software/SPARSE. Contact: firstname.lastname@example.org Supplementary information: Supplementary data are available at Bioinformatics online. PMID:25838465
Margulies, Elliott H; Cooper, Gregory M; Asimenos, George; Thomas, Daryl J; Dewey, Colin N; Siepel, Adam; Birney, Ewan; Keefe, Damian; Schwartz, Ariel S; Hou, Minmei; Taylor, James; Nikolaev, Sergey; Montoya-Burgos, Juan I; Löytynoja, Ari; Whelan, Simon; Pardi, Fabio; Massingham, Tim; Brown, James B; Bickel, Peter; Holmes, Ian; Mullikin, James C; Ureta-Vidal, Abel; Paten, Benedict; Stone, Eric A; Rosenbloom, Kate R; Kent, W James; Bouffard, Gerard G; Guan, Xiaobin; Hansen, Nancy F; Idol, Jacquelyn R; Maduro, Valerie V B; Maskeri, Baishali; McDowell, Jennifer C; Park, Morgan; Thomas, Pamela J; Young, Alice C; Blakesley, Robert W; Muzny, Donna M; Sodergren, Erica; Wheeler, David A; Worley, Kim C; Jiang, Huaiyang; Weinstock, George M; Gibbs, Richard A; Graves, Tina; Fulton, Robert; Mardis, Elaine R; Wilson, Richard K; Clamp, Michele; Cuff, James; Gnerre, Sante; Jaffe, David B; Chang, Jean L; Lindblad-Toh, Kerstin; Lander, Eric S; Hinrichs, Angie; Trumbower, Heather; Clawson, Hiram; Zweig, Ann; Kuhn, Robert M; Barber, Galt; Harte, Rachel; Karolchik, Donna; Field, Matthew A; Moore, Richard A; Matthewson, Carrie A; Schein, Jacqueline E; Marra, Marco A; Antonarakis, Stylianos E; Batzoglou, Serafim; Goldman, Nick; Hardison, Ross; Haussler, David; Miller, Webb; Pachter, Lior; Green, Eric D; Sidow, Arend
A key component of the ongoing ENCODE project involves rigorous comparative sequence analyses for the initially targeted 1% of the human genome. Here, we present orthologous sequence generation, alignment, and evolutionary constraint analyses of 23 mammalian species for all ENCODE targets. Alignments were generated using four different methods; comparisons of these methods reveal large-scale consistency but substantial differences in terms of small genomic rearrangements, sensitivity (sequence coverage), and specificity (alignment accuracy). We describe the quantitative and qualitative trade-offs concomitant with alignment method choice and the levels of technical error that need to be accounted for in applications that require multisequence alignments. Using the generated alignments, we identified constrained regions using three different methods. While the different constraint-detecting methods are in general agreement, there are important discrepancies relating to both the underlying alignments and the specific algorithms. However, by integrating the results across the alignments and constraint-detecting methods, we produced constraint annotations that were found to be robust based on multiple independent measures. Analyses of these annotations illustrate that most classes of experimentally annotated functional elements are enriched for constrained sequences; however, large portions of each class (with the exception of protein-coding sequences) do not overlap constrained regions. The latter elements might not be under primary sequence constraint, might not be constrained across all mammals, or might have expendable molecular functions. Conversely, 40% of the constrained sequences do not overlap any of the functional elements that have been experimentally identified. Together, these findings demonstrate and quantify how many genomic functional elements await basic molecular characterization.
Liu, Xuemei; Wan, Lin; Li, Jing; Reinert, Gesine; Waterman, Michael S.; Sun, Fengzhu
Alignment-free sequence comparison is widely used for comparing gene regulatory regions and for identifying horizontally transferred genes. Recent studies on the power of a widely used alignment-free comparison statistic D2 and its variants D2∗ and D2s showed that their power approximates a limit smaller than 1 as the sequence length tends to infinity under a pattern transfer model. We develop new alignment-free statistics based on D2, D2∗ and D2s by comparing local sequence pairs and then summing over all the local sequence pairs of certain length. We show that the new statistics are much more powerful than the corresponding statistics and the power tends to 1 as the sequence length tends to infinity under the pattern transfer model. PMID:21723298
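For orientation, the basic D2 statistic is the number of k-word matches between two sequences, i.e. the sum over all k-words w of (count of w in the first sequence) × (count of w in the second). The sketch below computes only this basic statistic; the normalized variants D2∗ and D2s and the new local statistics described above are not reproduced, and the function names are illustrative.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-word in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(a, b, k):
    """Basic D2: sum over shared k-words of the product of their counts."""
    counts_a, counts_b = kmer_counts(a, k), kmer_counts(b, k)
    return sum(counts_a[w] * counts_b[w] for w in counts_a if w in counts_b)
```

A local variant in the spirit of the abstract would apply such a count to pairs of windows of a fixed length and sum the results, though the exact definition should be taken from the paper.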
Tan, Ge; Muffato, Matthieu; Ledergerber, Christian; Herrero, Javier; Goldman, Nick; Gil, Manuel; Dessimoz, Christophe
Phylogenetic inference is generally performed on the basis of multiple sequence alignments (MSA). Because errors in an alignment can lead to errors in tree estimation, there is a strong interest in identifying and removing unreliable parts of the alignment. In recent years several automated filtering approaches have been proposed, but despite their popularity, a systematic and comprehensive comparison of different alignment filtering methods on real data has been lacking. Here, we extend and apply recently introduced phylogenetic tests of alignment accuracy on a large number of gene families and contrast the performance of unfiltered versus filtered alignments in the context of single-gene phylogeny reconstruction. Based on multiple genome-wide empirical and simulated data sets, we show that the trees obtained from filtered MSAs are on average worse than those obtained from unfiltered MSAs. Furthermore, alignment filtering often leads to an increase in the proportion of well-supported branches that are actually wrong. We confirm that our findings hold for a wide range of parameters and methods. Although our results suggest that light filtering (up to 20% of alignment positions) has little impact on tree accuracy and may save some computation time, contrary to widespread practice, we do not generally recommend the use of current alignment filtering methods for phylogenetic inference. By providing a way to rigorously and systematically measure the impact of filtering on alignments, the methodology set forth here will guide the development of better filtering algorithms. PMID:26031838
Vacic, Vladimir; Iakoucheva, Lilia M; Radivojac, Predrag
Two Sample Logo is a web-based tool that detects and displays statistically significant differences in position-specific symbol compositions between two sets of multiple sequence alignments. In a typical scenario, two groups of aligned sequences will share a common motif but will differ in their functional annotation. The inclusion of the background alignment provides an appropriate underlying amino acid or nucleotide distribution and addresses intersite symbol correlations. In addition, the difference detection process is sensitive to the sizes of the aligned groups. Two Sample Logo extends WebLogo, a widely-used sequence logo generator. The source code is distributed under the MIT Open Source license agreement and is available for download free of charge.
Valenzuela-González, Fabiola; Martínez-Porchas, Marcel; Villalpando-Canchola, Enrique; Vargas-Albores, Francisco
Ultrafast-metagenomic sequence classification using exact alignments (Kraken) is a novel approach to classify 16S rDNA sequences. The classifier is based on mapping short sequences to the lowest ancestor and performing alignments to form subtrees with specific weights in each taxon node. This study aimed to evaluate the classification performance of Kraken with long 16S rDNA random environmental sequences produced by cloning followed by Sanger sequencing. A total of 480 clones were isolated and expanded, and 264 of these clones formed contigs (1352 ± 153 bp). The same sequences were analyzed using the Ribosomal Database Project (RDP) classifier. Deeper classification performance was achieved by Kraken than by the RDP: 73% of the contigs were classified up to the species or variety levels, whereas 67% of these contigs were classified no further than the genus level by the RDP. The results also demonstrated that unassembled sequences analyzed by Kraken provide similar or even deeper information. Moreover, sequences that did not form contigs, which are usually discarded by other programs, provided meaningful information when analyzed by Kraken. Finally, it appears that the assembly step for Sanger sequences can be eliminated when using Kraken. Kraken cumulates the information of both sequence senses, providing additional elements for the classification. In conclusion, the results demonstrate that Kraken is an excellent choice for use in the taxonomic assignment of sequences obtained by Sanger sequencing or based on third generation sequencing, whose main goal is to generate longer sequences.
Vanhoutreve, Renaud; Kress, Arnaud; Legrand, Baptiste; Gass, Hélène; Poch, Olivier; Thompson, Julie D
A standard procedure in many areas of bioinformatics is to use a multiple sequence alignment (MSA) as the basis for various types of homology-based inference. Applications include 3D structure modelling, protein functional annotation, prediction of molecular interactions, etc. These applications, however sophisticated, are generally highly sensitive to the alignment used, and neglecting non-homologous or uncertain regions in the alignment can lead to significant bias in the subsequent inferences. Here, we present a new method, LEON-BIS, which uses a robust Bayesian framework to estimate the homologous relations between sequences in a protein multiple alignment. Sequences are clustered into sub-families and relations are predicted at different levels, including 'core blocks', 'regions' and full-length proteins. The accuracy and reliability of the predictions are demonstrated in large-scale comparisons using well annotated alignment databases, where the homologous sequence segments are detected with very high sensitivity and specificity. LEON-BIS uses robust Bayesian statistics to distinguish the portions of multiple sequence alignments that are conserved either across the whole family or within subfamilies. LEON-BIS should thus be useful for automatic, high-throughput genome annotations, 2D/3D structure predictions, protein-protein interaction predictions etc.
Abuín, José M; Pichel, Juan C; Pena, Tomás F; Amigo, Jorge
Next-generation sequencing (NGS) technologies have led to a huge amount of genomic data that need to be analyzed and interpreted. This fact has a huge impact on the DNA sequence alignment process, which nowadays requires the mapping of billions of small DNA sequences onto a reference genome. As a result, sequence alignment remains the most time-consuming stage in the sequence analysis workflow. To deal with this issue, state-of-the-art aligners take advantage of parallelization strategies. However, the existing solutions show limited scalability and have a complex implementation. In this work we introduce SparkBWA, a new tool that exploits the capabilities of a big data technology such as Spark to boost the performance of one of the most widely adopted aligners, the Burrows-Wheeler Aligner (BWA). The design of SparkBWA uses two independent software layers in such a way that no modifications to the original BWA source code are required, which assures its compatibility with any BWA version (future or legacy). SparkBWA is evaluated in different scenarios showing noticeable results in terms of performance and scalability. A comparison to other parallel BWA-based aligners validates the benefits of our approach. Finally, an intuitive and flexible API is provided to NGS professionals in order to facilitate the acceptance and adoption of the new tool. The source code of the software described in this paper is publicly available at https://github.com/citiususc/SparkBWA, with a GPL3 license. PMID:27182962
Walker, Esther J; Bergen, Benjamin K; Núñez, Rafael
People use space in a variety of ways to structure their thoughts about time. The present report focuses on the different ways that space is employed when reasoning about deictic (past/future relationships) and sequence (earlier/later relationships) time. In the first study, we show that deictic and sequence time are aligned along the lateral axis in a manner consistent with previous work, with past and earlier events associated with left space and future and later events associated with right space. However, the alignment of time with space is different along the sagittal axis. Participants associated future events and earlier events (not later events) with the space in front of their body and past and later events with the space behind, consistent with the sagittal spatial terms (e.g., ahead, in front of) that we use to talk about deictic and sequence time. In the second study, we show that these associations between sequence time and sagittal space are sensitive to person-perspective. This suggests that the particular space-time associations observed in English speakers are influenced by a variety of different spatial properties, including spatial location and perspective. Copyright © 2016. Published by Elsevier B.V.
Gelly, Jean-Christophe; Joseph, Agnel Praveen; Srinivasan, Narayanaswamy; de Brevern, Alexandre G.
With the immense growth in the number of available protein structures, fast and accurate structure comparison has become essential. We propose an efficient method for structure comparison, based on a structural alphabet. Protein Blocks (PBs) is a widely used structural alphabet with 16 pentapeptide conformations that can fairly approximate a complete protein chain. Thus a 3D structure can be translated into a 1D sequence of PBs. With a simple Needleman–Wunsch approach and a raw PB substitution matrix, PB-based structural alignments were better than many popular methods. The iPBA web server presents an improved alignment approach using (i) specialized PB Substitution Matrices (SM) and (ii) anchor-based alignment methodology. With these developments, the quality of ∼88% of alignments was improved. iPBA alignments were also better than DALI, MUSTANG and GANGSTA+ in >80% of the cases. The web server is designed for both pairwise comparisons and database searches. Outputs are given as sequence alignment and superposed 3D structures displayed using PyMol and Jmol. A local alignment option for detecting sub-structural similarity is also embedded. As a fast and efficient ‘sequence-based’ structure comparison tool, we believe that it will be quite useful to the scientific community. iPBA can be accessed at http://www.dsimb.inserm.fr/dsimb_tools/ipba/. PMID:21586582
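Because a structure encoded as Protein Blocks is just a string over a 16-letter alphabet, it can be aligned with ordinary sequence methods. The following sketch shows a plain Needleman–Wunsch global alignment score over two such strings. The match, mismatch and linear gap values are hypothetical stand-ins; iPBA uses specialized PB substitution matrices and anchors, which are not reproduced here.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score with a linear gap cost, using two rows of memory.

    Scoring values are placeholders; a real PB alignment would look up
    pair scores in a PB substitution matrix instead.
    """
    # Row 0: aligning the empty prefix of a against prefixes of b costs only gaps
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]
```

Swapping the `match`/`mismatch` test for a dictionary lookup keyed on PB letter pairs would be the natural way to plug in a substitution matrix.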
Srimani, Jaydeep K; Wu, Po-Yen; Phan, John H; Wang, May D
We developed a scalable distributed computing system using the Berkeley Open Infrastructure for Network Computing (BOINC) to align next-generation sequencing (NGS) data quickly and accurately. NGS technology is emerging as a promising platform for gene expression analysis due to its high sensitivity compared to traditional genomic microarray technology. However, despite the benefits, NGS datasets can be prohibitively large, requiring significant computing resources to obtain sequence alignment results. Moreover, as the data and alignment algorithms become more prevalent, it will become necessary to examine the effect of the multitude of alignment parameters on various NGS systems. We validate the distributed software system by (1) computing simple timing results to show the speed-up gained by using multiple computers, (2) optimizing alignment parameters using simulated NGS data, and (3) computing NGS expression levels for a single biological sample using optimal parameters and comparing these expression levels to those of a microarray sample. Results indicate that the distributed alignment system achieves approximately a linear speed-up and correctly distributes sequence data to and gathers alignment results from multiple compute clients.
Pruesse, Elmar; Peplies, Jörg; Glöckner, Frank Oliver
Motivation: In the analysis of homologous sequences, computation of multiple sequence alignments (MSAs) has become a bottleneck. This is especially troublesome for marker genes like the ribosomal RNA (rRNA) where already millions of sequences are publicly available and individual studies can easily produce hundreds of thousands of new sequences. Methods have been developed to cope with such numbers, but further improvements are needed to meet accuracy requirements. Results: In this study, we present the SILVA Incremental Aligner (SINA) used to align the rRNA gene databases provided by the SILVA ribosomal RNA project. SINA uses a combination of k-mer searching and partial order alignment (POA) to maintain very high alignment accuracy while satisfying high throughput performance demands. SINA was evaluated in comparison with the commonly used high throughput MSA programs PyNAST and mothur. The three BRAliBase III benchmark MSAs could be reproduced with 99.3, 97.6 and 96.1% accuracy. A larger benchmark MSA comprising 38 772 sequences could be reproduced with 98.9 and 99.3% accuracy using reference MSAs comprising 1000 and 5000 sequences. SINA was able to achieve higher accuracy than PyNAST and mothur in all performed benchmarks. Availability: Alignment of up to 500 sequences using the latest SILVA SSU/LSU Ref datasets as reference MSA is offered at http://www.arb-silva.de/aligner. This page also links to Linux binaries, user manual and tutorial. SINA is made available under a personal use license. Contact: email@example.com Supplementary information: Supplementary data are available at Bioinformatics online. PMID:22556368
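The k-mer searching step mentioned above can be pictured as ranking reference sequences by how many k-mers they share with the query. The sketch below is only an illustration of that general idea; the function names, the value of k and the ranking rule are assumptions and do not describe how SINA itself is implemented.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def rank_references(query, references, k=4):
    """Rank reference names by the number of distinct k-mers shared with query.

    `references` maps a name to a sequence. Ties are broken alphabetically
    so the result is deterministic.
    """
    query_kmers = kmers(query, k)
    shared = {name: len(query_kmers & kmers(seq, k))
              for name, seq in references.items()}
    return sorted(shared, key=lambda name: (-shared[name], name))


# Made-up toy sequences; real rRNA references are far longer.
refs = {"ref_a": "ACGUACGA", "ref_b": "GGGGCCCC"}
print(rank_references("ACGUACGU", refs))
```

In a pipeline of this kind, the top-ranked references would then be passed to the more expensive alignment stage.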
Thompson, Julie D.; Linard, Benjamin; Lecompte, Odile; Poch, Olivier
Multiple comparison or alignment of protein sequences has become a fundamental tool in many different domains in modern molecular biology, from evolutionary studies to prediction of 2D/3D structure, molecular function and inter-molecular interactions etc. By placing the sequence in the framework of the overall family, multiple alignments can be used to identify conserved features and to highlight differences or specificities. In this paper, we describe a comprehensive evaluation of many of the most popular methods for multiple sequence alignment (MSA), based on a new benchmark test set. The benchmark is designed to represent typical problems encountered when aligning the large protein sequence sets that result from today's high throughput biotechnologies. We show that alignment methods have significantly progressed and can now identify most of the shared sequence features that determine the broad molecular function(s) of a protein family, even for divergent sequences. However, we have identified a number of important challenges. First, the locally conserved regions, that reflect functional specificities or that modulate a protein's function in a given cellular context, are less well aligned. Second, motifs in natively disordered regions are often misaligned. Third, the badly predicted or fragmentary protein sequences, which make up a large proportion of today's databases, lead to a significant number of alignment errors. Based on this study, we demonstrate that the existing MSA methods can be exploited in combination to improve alignment accuracy, although novel approaches will still be needed to fully explore the most difficult regions. We then propose knowledge-enabled, dynamic solutions that will hopefully pave the way to enhanced alignment construction and exploitation in future evolutionary systems biology studies. PMID:21483869
Daily, Jeffrey A.
Sequence alignment algorithms are a key component of many bioinformatics applications. Though various fast Smith-Waterman local sequence alignment implementations have been developed for x86 CPUs, most are embedded into larger database search tools. In addition, fast implementations of Needleman-Wunsch global sequence alignment and its semi-global variants are not as widespread. This article presents the first software library for local, global, and semi-global pairwise intra-sequence alignments and improves the performance of previous intra-sequence implementations. As a result, a faster intra-sequence pairwise alignment implementation is described and benchmarked. Using a 375 residue query sequence a speed of 136 billion cell updates per second (GCUPS) was achieved on a dual Intel Xeon E5-2670 12-core processor system, the highest reported for an implementation based on Farrar’s ‘striped’ approach. When using only a single thread, parasail was 1.7 times faster than Rognes’s SWIPE. For many score matrices, parasail is faster than BLAST. The software library is designed for 64 bit Linux, OS X, or Windows on processors with SSE2, SSE41, or AVX2. Source code is available from https://github.com/jeffdaily/parasail under the Battelle BSD-style license. In conclusion, applications that require optimal alignment scores could benefit from the improved performance. For the first time, SIMD global, semi-global, and local alignments are available in a stand-alone C library.
Collyda, Chrysa; Diplaris, Sotiris; Mitkas, Pericles A; Maglaveras, Nicos; Pappas, Costas
This paper proposes a novel method for aligning multiple genomic or proteomic sequences using a fuzzyfied Hidden Markov Model (HMM). HMMs are known to provide compelling performance among multiple sequence alignment (MSA) algorithms, yet their stochastic nature does not help them cope with the existing dependence among the sequence elements. Fuzzy HMMs are a novel type of HMMs based on fuzzy sets and fuzzy integrals which generalizes the classical stochastic HMM, by relaxing its independence assumptions. In this paper, the fuzzy HMM model for MSA is mathematically defined. New fuzzy algorithms are described for building and training fuzzy HMMs, as well as for their use in aligning multiple sequences. Fuzzy HMMs can also increase the model capability of aligning multiple sequences mainly in terms of computation time. Modeling the multiple sequence alignment procedure with fuzzy HMMs can yield a robust and time-effective solution that can be widely used in bioinformatics in various applications, such as protein classification, phylogenetic analysis and gene prediction, among others.
Computational phylogenetics is in the process of revolutionizing historical linguistics. Recent applications have shed new light on controversial issues, such as the location and time depth of language families and the dynamics of their spread. So far, these approaches have been limited to single-language families because they rely on a large body of expert cognacy judgments or grammatical classifications, which is currently unavailable for most language families. The present study pursues a different approach. Starting from raw phonetic transcription of core vocabulary items from very diverse languages, it applies weighted string alignment to track both phonetic and lexical change. Applied to a collection of ∼1,000 Eurasian languages and dialects, this method, combined with phylogenetic inference, leads to a classification in excellent agreement with established findings of historical linguistics. Furthermore, it provides strong statistical support for several putative macrofamilies contested in current historical linguistics. In particular, there is a solid signal for the Nostratic/Eurasiatic macrofamily. PMID:26403857
Huang, Shunping; Holt, James; Kao, Chia-Yu; McMillan, Leonard; Wang, Wei
Mapping reads to a reference sequence is a common step when analyzing allele effects in high-throughput sequencing data. The choice of reference is critical because its effect on quantitative sequence analysis is non-negligible. Recent studies suggest aligning to a single standard reference sequence, as is common practice, can lead to an underlying bias depending on the genetic distances of the target sequences from the reference. To avoid this bias, researchers have resorted to using modified reference sequences. Even with this improvement, various limitations and problems remain unsolved, which include reduced mapping ratios, shifts in read mappings and the selection of which variants to include to remove biases. To address these issues, we propose a novel and generic multi-alignment pipeline. Our pipeline integrates the genomic variations from known or suspected founders into separate reference sequences and performs alignments to each one. By mapping reads to multiple reference sequences and merging them afterward, we are able to rescue more reads and diminish the bias caused by using a single common reference. Moreover, the genomic origin of each read is determined and annotated during the merging process, providing a better source of information to assess differential expression than simple allele queries at known variant positions. Using RNA-seq of a diallel cross, we compare our pipeline with the single-reference pipeline and demonstrate our advantages of more aligned reads and a higher percentage of reads with assigned origins. Database URL: http://csbio.unc.edu/CCstatus/index.py?run=Pseudo.
Taheri, Javid; Zomaya, Albert Y
This paper presents a novel approach to solve the Multiple Sequence Alignment (MSA) problem. The Rubber Band Technique: Location Base (RBT-L) introduced in this paper is inspired by the elastic behaviour of a Rubber Band (RB) on a plate with poles. RBT-L is an iterative optimisation algorithm designed and implemented to find the optimal alignment for a set of input protein sequences. RBT-L is tested with one of the well-known benchmarks (BAliBASE 2.0) in this field. The obtained results show the superiority of the proposed technique even in the case of formidable sequences.
Zhu, Huazheng; He, Zhongshi; Jia, Yuanyuan
Multiple sequence alignment (MSA) is a fundamental and key step for implementing other tasks in bioinformatics, such as phylogenetic analyses, identification of conserved motifs and domains, structure prediction, etc. Although many MSA methods exist, no approach has yet produced biologically perfect alignments. This paper proposes a novel idea to perform MSA, where MSA is treated as a multiobjective optimization problem. A well-known decomposition-based multiobjective evolutionary algorithm framework is applied to solve MSA; the resulting method is named MOMSA. In the MOMSA algorithm, we develop a new population initialization method and a novel mutation operator. We compare the performance of MOMSA with several alignment methods based on evolutionary algorithms, including VDGA, GAPAM, and IMSA, and also with state-of-the-art progressive alignment approaches, such as MSAprobs, Probalign, MAFFT, Probcons, Clustal omega, T-Coffee, Kalign2, MUSCLE, FSA, Dialign, PRANK, and CLUSTALW. These alignment algorithms are tested on benchmark datasets BAliBASE 2.0 and BAliBASE 3.0. Statistical analyses of the experimental results show that MOMSA obtains significantly better alignments than VDGA and GAPAM on most test cases and better alignments than IMSA in terms of TC scores; they also indicate that MOMSA is comparable with the leading progressive alignment approaches in terms of alignment quality.
Layeb, Abdesslem; Selmane, Marwa; Elhoucine, Maroua Bencheikh
The Multiple Sequence Alignment (MSA) is one of the most challenging tasks in bioinformatics. It consists of aligning several sequences to show the fundamental relationship and the common characteristics between a set of protein or nucleic sequences; this problem has been shown to be NP-complete if the number of sequences is >2. In this paper, a new incomplete algorithm based on a Greedy Randomised Adaptive Search Procedure (GRASP) is presented to deal with the MSA problem. The first GRASP's phase is a new greedy algorithm based on the application of a new random progressive method and a hybrid global/local algorithm. The second phase is an adaptive refinement method based on consensus alignment. The obtained results are very encouraging and show the feasibility and effectiveness of the proposed approach.
Ding, Wenwen; Liu, Kai; Cheng, Fei; Zhang, Jin; Li, YunSong
Human action recognition and analysis has been an active research topic in computer vision for many years. This paper presents a method to represent human actions based on trajectories consisting of 3D joint positions. The method first decomposes an action into a sequence of meaningful atomic actions (actionlets), and then labels the actionlets with letters of the English alphabet according to the Davies-Bouldin index value. An action can therefore be represented as a sequence of actionlet symbols, which preserves the temporal order of occurrence of each of the actionlets. Finally, we employ sequence comparison to classify multiple actions using a string matching algorithm (Needleman-Wunsch). The effectiveness of the proposed method is evaluated on datasets captured by commodity depth cameras. Experiments with the proposed method on three challenging 3D action datasets show promising results.
Corel, Eduardo; Pitschi, Florian; Laprevotte, Ivan; Grasseau, Gilles; Didier, Gilles; Devauchelle, Claudine
While multiple alignment is the first step of usual classification schemes for biological sequences, alignment-free methods are being increasingly used as alternatives when multiple alignments fail. Subword-based combinatorial methods are popular for their low algorithmic complexity (suffix trees ...) or exhaustivity (motif search), in general with fixed length word and/or number of mismatches. We developed previously a method to detect local similarities (the N-local decoding) based on the occurrences of repeated subwords of fixed length, which does not impose a fixed number of mismatches. The resulting similarities are, for some "good" values of N, sufficiently relevant to form the basis of a reliable alignment-free classification. The aim of this paper is to develop a method that uses the similarities detected by N-local decoding while not imposing a fixed value of N. We present a procedure that selects for every position in the sequences an adaptive value of N, and we implement it as the MS4 classification tool. Among the equivalence classes produced by the N-local decodings for all N, we select a (relatively) small number of "relevant" classes corresponding to variable length subwords that carry enough information to perform the classification. The parameter N, for which correct values are data-dependent and thus hard to guess, is here replaced by the average repetitivity kappa of the sequences. We show that our approach yields classifications of several sets of HIV/SIV sequences that agree with the accepted taxonomy, even on usually discarded repetitive regions (like the non-coding part of LTR). The method MS4 satisfactorily classifies a set of sequences that are notoriously hard to align. This suggests that our approach forms the basis of a reliable alignment-free classification tool. The only parameter kappa of MS4 seems to give reasonable results even for its default value, which can be a great advantage for sequence sets for which little information is
Di Tommaso, Paolo; Bussotti, Giovanni; Kemena, Carsten; Capriotti, Emidio; Chatzou, Maria; Prieto, Pablo; Notredame, Cedric
This article introduces the SARA-Coffee web server; a service allowing the online computation of 3D structure based multiple RNA sequence alignments. The server makes it possible to combine sequences with and without known 3D structures. Given a set of sequences SARA-Coffee outputs a multiple sequence alignment along with a reliability index for every sequence, column and aligned residue. SARA-Coffee combines SARA, a pairwise structural RNA aligner with the R-Coffee multiple RNA aligner in a way that has been shown to improve alignment accuracy over most sequence aligners when enough structural data is available. The server can be accessed from http://tcoffee.crg.cat/apps/tcoffee/do:saracoffee. PMID:24972831
Background: In recent years, an exponentially growing number of tools for protein sequence analysis, editing and modeling tasks have been put at the disposal of the scientific community. Although the vast majority of these tools have been released as open source software, their steep learning curves often discourage even the most experienced users. Results: A simple and intuitive interface, PyMod, between the popular molecular graphics system PyMOL and several other tools (i.e., [PSI-]BLAST, ClustalW, MUSCLE, CEalign and MODELLER) has been developed, to show how the integration of the individual steps required for homology modeling and sequence/structure analysis within the PyMOL framework can hugely simplify these tasks. Sequence similarity searches, generation and editing of multiple sequence and structural alignments, and even the possibility to merge sequence and structure alignments have been implemented in PyMod, with the aim of creating a simple, yet powerful tool for sequence and structure analysis and building of homology models. Conclusions: PyMod represents a new tool for the analysis and the manipulation of protein sequences and structures. The ease of use and the integration with many sequence retrieval and alignment tools and with PyMOL, one of the most used molecular visualization systems, are the key features of this tool. Source code, installation instructions, video tutorials and a user's guide are freely available at the URL http://schubert.bio.uniroma1.it/pymod/index.html PMID:22536966
Warris, Sven; Yalcin, Feyruz; Jackson, Katherine J. L.; Nap, Jan Peter
Motivation: To obtain large-scale sequence alignments in a fast and flexible way is an important step in the analyses of next generation sequencing data. Applications based on the Smith-Waterman (SW) algorithm are often either not fast enough, limited to dedicated tasks or not sufficiently accurate due to statistical issues. Current SW implementations that run on graphics hardware do not report the alignment details necessary for further analysis. Results: With the Parallel SW Alignment Software (PaSWAS) it is possible (a) to have easy access to the computational power of NVIDIA-based general purpose graphics processing units (GPGPUs) to perform high-speed sequence alignments, and (b) retrieve relevant information such as score, number of gaps and mismatches. The software reports multiple hits per alignment. The added value of the new SW implementation is demonstrated with two test cases: (1) tag recovery in next generation sequence data and (2) isotype assignment within an immunoglobulin 454 sequence data set. Both cases show the usability and versatility of the new parallel Smith-Waterman implementation. PMID:25830241
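To make concrete what "score, number of gaps and mismatches" means for a Smith-Waterman alignment, the following is a plain CPU sketch with a traceback. It is not PaSWAS code and does not use a GPU; the scoring values and the function name `smith_waterman` are assumptions chosen for illustration, and only the single best local alignment is reported.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment of a and b with a linear gap penalty.

    Returns a dict with the alignment score and the number of gaps and
    mismatches found on the traceback path.
    """
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(0,
                          h[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                          h[i - 1][j] + gap,      # gap in b
                          h[i][j - 1] + gap)      # gap in a
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    gaps = mismatches = 0
    while i > 0 and j > 0 and h[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + sub:
            if a[i - 1] != b[j - 1]:
                mismatches += 1
            i, j = i - 1, j - 1
        elif h[i][j] == h[i - 1][j] + gap:
            gaps += 1
            i -= 1
        else:
            gaps += 1
            j -= 1
    return {"score": best, "gaps": gaps, "mismatches": mismatches}


print(smith_waterman("ACGTACGT", "ACGAACGT"))
```

Keeping the traceback is what makes the gap and mismatch counts available; a score-only implementation can discard the matrix row by row and therefore cannot report them.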
Church, Philip C; Goscinski, Andrzej; Holt, Kathryn; Inouye, Michael; Ghoting, Amol; Makarychev, Konstantin; Reumann, Matthias
The challenge of comparing two or more genomes that have undergone recombination and substantial amounts of segmental loss and gain has recently been addressed for small numbers of genomes. However, datasets of hundreds of genomes are now common and their sizes will only increase in the future. Multiple sequence alignment of hundreds of genomes remains an intractable problem due to quadratic increases in compute time and memory footprint. To date, most alignment algorithms are designed for commodity clusters without parallelism. Hence, we propose the design of a multiple sequence alignment algorithm on massively parallel, distributed memory supercomputers to enable research into comparative genomics on large data sets. Following the methodology of the sequential progressiveMauve algorithm, we design data structures including sequences and sorted k-mer lists on the IBM Blue Gene/P supercomputer (BG/P). Preliminary results show that we can reduce the memory footprint so that we can potentially align over 250 bacterial genomes on a single BG/P compute node. We verify our results on a dataset of E. coli, Shigella and S. pneumoniae genomes. Our implementation returns results matching those of the original algorithm but in 1/2 the time and with 1/4 the memory footprint for scaffold building. In this study, we have laid the basis for multiple sequence alignment of large-scale datasets on a massively parallel, distributed memory supercomputer, thus enabling comparison of hundreds instead of a few genome sequences within reasonable time.
Gao, Kun; Miller, Jonathan
Distributions of duplicated sequences from genome self-alignment are characterized, including forward and backward alignments in bacteria and eukaryotes. A Markovian process without auto-correlation should generate an exponential distribution expected from local effects of point mutation and selection on localised function; however, the observed distributions show substantial deviation from exponential form – they are roughly algebraic instead – suggesting a novel kind of long-distance correlation that must be non-local in origin. PMID:21779315
Gillette, Todd A; Hosseini, Parsa; Ascoli, Giorgio A
The increasing abundance of neuromorphological data provides both the opportunity and the challenge to compare massive numbers of neurons from a wide diversity of sources efficiently and effectively. We implemented a modified global alignment algorithm representing axonal and dendritic bifurcations as strings of characters. Sequence alignment quantifies neuronal similarity by identifying branch-level correspondences between trees. The space generated from pairwise similarities is capable of classifying neuronal arbor types as well as, or better than, traditional topological metrics. Unsupervised cluster analysis produces groups that significantly correspond with known cell classes for axons, dendrites, and pyramidal apical dendrites. Furthermore, the distinguishing consensus topology generated by multiple sequence alignment of a group of neurons reveals their shared branching blueprint. Interestingly, the axons of dendritic-targeting interneurons in the rodent cortex associate with pyramidal axons but apart from the (more topologically symmetric) axons of perisomatic-targeting interneurons. Global pairwise and multiple sequence alignment of neurite topologies enables detailed comparison of neurites and identification of conserved topological features in alignment-defined clusters. The methods presented also provide a framework for incorporation of additional branch-level morphological features. Moreover, comparison of multiple alignment with motif analysis shows that the two techniques provide complementary information, respectively revealing global and local features.
Koretke, K. K.; Luthey-Schulten, Z.; Wolynes, P. G.
A quantitative form of the principle of minimal frustration is used to obtain from a database analysis statistical mechanical energy functions and gap parameters for aligning sequences to three-dimensional structures. The analysis that partially takes into account correlations in the energy landscape improves upon the previous approximations of Goldstein et al. (1994, 1995) (Goldstein R, Luthey-Schulten Z, Wolynes P, 1994, Proceedings of the 27th Hawaii International Conference on System Sciences. Los Alamitos, California: IEEE Computer Society Press. pp 306-315; Goldstein R, Luthey-Schulten Z, Wolynes P, 1995, In: Elber R, ed. New developments in theoretical studies of proteins. Singapore: World Scientific). The energy function allows for ordering of alignments based on the compatibility of a sequence to be in a given structure (i.e., lowest energy) and therefore removes the necessity of using percent identity or similarity as scoring parameters. The alignments produced by the energy function on distant homologues with low percent identity (less than 21%) are generally better than those generated with evolutionary information. The lowest energy alignment generated with the energy function for sequences containing prosite signatures but unknown structures is a structure containing the same prosite signature, providing a check on the robustness of the algorithm. Finally, the energy function can make use of known experimental evidence as constraints within the alignment algorithm to aid in finding the correct structural alignment. PMID:8762136
Liao, Weinan; Ren, Jie; Wang, Kun; Wang, Shun; Zeng, Feng; Wang, Ying; Sun, Fengzhu
The comparison between microbial sequencing data is critical to understand the dynamics of microbial communities. The alignment-based tools analyzing metagenomic datasets require reference sequences and read alignments. The available alignment-free dissimilarity approaches model the background sequences with Fixed Order Markov Chain (FOMC) yielding promising results for the comparison of microbial communities. However, in FOMC, the number of parameters grows exponentially with the increase of the order of Markov Chain (MC). Under a fixed high order of MC, the parameters might not be accurately estimated owing to the limitation of sequencing depth. In our study, we investigate an alternative to FOMC to model background sequences with the data-driven Variable Length Markov Chain (VLMC) in metatranscriptomic data. The VLMC originally designed for long sequences was extended to apply to high-throughput sequencing reads and the strategies to estimate the corresponding parameters were developed. The flexible number of parameters in VLMC avoids estimating the vast number of parameters of high-order MC under limited sequencing depth. Different from the manual selection in FOMC, VLMC determines the MC order adaptively. Several beta diversity measures based on VLMC were applied to compare the bacterial RNA-Seq and metatranscriptomic datasets. Experiments show that VLMC outperforms FOMC to model the background sequences in transcriptomic and metatranscriptomic samples. A software pipeline is available at https://d2vlmc.codeplex.com. PMID:27876823
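The exponential growth in the number of parameters of a fixed-order Markov chain can be checked with a short calculation. For an alphabet of size 4 (A, C, G, T), an order-k chain has 4^k contexts, each with 3 free transition probabilities. The helper name below is hypothetical and the count assumes no parameter sharing between contexts.

```python
def fomc_free_parameters(order, alphabet_size=4):
    """Free parameters of a fixed-order Markov chain.

    Each of the alphabet_size**order contexts has a probability
    distribution over the next symbol, with alphabet_size - 1 free values.
    """
    return alphabet_size ** order * (alphabet_size - 1)


for k in (0, 1, 5, 10):
    print(k, fomc_free_parameters(k))
```

At order 10 the model already has more than three million free parameters, which illustrates why limited sequencing depth makes high fixed orders hard to estimate and why a variable-length model that prunes contexts can be attractive.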
Burkov, Boris; Nagaev, Boris; Spirin, Sergei; Alexeevski, Andrei
It makes sense to speak of alignment of protein sequences only within regions where the sequences are related to each other. This simple consideration is often disregarded by programs for multiple alignment construction. MAlAKiTE (Multiple Alignment Automatic Kinship Tiling Engine), a package for alignment analysis, is introduced. It aims to find, within the whole alignment, blocks of reliable alignment that contain related regions only, and it allows the user to work with them. The validity of the detected reliable blocks was verified by comparison with structural data.
Chia-Ling Tsai; Hung-Chuan Hsu; Xin-Chang Wu; Shih-Jen Chen; Wei-Yang Lin
In ophthalmology, aligning images in indocyanine green and fluorescein angiograph sequences is important for the treatment of subretinal lesions. This paper introduces an algorithm that is tailored to align jointly in a common reference space all the images in an angiogram sequence containing both modalities. To overcome the issues of low image contrast and low signal-to-noise ratio for late-phase images, the structural similarity between two images is enhanced using Gabor wavelet transform. Image pairs are pairwise registered and the transformations are simultaneously and globally adjusted for a mutually consistent joint alignment. The joint registration process is incremental and the success depends on the correctness of matches from the pairwise registration. To safeguard the joint process, our system performs the consistency test to exclude incorrect pairwise results automatically to ensure correct matches as more images are jointly aligned. Our dataset consists of 60 sequences of polypoidal choroidal vasculopathy collected by the EVEREST Study Group. On average, each sequence contains 20 images. Our algorithm successfully pairwise registered 95.04% of all image pairs, and joint registered 98.7% of all images, with an average alignment error of 1.58 pixels.
Interpretation of multiple sequence alignments is of major interest for the prediction of functional and structural domains in proteins or for the organization of related sequences in families and subfamilies. However, a necessity for the bench scientist is the use of outstanding programs in a friendly computing environment. This paper describes Color and Graphic Display (CGD), a set of modules that runs as part of the Microsoft Excel spreadsheet to color and analyze multiple sequence alignments. Discussed here are the main functions of CGD and the use of the program to highlight residues of importance in a water channel family. Although CGD was created for protein sequences, most of the modules are compatible with DNA sequences.
Lan, Haidong; Chan, Yuandong; Xu, Kai; Schmidt, Bertil; Peng, Shaoliang; Liu, Weiguo
Computing alignments between two or more sequences is a common operation in computational molecular biology. The continuing growth of biological sequence databases establishes the need for efficient parallel implementations on modern accelerators. This paper presents new approaches to high performance biological sequence database scanning with the Smith-Waterman algorithm and the first stage of progressive multiple sequence alignment based on the ClustalW heuristic on a Xeon Phi-based compute cluster. Our approach uses a three-level parallelization scheme to take full advantage of the compute power available on this type of architecture, i.e. cluster-level data parallelism, thread-level coarse-grained parallelism, and vector-level fine-grained parallelism. Furthermore, we re-organize the sequence datasets and use Xeon Phi shuffle operations to improve I/O efficiency. Evaluations show that our method achieves a peak overall performance of up to 220 GCUPS for scanning real protein sequence databanks on a single node consisting of two Intel E5-2620 CPUs and two Intel Xeon Phi 7110P cards. It also exhibits good scalability in terms of sequence length and size, and number of compute nodes for both database scanning and multiple sequence alignment. Furthermore, the achieved performance is highly competitive in comparison to optimized Xeon Phi and GPU implementations. Our implementation is available at https://github.com/turbo0628/LSDBS-mpi .
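For orientation, the cell-update recurrence that Smith-Waterman database scanners accelerate can be written as a plain, unvectorized sketch. The scoring values (match +2, mismatch -1, linear gap -2) and the function names are illustrative assumptions; they are not the parameters or the code of the implementation described above, which would typically use substitution matrices and vectorized kernels. The helper gcups only shows how a "giga cell updates per second" figure is derived from a cell count and a wall-clock time.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of a and b (linear gap penalty).

    Scoring values are illustrative assumptions, not taken from the paper.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def gcups(cell_updates, seconds):
    """Giga cell updates per second for a given number of matrix cells."""
    return cell_updates / seconds / 1e9
```

Calling smith_waterman_score(query, subject) once per database sequence is the serial baseline; the three parallelization levels named in the abstract distribute exactly this loop nest across nodes, threads and vector lanes.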
Lu, Emily; Elizondo-Riojas, Miguel-Angel; Chang, Jeffrey T; Volk, David E
Next-generation sequencing results from bead-based aptamer libraries have demonstrated that traditional DNA/RNA alignment software is insufficient. This is particularly true for X-aptamers containing specialty bases (W, X, Y, Z, ...) that are identified by special encoding. Thus, we sought an automated program that uses the inherent design scheme of bead-based X-aptamers to create a hypothetical reference library and Markov modeling techniques to provide improved alignments. Aptaligner provides this feature as well as length error and noise level cutoff features, is parallelized to run on multiple central processing units (cores), and sorts sequences from a single chip into projects and subprojects.
Ovcharenko, I; Loots, G; Giardine, B; Hou, M; Ma, J; Hardison, R; Stubbs, L; Miller, W
Multiple sequence alignment analysis is a powerful approach for understanding phylogenetic relationships, annotating genes and detecting functional regulatory elements. With a growing number of partly or fully sequenced vertebrate genomes, effective tools for performing multiple comparisons are required to accurately and efficiently assist biological discoveries. Here we introduce Mulan (http://mulan.dcode.org/), a novel method and a network server for comparing multiple draft and finished-quality sequences to identify functional elements conserved over evolutionary time. Mulan brings together several novel algorithms: the tba multi-aligner program for rapid identification of local sequence conservation and the multiTF program for detecting evolutionarily conserved transcription factor binding sites in multiple alignments. In addition, Mulan supports two-way communication with the GALA database; alignments of multiple species dynamically generated in GALA can be viewed in Mulan, and conserved transcription factor binding sites identified with Mulan/multiTF can be integrated and overlaid with extensive genome annotation data using GALA. Local multiple alignments computed by Mulan ensure reliable representation of short- and large-scale genomic rearrangements in distant organisms. Mulan allows for interactive modification of critical conservation parameters to differentially predict conserved regions in comparisons of both closely and distantly related species. We illustrate the uses and applications of the Mulan tool through multi-species comparisons of the GATA3 gene locus and the identification of elements that are conserved differently in avians than in other genomes, allowing speculation on the evolution of birds. Source code for the aligners and the aligner-evaluation software can be freely downloaded from http://bio.cse.psu.edu/.
Tan, Ge; Muffato, Matthieu; Ledergerber, Christian; Herrero, Javier; Goldman, Nick; Gil, Manuel; Dessimoz, Christophe
Phylogenetic inference is generally performed on the basis of multiple sequence alignments (MSA). Because errors in an alignment can lead to errors in tree estimation, there is a strong interest in identifying and removing unreliable parts of the alignment. In recent years several automated filtering approaches have been proposed, but despite their popularity, a systematic and comprehensive comparison of different alignment filtering methods on real data has been lacking. Here, we extend and apply recently introduced phylogenetic tests of alignment accuracy on a large number of gene families and contrast the performance of unfiltered versus filtered alignments in the context of single-gene phylogeny reconstruction. Based on multiple genome-wide empirical and simulated data sets, we show that the trees obtained from filtered MSAs are on average worse than those obtained from unfiltered MSAs. Furthermore, alignment filtering often leads to an increase in the proportion of well-supported branches that are actually wrong. We confirm that our findings hold for a wide range of parameters and methods. Although our results suggest that light filtering (up to 20% of alignment positions) has little impact on tree accuracy and may save some computation time, contrary to widespread practice, we do not generally recommend the use of current alignment filtering methods for phylogenetic inference. By providing a way to rigorously and systematically measure the impact of filtering on alignments, the methodology set forth here will guide the development of better filtering algorithms. © The Author(s) 2015. Published by Oxford University Press on behalf of the Society of Systematic Biologists.
Hagopian, Raffi; Davidson, John R; Datta, Ruchira S; Samad, Bushra; Jarvis, Glen R; Sjölander, Kimmen
We present the jump-start simultaneous alignment and tree construction using hidden Markov models (SATCHMO-JS) web server for simultaneous estimation of protein multiple sequence alignments (MSAs) and phylogenetic trees. The server takes as input a set of sequences in FASTA format, and outputs a phylogenetic tree and MSA; these can be viewed online or downloaded from the website. SATCHMO-JS is an extension of the SATCHMO algorithm, and employs a divide-and-conquer strategy to jump-start SATCHMO at a higher point in the phylogenetic tree, reducing the computational complexity of the progressive all-versus-all HMM-HMM scoring and alignment. Results on a benchmark dataset of 983 structurally aligned pairs from the PREFAB benchmark dataset show that SATCHMO-JS provides a statistically significant improvement in alignment accuracy over MUSCLE, Multiple Alignment using Fast Fourier Transform (MAFFT), ClustalW and the original SATCHMO algorithm. The SATCHMO-JS webserver is available at http://phylogenomics.berkeley.edu/satchmo-js. The datasets used in these experiments are available for download at http://phylogenomics.berkeley.edu/satchmo-js/supplementary/.
Rani, R Ranjani; Ramyachitra, D
Multiple sequence alignment (MSA) is a widespread approach in computational biology and bioinformatics. MSA deals with how sequences of nucleotides or amino acids are arranged into an alignment with a minimum number of gaps between them, which points to the functional, evolutionary and structural relationships among the sequences. Computing an MSA that is both accurate and statistically significant remains a challenging task. In this work, the Bacterial Foraging Optimization Algorithm was employed to align the biological sequences, which resulted in a non-dominated optimal solution. It uses multiple objectives: maximization of similarity, non-gap percentage and conserved blocks, and minimization of gap penalty. The BAliBASE 3.0 benchmark database was used to examine the proposed algorithm against other methods. In this paper, two algorithms have been proposed: Hybrid Genetic Algorithm with Artificial Bee Colony (GA-ABC) and Bacterial Foraging Optimization Algorithm. Hybrid GA-ABC performed better than the existing optimization algorithms, but the conserved blocks were still not obtained with GA-ABC. BFO was then used for the alignment, and the conserved blocks were obtained. The proposed Multi-Objective Bacterial Foraging Optimization Algorithm (MO-BFO) was compared with the widely used MSA methods Clustal Omega, Kalign, MUSCLE, MAFFT, Genetic Algorithm (GA), Ant Colony Optimization (ACO), Artificial Bee Colony (ABC), Particle Swarm Optimization (PSO) and Hybrid Genetic Algorithm with Artificial Bee Colony (GA-ABC). The final results show that the proposed MO-BFO algorithm yields better alignment than most widely used methods.
Oldfield, Thomas J
The amount of biological data available from experimental techniques is huge, and rapidly expanding. The ability to make sense of this vast amount of data requires that we make correlations between distinct biological disciplines using visualization techniques to highlight the critical information. This article describes the visualization techniques of dynamic data brushing, view context maintenance, fisheye sequence view, and a magic lens that have been developed to display protein structure and sequence information.
STS072-737-012 (11-20 Jan. 1996) --- The astronauts photographed this view of Java, an Indonesian island. Java lies between the Java Sea at top and the Indian Ocean at bottom (north is located at top center). A line of volcanoes on the southern edge of the island, trending from central to eastern areas, is highlighted by a ring of clouds. Off the southern coast of Java is the Java Trench where the Australian plate, to the south, is diving under the Eurasia plate to the north. According to anthropologists, Java has one of the highest populations in Indonesia because the soil is enriched by volcanic ash. Merapi volcano, at left edge, second volcano to the right, rises to 9,550 feet and erupts frequently. Madura Island, partially obscured by clouds, can be seen on the upper eastern end of Java.
Östlund, Johan; Wrigstad, Tobias
This paper presents Welterweight Java (WJ), a new minimal core Java calculus intended to be a suitable starting point for investigations in the semantics of Java-like programs. To this end, WJ adds a few extra pounds to Featherweight Java. WJ is imperative and stateful, which is a frequent extension of Featherweight Java. To account for the importance of concurrency, WJ models Java's thread-based concurrency and lock-based synchronisation. The design of WJ is distilled from recent work on concurrent Java-like systems. We believe that the calculus is a good starting point for extensions. We illustrate the potential of the calculus by showing two extensions. The first is a version of WJ extended with deep ownership. This serves two purposes - it is a minimal formalisation of ownership, interesting in its own right, and shows how easily WJ can be extended. The second is a simple non-null types system.
To decrypt a doubly heterozygous sequence (DHS) in order to define the indel mutation for mutation reporting, an algorithm recursively searching the overlapped nucleotide using an offset of nucleotide positions can decrypt the indel without using a reference sequence. However, as genetic code is letter-based, special computer programs are required to run the decryption algorithm. The previous text-based algorithm was converted to a number-based algorithm by expressing DNA sequence from a 4-letter genetic code to a 4-prime-number genetic code, i.e., converting A, C, G, T to 2, 3, 5, and 7. This algorithm based on prime-number genetic code is called PrimeIndel and is executable by spreadsheet. Using prime number coded DNA sequence, the overlapped nucleotide between any 2 positions of the DHS is represented by the greatest common divisor (GCD) of the multiplication product of 2 prime numbers. This algorithm can also be used for aligning multiple overlapping sequence reads by in-silico DHS formation. The indel size of the in-silico formed DHS indicates the positions in the paired sequences for correct alignment. DHSs were successfully decrypted by the prime number-based algorithm and sequence reads were aligned correctly. DNA sequence expressed in prime numbers can be used for the decryption of DHS and the alignment of sequence reads using a well-known mathematical function GCD of a spreadsheet program. PrimeIndel is a useful tool for mutation reporting in clinical laboratories. The software is downloadable from http://www.patho.hku.hk/staff/list/cwlam.htm. Copyright © 2014 Elsevier B.V. All rights reserved.
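The prime-number encoding can be illustrated with a minimal sketch. The mapping of A, C, G, T to 2, 3, 5, 7 and the use of the GCD come from the abstract; the function names are hypothetical, and the recursive offset search that PrimeIndel performs over a whole sequence is not reproduced here. Each position of a doubly heterozygous sequence is stored as the product of the primes of its two bases, so the bases shared by two positions are exactly the primes that divide the GCD of the two products.

```python
from math import gcd

# Encoding stated in the abstract: A, C, G, T -> 2, 3, 5, 7.
PRIME_CODE = {"A": 2, "C": 3, "G": 5, "T": 7}


def encode_position(base1, base2):
    """Encode one (possibly heterozygous) position as a product of two primes."""
    return PRIME_CODE[base1] * PRIME_CODE[base2]


def shared_bases(pos_x, pos_y):
    """Return the bases common to two encoded positions, in A, C, G, T order."""
    g = gcd(pos_x, pos_y)
    return [base for base, prime in PRIME_CODE.items() if g % prime == 0]
```

For example, a position reading A/C (2 × 3 = 6) and a position reading C/T (3 × 7 = 21) have GCD 3, which decodes to C. In a spreadsheet the same step is a single GCD cell formula, which is why no string-handling program is needed.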
Chen, Weiyang; Liao, Bo; Zhu, Wen; Xiang, Xuyu
In this article, we describe a representation for the process of multiple sequence alignment (MSA) and use it to solve the MSA problem. With this representation, we take every possible alignment result into account by defining the representation of gap insertion, the value of heuristic information on every optional path, and the scoring rule. On the basis of the proposed multidimensional graph, we use the ant colony algorithm to find a better path, which denotes a better alignment result. We present instances of a three-dimensional graph and a four-dimensional graph and put forward a special ichnographic representation to analyze MSA. The software is still experimental, and we give an example of finding the best alignment result with a three-dimensional graph and the ant colony algorithm. Experimental results show that our method can improve the solution quality on MSA benchmarks. Copyright 2009 Wiley Periodicals, Inc.
BINDEWALD, ECKART; SHAPIRO, BRUCE A.
We present a machine learning method (a hierarchical network of k-nearest neighbor classifiers) that uses an RNA sequence alignment in order to predict a consensus RNA secondary structure. The input to the network is the mutual information, the fraction of complementary nucleotides, and a novel consensus RNAfold secondary structure prediction of a pair of alignment columns and its nearest neighbors. Given this input, the network computes a prediction as to whether a particular pair of alignment columns corresponds to a base pair. By using a comprehensive test set of 49 RFAM alignments, the program KNetFold achieves an average Matthews correlation coefficient of 0.81. This is a significant improvement compared with the secondary structure prediction methods PFOLD and RNAalifold. By using the example of archaeal RNase P, we show that the program can also predict pseudoknot interactions. PMID:16495232
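Two of the per-column-pair inputs named above, mutual information and the fraction of complementary nucleotides, can be computed as in the following sketch. It uses the standard definition of mutual information; the treatment of gap characters as ordinary symbols, the inclusion of G-U wobble pairs among the complementary pairs, and the function names are assumptions made for illustration and are not taken from KNetFold.

```python
from collections import Counter
from math import log2

# Assumption: Watson-Crick pairs plus G-U wobble pairs count as complementary.
COMPLEMENTARY = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def column_mutual_information(col_i, col_j):
    """Mutual information (bits) between two alignment columns of equal length."""
    if len(col_i) != len(col_j):
        raise ValueError("columns must have the same number of sequences")
    n = len(col_i)
    freq_i, freq_j = Counter(col_i), Counter(col_j)
    freq_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (x, y), count in freq_ij.items():
        p_xy = count / n
        mi += p_xy * log2(p_xy / ((freq_i[x] / n) * (freq_j[y] / n)))
    return mi


def complementary_fraction(col_i, col_j):
    """Fraction of sequences whose two characters form a complementary pair."""
    pairs = list(zip(col_i, col_j))
    return sum(pair in COMPLEMENTARY for pair in pairs) / len(pairs)
```

Columns that covary, such as "AAGG" paired with "UUCC", give a high mutual information, whereas two perfectly conserved columns give zero even if they are base paired; this is one reason the complementary fraction is a useful second feature.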
Althaus, Ernst; Caprara, Alberto; Lenhof, Hans-Peter; Reinert, Knut
Multiple sequence alignment is one of the dominant problems in computational molecular biology. Numerous scoring functions and methods have been proposed, most of which result in NP-hard problems. In this paper we propose for the first time a general formulation for multiple alignment with arbitrary gap-costs based on an integer linear program (ILP). In addition we describe a branch-and-cut algorithm to effectively solve the ILP to optimality. We evaluate the performances of our approach in terms of running time and quality of the alignments using the BAliBase database of reference alignments. The results show that our implementation ranks amongst the best programs developed so far.
Schwarz, Roland F; Tamuri, Asif U; Kultys, Marek; King, James; Godwin, James; Florescu, Ana M; Schultz, Jörg; Goldman, Nick
Sequence Logos and its variants are the most commonly used method for visualization of multiple sequence alignments (MSAs) and sequence motifs. They provide consensus-based summaries of the sequences in the alignment. Consequently, individual sequences cannot be identified in the visualization and covariant sites are not easily discernible. We recently proposed Sequence Bundles, a motif visualization technique that maintains a one-to-one relationship between sequences and their graphical representation and visualizes covariant sites. We here present Alvis, an open-source platform for the joint explorative analysis of MSAs and phylogenetic trees, employing Sequence Bundles as its main visualization method. Alvis combines the power of the visualization method with an interactive toolkit allowing detection of covariant sites, annotation of trees with synapomorphies and homoplasies, and motif detection. It also offers numerical analysis functionality, such as dimension reduction and classification. Alvis is user-friendly, highly customizable and can export results in publication-quality figures. It is available as a full-featured standalone version (http://www.bitbucket.org/rfs/alvis) and its Sequence Bundles visualization module is further available as a web application (http://science-practice.com/projects/sequence-bundles). © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
Raghava, GPS; Barton, Geoffrey J
Background Percentage Identity (PID) is frequently quoted in discussion of sequence alignments since it appears simple and easy to understand. However, although there are several different ways to calculate percentage identity and each may yield a different result for the same alignment, the method of calculation is rarely reported. Accordingly, quantification of the variation in PID caused by the different calculations would help in interpreting PID values in the literature. In this study, the variation in PID was quantified systematically on a reference set of 1028 alignments generated by comparison of the protein three-dimensional structures. Since the alignment algorithm may also affect the range of PID, this study also considered the effect of algorithm, and the combination of algorithm and PID method. Results The maximum variation in PID due to the calculation method was 11.5% while the effect of alignment algorithm on PID was up to 14.6% across three popular alignment methods. The combined effect of alignment algorithm and PID calculation gave a variation of up to 22% on the test data, with an average of 5.3% ± 2.8% for sequence pairs with < 30% identity. In order to see which PID method was most highly correlated with structural similarity, four different PID calculations were compared to similarity scores (Sc) from the comparison of the corresponding protein three-dimensional structures. The highest correlation coefficient for a PID calculation was 0.80. In contrast, the more sophisticated Z-score calculated by reference to randomized sequences gave a correlation coefficient of 0.84. Conclusion Although it is well known amongst expert sequence analysts that PID is a poor score for discriminating between protein sequences, the apparent simplicity of the percentage identity score encourages its widespread use in establishing cutoffs for structural similarity. This paper illustrates that not only is PID a poor measure of sequence similarity when compared to
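How much the choice of denominator matters can be seen in a small sketch. The four denominators used below (alignment length, aligned residue pairs, length of the shorter sequence, mean sequence length) are common choices given here as assumptions; the text above does not spell out the four calculations compared in the study, and the function name is hypothetical. The input is one pairwise alignment written as two gapped strings of equal, non-zero length.

```python
def pid_variants(aln_a, aln_b, gap="-"):
    """Percentage identity of one pairwise alignment under four denominators.

    The denominators are illustrative assumptions, not necessarily those
    of the study.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    columns = list(zip(aln_a, aln_b))
    identical = sum(1 for x, y in columns if x == y and x != gap)
    len_a = sum(1 for x in aln_a if x != gap)
    len_b = sum(1 for y in aln_b if y != gap)
    denominators = {
        # Columns holding at least one residue (terminal gaps included).
        "alignment_length": sum(1 for x, y in columns if x != gap or y != gap),
        # Columns where both sequences have a residue.
        "aligned_pairs": sum(1 for x, y in columns if x != gap and y != gap),
        "shorter_sequence": min(len_a, len_b),
        "mean_length": (len_a + len_b) / 2,
    }
    return {name: 100.0 * identical / d for name, d in denominators.items()}
```

For the alignment "ACDEFG--" against "ACD-FGHI" the same five identical positions give 62.5%, 100%, about 83.3% and about 76.9% depending on the denominator, which illustrates why a reported PID is hard to interpret without the calculation method.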
Qian, Kun; Luan, Yihui
Alignment-free sequence comparison is becoming fairly popular in many fields of computational biology because it places fewer requirements on the sequences themselves and is computationally efficient for large-scale sequence data sets. In particular, approaches based on k-tuples, such as D2, D2S and D2∗, are used widely and effectively. However, these measures treat each k-tuple equally without accounting for potential differences in importance among k-tuples. In this paper, we use the maximizing deviation method, proposed in multiple attribute decision making, to evaluate the weights of different k-tuples. We modify D2, D2S and D2∗ with weights and test them by similarity search and evaluation on functionally related regulatory sequences. The results demonstrate that the newly proposed measures are more efficient and robust compared to existing alignment-free methods.
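As a point of reference, the unweighted D2 statistic is the sum, over all k-tuples, of the product of the k-tuple counts in the two sequences. The sketch below implements that definition and adds an optional per-tuple weight. The weights argument is a hypothetical input: the maximizing deviation procedure that derives the weights is not reproduced, and D2S and D2∗, which additionally require a background model, are omitted.

```python
from collections import Counter


def ktuple_counts(seq, k):
    """Count all overlapping k-tuples of a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_x, seq_y, k, weights=None):
    """D2 statistic; with weights, each k-tuple's term is scaled by its weight.

    `weights` maps k-tuples to weights and defaults to 1.0 per tuple
    (an illustrative interface, not the paper's).
    """
    counts_x, counts_y = ktuple_counts(seq_x, k), ktuple_counts(seq_y, k)
    total = 0.0
    # Only k-tuples present in both sequences contribute a non-zero product.
    for word in counts_x.keys() & counts_y.keys():
        weight = 1.0 if weights is None else weights.get(word, 1.0)
        total += weight * counts_x[word] * counts_y[word]
    return total
```

With k = 2, d2("ACGT", "ACGA", 2) counts the shared tuples AC and CG once each and returns 2.0; passing weights={"AC": 3.0} raises the contribution of AC and returns 4.0.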
Long, Hai-Xia; Xu, Wen-Bo; Sun, Jun
Multiple sequence alignment (MSA) is a fundamental and challenging problem in the analysis of biological sequences. The MSA problem is hard to solve directly because its complexity grows exponentially with the scale of the problem. In this paper, we propose mutation-based binary particle swarm optimization (M-BPSO) for solving MSA. In the proposed M-BPSO algorithm, the BPSO algorithm is used to provide alignments. Thereafter, a mutation operator is applied to move out of local optima and speed up convergence. Simulation results on nucleic acid and amino acid sequences show that the proposed M-BPSO algorithm has superior performance when compared to other existing algorithms. Furthermore, this algorithm can be used quickly and efficiently for small and medium-sized sequences.
Depiereux, E; Baudoux, G; Briffeuil, P; Reginster, I; De Bolle, X; Vinals, C; Feytmans, E
The Match-Box software comprises protein sequence alignment tools based on strict statistical thresholds of similarity between protein segments. The method circumvents the gap penalty requirement: gaps being the result of the alignment and not a governing parameter of the procedure. The reliable conserved regions outlined by Match-Box are particularly relevant for homology modelling of protein structures, prediction of essential residues for site-directed mutagenesis and oligonucleotide design for cloning homologous genes by polymerase chain reaction (PCR). The method produces reliable results, as assessed by tests performed on protein families of known structures and of low sequence similarity. A reliability score is computed in relation to a threshold of similarity progressively raised to extend the aligned regions to their maximal length, up to the significance limit of matching segments. The score obtained at each position is printed below the sequences and allows a discriminant reading of each aligned region. Sequences may be submitted to a Web server at http://www.fundp.ac.be/sciences/biologie/bms/matchbox_submit.html or sent by e-mail to matchbox@biq.fundp.ac.be (help available by just mailing help).
Background Methods of alignment masking, which refers to the technique of excluding alignment blocks prior to tree reconstructions, have been successful in improving the signal-to-noise ratio in sequence alignments. However, the lack of formally well defined methods to identify randomness in sequence alignments has prevented a routine application of alignment masking. In this study, we compared the effects on tree reconstructions of the most commonly used profiling method (GBLOCKS), which uses a predefined set of rules in combination with alignment masking, with a new profiling approach (ALISCORE) based on Monte Carlo resampling within a sliding window, using different data sets and alignment methods. While the GBLOCKS approach excludes variable sections above a certain threshold, the choice of which is left arbitrary, the ALISCORE algorithm is free of a priori rating of parameter space and therefore more objective. Results ALISCORE was successfully extended to amino acids using a proportional model and empirical substitution matrices to score randomness in multiple sequence alignments. A complex bootstrap resampling leads to an even distribution of scores of randomly similar sequences to assess randomness of the observed sequence similarity. Testing performance on real data, both masking methods, GBLOCKS and ALISCORE, helped to improve tree resolution. The sliding window approach was less sensitive to different alignments of identical data sets and performed equally well on all data sets. Concurrently, ALISCORE is capable of dealing with different substitution patterns and heterogeneous base composition. ALISCORE and the most relaxed GBLOCKS gap parameter setting performed best on all data sets. Correspondingly, Neighbor-Net analyses showed the greatest decrease in conflict. Conclusions Alignment masking improves signal-to-noise ratio in multiple sequence alignments prior to phylogenetic reconstruction. Given the robust performance of alignment profiling, alignment masking
Chen, Xi; Wang, Chen; Tang, Shanjiang; Yu, Ce; Zou, Quan
Multiple sequence alignment (MSA) is a classic and powerful technique for sequence analysis in bioinformatics. With the rapid growth of biological datasets, MSA parallelization becomes necessary to keep its running time at an acceptable level. Although there is a lot of work on MSA problems, existing approaches are either insufficient or contain implicit assumptions that limit their generality. First, the properties of users' sequences, including the sizes of datasets and the lengths of sequences, can take arbitrary values and are generally unknown before submission, which previous work has unfortunately ignored. Second, the center star strategy is suited for aligning similar sequences, but its first stage, center sequence selection, is highly time-consuming and requires further optimization. Moreover, given the heterogeneous CPU/GPU platform, prior studies consider MSA parallelization on GPU devices only, leaving the CPUs idle during the computation. Co-run computation, however, can maximize the utilization of the computing resources by enabling the workload computation on both CPU and GPU simultaneously. This paper presents CMSA, a robust and efficient MSA system for large-scale datasets on the heterogeneous CPU/GPU platform. It performs and optimizes multiple sequence alignment automatically for users' submitted sequences without any assumptions. CMSA adopts the co-run computation model so that both CPU and GPU devices are fully utilized. Moreover, CMSA proposes an improved center star strategy that reduces the time complexity of its center sequence selection process from O(mn^2) to O(mn). The experimental results show that CMSA achieves up to an 11× speedup and outperforms the state-of-the-art software. CMSA focuses on the alignment of multiple similar RNA/DNA sequences and proposes a novel bitmap-based algorithm to improve the center star strategy. We can conclude that harvesting the high performance of modern GPU is a promising approach to
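The cost of center sequence selection is easiest to see in the classic all-pairs formulation, sketched below: every sequence is compared with every other one, and the sequence with the smallest total distance becomes the center. This is the textbook baseline given as an assumption for illustration, with edit distance as the (assumed) pairwise measure and hypothetical function names; it is not CMSA's bitmap-based selection, whose details are not given in the text above.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        curr = [i]
        for j, ch_b in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                   # deletion
                            curr[j - 1] + 1,               # insertion
                            prev[j - 1] + (ch_a != ch_b)))  # match or substitution
        prev = curr
    return prev[-1]


def select_center(seqs):
    """Index of the sequence with the smallest summed distance to all others.

    Naive baseline: one pairwise comparison for every ordered pair.
    """
    totals = [sum(edit_distance(s, t) for t in seqs) for s in seqs]
    return min(range(len(seqs)), key=totals.__getitem__)
```

Because the number of pairwise comparisons grows quadratically with the number of sequences, this first stage dominates the running time on large inputs, which is the bottleneck an improved selection strategy targets.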
Lawrence, C.; Lin, L.; Lisiecki, L. E.; Stern, J.
The stratigraphic alignment of ocean sediment cores plays a vital role in paleoceanographic research because it is used to develop mutually consistent age models for climate proxies measured in these cores. The most common proxy used for alignment is the δ18O of calcite from benthic or planktonic foraminifera because a large fraction of δ18O variance derives from the global signal of ice volume. To date, alignment has been performed either by manual, qualitative comparison or by deterministic algorithms (Martinson, Pisias et al. Quat. Res. 27 1987; Lisiecki and Lisiecki Paleoceanography 17, 2002; Huybers and Wunsch, Paleoceanography 19, 2004). Here we present a probabilistic sequence alignment algorithm which provides 95% confidence bands for the alignment of pairs of benthic δ18O records. The probabilistic algorithm presented here is based on a hidden Markov model (HMM) (Levinson, Rabiner et al. Bell Systems Technical Journal, 62,1983) similar to those that have been used extensively to align DNA and protein sequences (Durbin, Eddy et al. Biological Sequence Analysis, Ch. 4, 1998). However, here the need for alignment stems from expansion and/or contraction in the records due to changes in sedimentation rates rather than the insertion or deletion of residues. Transition probabilities that are used in this HMM to model changes in sedimentation rates are based on radiocarbon estimates of sedimentation rates. The probabilistic algorithm considers all possible alignments with these predefined sedimentation rates. Exact calculations are completed using dynamic programming recursions. The algorithm yields the probability distributions of the age at each point in the record, which are probabilistically
Li, Yushuang; Liu, Qian; Zheng, Xiaoqi
A highly compact and simple 2D graphical representation of DNA sequences, named DUC-Curve, is constructed through mapping four nucleotides to a unit circle with a cyclic order. DUC-Curve could directly detect nucleotide, di-nucleotide compositions and microsatellite structure from DNA sequences. Moreover, it also could be used for DNA sequence alignment. Taking geometric center vectors of DUC-Curves as sequence descriptor, we perform similarity analysis on the first exons of β-globin genes of 11 species, oncogene TP53 of 27 species and twenty-four Influenza A viruses, respectively. The obtained reasonable results illustrate that the proposed method is very effective in sequence comparison problems, and will at least play a complementary role in classification and clustering problems.
Dr. George L. Mesina; Steven P. Miller
The XMGR5 graphing package for drawing RELAP5 plots is being re-written in Java. Java is a robust programming language that is available at no cost for most computer platforms from Sun Microsystems, Inc. XMGR5 is an extension of an XY plotting tool called ACE/gr extended to plot data from several US Nuclear Regulatory Commission (NRC) applications. It is also the most popular graphing package worldwide for making RELAP5 plots. In Section 1, a short review of XMGR5 is given, followed by a brief overview of Java. In Section 2, shortcomings of both tkXMGR and XMGR5 are discussed and the value of converting to Java is given. Details of the conversion to Java are given in Section 3. The progress to date, some conclusions and future work are given in Section 4. Some screen shots of the Java version are shown.
Lee, Bernett T K; Tan, Tin Wee; Ranganathan, Shoba
Splicing is a biological phenomenon that removes the non-coding sequence from the transcripts to produce a mature transcript suitable for translation. To study this phenomenon, information on the intron-exon arrangement of a gene is essential, usually obtained by aligning mRNA/EST sequences to their cognate genomic sequences. MGAlign is a novel, rapid, memory efficient and practical method for aligning mRNA/EST and genome sequences. We present here a freely available web service, MGAlignIt (http://origin.bic.nus.edu.sg/mgalign/mgalignit), based on MGAlign. Besides the alignment itself, this web service allows users to effectively visualize the alignment in a graphical manner and to perform limited analysis on the alignment output. The server also permits the alignment to be saved in several forms, both graphical and text, suitable for further processing and analysis by other programs.
Cannone, Jamie J.; Sweeney, Blake A.; Petrov, Anton I.; Gutell, Robin R.; Zirbel, Craig L.; Leontis, Neocles
The RNA 3D Structure-to-Multiple Sequence Alignment Server (R3D-2-MSA) is a new web service that seamlessly links RNA three-dimensional (3D) structures to high-quality RNA multiple sequence alignments (MSAs) from diverse biological sources. In this first release, R3D-2-MSA provides manual and programmatic access to curated, representative ribosomal RNA sequence alignments from bacterial, archaeal, eukaryal and organellar ribosomes, using nucleotide numbers from representative atomic-resolution 3D structures. A web-based front end is available for manual entry and an Application Program Interface for programmatic access. Users can specify up to five ranges of nucleotides and 50 nucleotide positions per range. The R3D-2-MSA server maps these ranges to the appropriate columns of the corresponding MSA and returns the contents of the columns, either for display in a web browser or in JSON format for subsequent programmatic use. The browser output page provides a 3D interactive display of the query, a full list of sequence variants with taxonomic information and a statistical summary of distinct sequence variants found. The output can be filtered and sorted in the browser. Previous user queries can be viewed at any time by resubmitting the output URL, which encodes the search and re-generates the results. The service is freely available with no login requirement at http://rna.bgsu.edu/r3d-2-msa. PMID:26048960
Background: Logos are commonly used in molecular biology to provide a compact graphical representation of the conservation pattern of a set of sequences. They render the information contained in sequence alignments or profile hidden Markov models by drawing a stack of letters for each position, where the height of the stack corresponds to the conservation at that position, and the height of each letter within a stack depends on the frequency of that letter at that position. Results: We present a new tool and web server, called Skylign, which provides a unified framework for creating logos for both sequence alignments and profile hidden Markov models. In addition to static image files, Skylign creates a novel interactive logo plot for inclusion in web pages. These interactive logos enable scrolling, zooming, and inspection of underlying values. Skylign can avoid sampling bias in sequence alignments by down-weighting redundant sequences and by combining observed counts with informed priors. It also simplifies the representation of gap parameters, and can optionally scale letter heights based on alternate calculations of the conservation of a position. Conclusion: Skylign is available as a website, a scriptable web service with a RESTful interface, and as a software package for download. Skylign’s interactive logos are easily incorporated into a web page with just a few lines of HTML markup. Skylign may be found at http://skylign.org. PMID:24410852
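The relationship between stack height and letter height described above can be illustrated with the classic information-content calculation for sequence logos. The following is a minimal sketch only, not Skylign's implementation: the function names, the DNA alphabet and the uniform background are assumptions made for illustration, and none of Skylign's sequence weighting, priors or gap handling is modelled.

```python
import math
from collections import Counter


def column_letter_heights(column, alphabet="ACGT"):
    """Return {letter: height in bits} for one alignment column.

    Stack height = log2(|alphabet|) - Shannon entropy of the column
    (uniform background assumed); each letter's height is its
    frequency times the stack height.
    """
    residues = [c for c in column.upper() if c in alphabet]
    if not residues:
        return {}
    counts = Counter(residues)
    total = len(residues)
    freqs = {a: counts[a] / total for a in alphabet if counts[a] > 0}
    entropy = -sum(p * math.log2(p) for p in freqs.values())
    stack_height = math.log2(len(alphabet)) - entropy
    return {a: p * stack_height for a, p in freqs.items()}


def logo_heights(alignment, alphabet="ACGT"):
    """Compute letter heights for every column of an aligned sequence list."""
    return [column_letter_heights("".join(col), alphabet)
            for col in zip(*alignment)]
```

A fully conserved column yields a single letter of 2 bits, while a column split evenly between two letters yields two letters of 0.5 bits each.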
Muhire, Brejnev Muhizi; Varsani, Arvind; Martin, Darren Patrick
The perpetually increasing rate at which viral full-genome sequences are being determined is creating a pressing demand for computational tools that will aid the objective classification of these genome sequences. Taxonomic classification approaches that are based on pairwise genetic identity measures are potentially highly automatable and are progressively gaining favour with the International Committee on Taxonomy of Viruses (ICTV). There are, however, various issues with the calculation of such measures that could potentially undermine the accuracy and consistency with which they can be applied to virus classification. Firstly, pairwise sequence identities computed based on multiple sequence alignments rather than on multiple independent pairwise alignments can lead to the deflation of identity scores with increasing dataset sizes. Also, when gap-characters need to be introduced during sequence alignments to account for insertions and deletions, methodological variations in the way that these characters are introduced and handled during pairwise genetic identity calculations can cause high degrees of inconsistency in the way that different methods classify the same sets of sequences. Here we present Sequence Demarcation Tool (SDT), a free user-friendly computer program that aims to provide a robust and highly reproducible means of objectively using pairwise genetic identity calculations to classify any set of nucleotide or amino acid sequences. SDT can produce publication quality pairwise identity plots and colour-coded distance matrices to further aid the classification of sequences according to ICTV approved taxonomic demarcation criteria. Besides a graphical interface version of the program for Windows computers, command-line versions of the program are available for a variety of different operating systems (including a parallel version for cluster computing platforms).
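The sensitivity of pairwise identity scores to gap handling can be seen in a small sketch. This is not SDT's algorithm: the function name and the `count_gaps` switch are hypothetical, and the input is assumed to be two sequences that have already been aligned to equal length.

```python
def pairwise_identity(a, b, gap="-", count_gaps=False):
    """Fraction of identical positions between two aligned sequences.

    count_gaps=False: columns with a gap in either sequence are ignored.
    count_gaps=True:  a gap opposite a residue counts as a mismatch.
    Columns where both sequences have a gap are always ignored.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for x, y in zip(a, b):
        if x == gap and y == gap:
            continue
        if x == gap or y == gap:
            if count_gaps:
                compared += 1
            continue
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0
```

For the aligned pair `ACGT-A` / `ACCTTA`, the two conventions give 0.8 and about 0.667 respectively, which is the kind of methodological inconsistency the abstract refers to.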
Shah, Nameeta; Couronne, Olivier; Pennacchio, Len A.; Brudno, Michael; Batzoglou, Serafim; Bethel, E. Wes; Rubin, Edward M.; Hamann, Bernd; Dubchak, Inna
We have developed Phylo-VISTA (Shah et al., 2003), an interactive software tool for analyzing multiple alignments by visualizing a similarity measure for DNA sequences of multiple species. The complexity of visual presentation is effectively organized using a framework based upon inter-species phylogenetic relationships. The phylogenetic organization supports rapid, user-guided inter-species comparison. To aid in navigation through large sequence datasets, Phylo-VISTA provides a user with the ability to select and view data at varying resolutions. Multi-resolution data visualization and analysis, combined with the phylogenetic framework for inter-species comparison, produce a highly flexible and powerful tool for visual data analysis of multiple sequence alignments.
Sakib, Muhammad Nazmus; Tang, Jijun; Zheng, W. Jim; Huang, Chin-Tser
Research in bioinformatics primarily involves collection and analysis of a large volume of genomic data. Naturally, it demands efficient storage and transfer of this huge amount of data. In recent years, some research has been done to find efficient compression algorithms to reduce the size of various sequencing data. One way to improve the transmission time of large files is to apply a maximum lossless compression on them. In this paper, we present SAMZIP, a specialized encoding scheme, for sequence alignment data in SAM (Sequence Alignment/Map) format, which improves the compression ratio of existing compression tools available. In order to achieve this, we exploit the prior knowledge of the file format and specifications. Our experimental results show that our encoding scheme improves compression ratio, thereby reducing overall transmission time significantly. PMID:22164252
Chiner-Oms, Alvaro; González-Candelas, Fernando
We present EvalMSA, a software tool for evaluating and detecting outliers in multiple sequence alignments (MSAs). This tool allows the identification of divergent sequences in MSAs by scoring the contribution of each row in the alignment to its quality using a sum-of-pair-based method and additional analyses. Our main goal is to provide users with objective data in order to make informed decisions about the relevance and/or pertinence of including/retaining a particular sequence in an MSA. EvalMSA is written in standard Perl and also uses some routines from the statistical language R. Therefore, it is necessary to install the R-base package in order to get full functionality. Binary packages are freely available from http://sourceforge.net/projects/evalmsa/ for Linux and Windows. PMID:27920488
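The idea of scoring each row's contribution with a sum-of-pairs measure can be sketched as follows. This is an illustration in Python rather than EvalMSA's Perl/R code, and the match, mismatch and gap values are arbitrary assumptions, not the tool's actual scoring scheme.

```python
def pair_score(x, y, match=1, mismatch=-1, gap=-2):
    """Score one aligned character pair (values are illustrative only)."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def row_contributions(alignment):
    """Sum-of-pairs score contributed by each row of the alignment.

    Every pair of rows is scored column by column, and the pair's
    score is credited to both rows; a divergent row ends up with a
    markedly lower total than the rest.
    """
    n = len(alignment)
    scores = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            s = sum(pair_score(x, y)
                    for x, y in zip(alignment[i], alignment[j]))
            scores[i] += s
            scores[j] += s
    return scores
```

With the rows `ACGT`, `ACGT` and `TTTT`, the third row receives the lowest total and would be flagged as the candidate outlier.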
Zhu, Xiangyuan; Li, Kenli; Salah, Ahmad
In this paper, we address the large-scale biological sequence alignment problem, which is in increasing demand in computational biology. We employ the data parallelism paradigm, which is suitable for handling large-scale processing on multi-core computers, to achieve a high degree of parallelism. Using this paradigm, we propose a general strategy that can be used to speed up any multiple sequence alignment method. We applied five different clustering algorithms in our strategy and implemented rigorous tests on an 8-core computer using four traditional benchmarks and artificially generated sequences. The results show that our multi-core-based implementations can achieve up to 151-fold improvements in execution time while losing 2.19% accuracy on average. The source code of the proposed strategy, together with the test sets used in our analysis, is available on request.
Gautheret, D; Lambert, A
We present here a new approach to the problem of defining RNA signatures and finding their occurrences in sequence databases. The proposed method is based on "secondary structure profiles". An RNA sequence alignment with secondary structure information is used as an input. Two types of weight matrices/profiles are constructed from this alignment: single strands are represented by a classical lod-scores profile while helical regions are represented by an extended "helical profile" comprising 16 lod-scores per position, one for each of the 16 possible base-pairs. Database searches are then conducted using a simultaneous search for helical profiles and dynamic programming alignment of single strand profiles. The algorithm has been implemented in a new program, ERPIN, that performs both profile construction and database search. Applications are presented for several RNA motifs. The automated use of sequence information in both single-stranded and helical regions yields better sensitivity/specificity ratios than descriptor-based programs. Furthermore, since the translation of alignments into profiles is straightforward with ERPIN, iterative searches can easily be conducted to enrich collections of homologous RNAs.
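A classical lod-score profile for a single-stranded region can be sketched as below. This is a generic log-odds construction under stated assumptions (uniform background, add-one pseudocounts, base-2 logarithms, hypothetical function names), not ERPIN's own code. A helical profile would follow the same pattern with 16 base-pair symbols per position instead of 4 bases.

```python
import math


def lod_profile(alignment, alphabet="ACGU", background=None, pseudocount=1.0):
    """Build a per-column log-odds profile from aligned sequences.

    Each entry is log2(observed frequency / background frequency),
    with a pseudocount added so that unseen bases get a finite score.
    """
    background = background or {a: 1 / len(alphabet) for a in alphabet}
    profile = []
    for column in zip(*alignment):
        counts = {a: pseudocount for a in alphabet}
        for c in column:
            if c in counts:
                counts[c] += 1
        total = sum(counts.values())
        profile.append({a: math.log2((counts[a] / total) / background[a])
                        for a in alphabet})
    return profile


def score(profile, seq):
    """Sum the lod-scores of seq against the profile, position by position."""
    return sum(col.get(c, 0.0) for col, c in zip(profile, seq))
```

A candidate that matches the conserved bases scores positive, while an unrelated sequence scores negative, which is what a database scan would threshold on.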
Zafalon, G. F. D.; Visotaky, J. M. V.; Amorim, A. R.; Valêncio, C. R.; Neves, L. A.; de Souza, R. C. G.; Machado, J. M.
Computational tools that assist genomic analyses are increasingly necessary because of the rapid growth in the amount of available data. Given the high computational cost of deterministic algorithms for sequence alignment, many works concentrate on developing heuristic approaches to multiple sequence alignment. However, selecting an approach that offers solutions with good biological significance and feasible execution time is a great challenge. This work therefore presents the parallelization of the processing steps of the MSA-GA tool, using the multithread paradigm in the execution of the COFFEE objective function. The standard objective function implemented in the tool is the Weighted Sum of Pairs (WSP), which produces some distortions in the final alignments when sets of sequences with low similarity are aligned. In previous studies we implemented the COFFEE objective function in the tool to smooth these distortions. Although the nature of the COFFEE objective function implies an increase in execution time, the approach contains points that can be executed in parallel. With the improvements implemented in this work, the execution time of the new approach is 24% faster than the sequential approach with COFFEE. Moreover, the multithreaded COFFEE approach is more efficient than WSP because, besides being slightly faster, it gives better biological results.
Webb-Robertson, Bobbie-Jo M.; Oehmen, Chris S.; Matzke, Melissa M.
Using biopolymer sequence comparison methods to identify evolutionarily related proteins is one of the most common tasks in bioinformatics. Recently, support vector machines (SVMs) utilizing statistical learning theory have been employed in the problem of remote homology detection and shown to outperform iterative profile methods such as PSI-BLAST. In this study we demonstrate that using a Bayesian alignment score, which accounts for the uncertainty of all possible alignments, in the SVM construction improves sensitivity compared to the traditional dynamic programming implementation.
Jabado, Omar J.; Palacios, Gustavo; Kapoor, Vishal; Hui, Jeffrey; Renwick, Neil; Zhai, Junhui; Briese, Thomas; Lipkin, W. Ian
Polymerase chain reaction (PCR) is widely applied in clinical and environmental microbiology. Primer design is key to the development of successful assays and is often performed manually by using multiple nucleic acid alignments. Few public software tools exist that allow comprehensive design of degenerate primers for large groups of related targets based on complex multiple sequence alignments. Here we present a method for designing such primers based on tree building followed by application of a set covering algorithm, and demonstrate its utility in compiling Multiplex PCR primer panels for detection and differentiation of viral pathogens. PMID:17135211
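The set covering step can be illustrated with the standard greedy heuristic, which repeatedly picks the candidate that covers the most still-uncovered targets. This is a sketch under assumptions: the abstract does not say which set covering algorithm the authors used, and the primer and target names below are hypothetical.

```python
def greedy_set_cover(targets, candidates):
    """Pick candidates (e.g. primers) until every target is covered.

    targets:    iterable of target identifiers.
    candidates: dict mapping candidate name -> set of targets it matches.
    Returns the chosen candidate names in selection order.
    """
    uncovered = set(targets)
    chosen = []
    while uncovered:
        # Candidate that covers the largest number of uncovered targets.
        best = max(candidates, key=lambda name: len(candidates[name] & uncovered))
        gained = candidates[best] & uncovered
        if not gained:
            raise ValueError("remaining targets cannot be covered")
        chosen.append(best)
        uncovered -= gained
    return chosen


# Hypothetical example: three candidate primers, four viral targets.
panel = greedy_set_cover(
    {"v1", "v2", "v3", "v4"},
    {"p1": {"v1", "v2", "v3"}, "p2": {"v3", "v4"}, "p3": {"v4"}},
)
```

The greedy choice is not guaranteed to give the smallest possible panel, but it is simple and gives a well-known logarithmic approximation bound.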
Evans, P A; Wareham, H T
Given the problem of mutation saturation in ancient molecular sequences, there is great interest in inferring phylogenies from higher-order types of molecular data that change more slowly, such as genomic organization and the secondary and tertiary structures of ribosomal RNA and proteins. In this paper, we define edit distances based on two representations of RNA secondary structure, arc annotation and hierarchical string annotation, and give algorithms for computing these distances on pairs of annotated sequences, aligning pairs of annotated sequences, and computing 3-median annotated sequences from triples of annotated sequences. The 3-median algorithms can be used as part of a well-known iterative heuristic for inferring phylogenies. All given algorithms are adapted from algorithms for computing longest common annotated subsequences of pairs of annotated sequences.
Gillette, Todd Aaron
Neuronal morphology is a key mediator of neuronal function, defining the profile of connectivity and shaping signal integration and propagation. Reconstructing neurite processes is technically challenging and thus data has historically been relatively sparse. Data collection and curation along with more efficient and reliable data production methods provide opportunities for the application of informatics to find new relationships and more effectively explore the field. This dissertation presents a method for aiding the development of data production as well as a novel representation and set of analyses for extracting morphological patterns. The DIADEM Challenge was organized for the purposes of determining the state of the art in automated neuronal reconstruction and what existing challenges remained. As one of the co-organizers of the Challenge, I developed the DIADEM metric, a tool designed to measure the effectiveness of automated reconstruction algorithms by comparing resulting reconstructions to expert-produced gold standards and identifying errors of various types. It has been used in the DIADEM Challenge and in the testing of several algorithms since. Further, this dissertation describes a topological sequence representation of neuronal trees amenable to various forms of sequence analysis, notably motif analysis, global pairwise alignment, clustering, and multiple sequence alignment. Motif analysis of neuronal arbors shows a large difference in bifurcation type proportions between axons and dendrites, but that relatively simple growth mechanisms account for most higher order motifs. Pairwise global alignment of topological sequences, modified from traditional sequence alignment to preserve tree relationships, enabled cluster analysis which displayed strong correspondence with known cell classes by cell type, species, and brain region. Multiple alignment of sequences in selected clusters enabled the extraction of conserved features, revealing mouse
Wallace, Iain M.; O'Sullivan, Orla; Higgins, Desmond G.; Notredame, Cedric
We introduce M-Coffee, a meta-method for assembling multiple sequence alignments (MSA) by combining the output of several individual methods into one single MSA. M-Coffee is an extension of T-Coffee and uses consistency to estimate a consensus alignment. We show that the procedure is robust to variations in the choice of constituent methods and reasonably tolerant to duplicate MSAs. We also show that performance can be improved by carefully selecting the constituent methods. M-Coffee outperforms all the individual methods on three major reference datasets: HOMSTRAD, Prefab and Balibase. We also show that on a case-by-case basis, M-Coffee is twice as likely to deliver the best alignment as any individual method. Given a collection of pre-computed MSAs, M-Coffee has similar CPU requirements to the original T-Coffee. M-Coffee is a freeware open-source package available from . PMID:16556910
Kumar, Rajnish; Mishra, Bharat Kumar; Lahiri, Tapobrata; Kumar, Gautam; Kumar, Nilesh; Gupta, Rahul; Pal, Manoj Kumar
Online retrieval of homologous nucleotide sequences from a given database through existing alignment techniques is common practice. These techniques depend on local alignment and scoring matrices, whose reliability is limited by computational complexity and accuracy. To address this, this work offers a novel numerical representation of genes that can further help divide the data space into smaller partitions, supporting the formation of a search tree. In this context, this paper introduces a 36-dimensional Periodicity Count Value (PCV), which is representative of a particular nucleotide sequence and was created by adapting the stochastic model concept of Kolekar et al. (American Institute of Physics 1298:307-312, 2010. doi: 10.1063/1.3516320). The PCV construct uses information on the physicochemical properties of nucleotides and their positional distribution pattern within a gene. The PCV representation of a gene is observed to reduce the computational cost of calculating distances between a pair of genes while remaining consistent with existing methods. The validity of the PCV-based method was further tested by using it to construct molecular phylogenies and comparing these with phylogenies built using existing sequence alignment methods.
Schreiber, Fabian; Wörheide, Gert; Morgenstern, Burkhard
In the absence of whole genome sequences for many organisms, the use of expressed sequence tags (EST) offers an affordable approach for researchers conducting phylogenetic analyses to gain insight about the evolutionary history of organisms. Reliable alignments for phylogenomic analyses are based on orthologous gene sequences from different taxa. So far, researchers have not sufficiently tackled the problem of the completely automated construction of such datasets. Existing software tools are either semi-automated, covering only part of the necessary data processing, or implemented as a pipeline, requiring the installation and configuration of a cascade of external tools, which may be time-consuming and hard to manage. To simplify data set construction for phylogenomic studies, we set up a web server that uses our recently developed OrthoSelect approach. To the best of our knowledge, our web server is the first web-based EST analysis pipeline that allows the detection of orthologous gene sequences in EST libraries and outputs orthologous gene alignments. Additionally, OrthoSelect provides the user with an extensive results section that lists and visualizes all important results, such as annotations, data matrices for each gene/taxon and orthologous gene alignments. The web server is available at http://orthoselect.gobics.de.
Jarasch, Alexander; Kopp, Melanie; Eggenstein, Evelyn; Richter, Antonia; Gebauer, Michaela; Skerra, Arne
ANTIC ALIGN is interactive software developed to simultaneously visualize, analyze and modify alignments of DNA and/or protein sequences that arise during combinatorial protein engineering, design and selection. ANTIC ALIGN combines powerful functions known from currently available sequence analysis tools with unique features for protein engineering, in particular the possibility to display and manipulate nucleotide sequences and their translated amino acid sequences at the same time. ANTIC ALIGN offers both template-based multiple sequence alignment (MSA), using the unmutated protein as reference, and conventional global alignment, to compare sequences that share an evolutionary relationship. The application of similarity-based clustering algorithms facilitates the identification of duplicates or of conserved sequence features among a set of selected clones. Imported nucleotide sequences from DNA sequence analysis are automatically translated into the corresponding amino acid sequences and displayed, offering numerous options for selecting reading frames, highlighting of sequence features and graphical layout of the MSA. The MSA complexity can be reduced by hiding the conserved nucleotide and/or amino acid residues, thus putting emphasis on the relevant mutated positions. ANTIC ALIGN is also able to handle suppressed stop codons or even to incorporate non-natural amino acids into a coding sequence. We demonstrate crucial functions of ANTIC ALIGN in an example of Anticalins selected from a lipocalin random library against the fibronectin extradomain B (ED-B), an established marker of tumor vasculature. Apart from engineered protein scaffolds, ANTIC ALIGN provides a powerful tool in the area of antibody engineering and for directed enzyme evolution.
Armstead, Ian; Huang, Lin; King, Julie; Ougham, Helen; Thomas, Howard; King, Ian
Background: Various methods have been developed to explore inter-genomic relationships among plant species. Here, we present a sequence similarity analysis based upon comparison of transcript-assembly and methylation-filtered databases from five plant species and physically anchored rice coding sequences. Results: A comparison of the frequency of sequence alignments, determined by MegaBLAST, between rice coding sequences in TIGR pseudomolecules and annotations vs 4.0 and comprehensive transcript-assembly and methylation-filtered databases from Lolium perenne (ryegrass), Zea mays (maize), Hordeum vulgare (barley), Glycine max (soybean) and Arabidopsis thaliana (thale cress) was undertaken. Each rice pseudomolecule was divided into 10 segments, each containing 10% of the functionally annotated, expressed genes. This indicated a correlation between relative segment position in the rice genome and numbers of alignments with all the queried monocot and dicot plant databases. Colour-coded moving windows of 100 functionally annotated, expressed genes along each pseudomolecule were used to generate 'heat-maps'. These revealed consistent intra- and inter-pseudomolecule variation in the relative concentrations of significant alignments with the tested plant databases. Analysis of the annotations and derived putative expression patterns of rice genes from 'hot-spots' and 'cold-spots' within the heat maps indicated possible functional differences. A similar comparison relating to ancestral duplications of the rice genome indicated that duplications were often associated with 'hot-spots'. Conclusion: Physical positions of expressed genes in the rice genome are correlated with the degree of conservation of similar sequences in the transcriptomes of other plant species. This relative conservation is associated with the distribution of different sized gene families and segmentally duplicated loci and may have functional and evolutionary implications. PMID:17708759
Chatterjee, Aniruddha; Stockwell, Peter A.; Rodger, Euan J.; Morison, Ian M.
Recent advances in next generation sequencing (NGS) technology now provide the opportunity to rapidly interrogate the methylation status of the genome. However, there are challenges in handling and interpretation of the methylation sequence data because of its large volume and the consequences of bisulphite modification. We sequenced reduced representation human genomes on the Illumina platform and efficiently mapped and visualized the data with different pipelines and software packages. We examined three pipelines for aligning bisulphite converted sequencing reads and compared their performance. We also comment on pre-processing and quality control of Illumina data. This comparison highlights differences in methods for NGS data processing and provides guidance to advance sequence-based methylation data analysis for molecular biologists. PMID:22344695
Md Mukarram Hossain, A S; Blackburne, Benjamin P; Shah, Abhijeet; Whelan, Simon
Evolutionary studies usually use a two-step process to investigate sequence data. Step one estimates a multiple sequence alignment (MSA) and step two applies phylogenetic methods to ask evolutionary questions of that MSA. Modern phylogenetic methods infer evolutionary parameters using maximum likelihood or Bayesian inference, mediated by a probabilistic substitution model that describes sequence change over a tree. The statistical properties of these methods mean that more data directly translates to an increased confidence in downstream results, providing the substitution model is adequate and the MSA is correct. Many studies have investigated the robustness of phylogenetic methods in the presence of substitution model misspecification, but few have examined the statistical properties of those methods when the MSA is unknown. This simulation study examines the statistical properties of the complete two-step process when inferring sequence divergence and the phylogenetic tree topology. Both nucleotide and amino acid analyses are negatively affected by the alignment step, both through inaccurate guide tree estimates and through overfitting to that guide tree. For many alignment tools these effects become more pronounced when additional sequences are added to the analysis. Nucleotide sequences are particularly susceptible, with MSA errors leading to statistical support for long-branch attraction artifacts, which are usually associated with gross substitution model misspecification. Amino acid MSAs are more robust, but do tend to arbitrarily resolve multifurcations in favor of the guide tree. No inference strategies produce consistently accurate estimates of divergence between sequences, although amino acid MSAs are again more accurate than their nucleotide counterparts. We conclude with some practical suggestions about how to limit the effect of MSA uncertainty on evolutionary inference.
Davis, J P; Janjić, N; Pribnow, D; Zichi, D A
We present a computer-aided approach for identifying and aligning consensus secondary structure within a set of functionally related oligonucleotide sequences aligned by sequence. The method relies on visualization of secondary structure using a generalization of the dot matrix representation appropriate for consensus sequence data sets. An interactive computer program implementing such a visualization of consensus structure has been developed. The program allows for alignment editing, data and display filtering and various modes of base pair representation, including co-variation. The utility of this approach is demonstrated with four sample data sets derived from in vitro selection experiments and one data set comprising tRNA sequences. PMID:7501472
Mielczarek, M; Szyda, J
Application of the massive parallel sequencing technology has become one of the most important issues in life sciences. Therefore, it was crucial to develop bioinformatics tools for next-generation sequencing (NGS) data processing. Currently, two of the most significant tasks include alignment to a reference genome and detection of single nucleotide polymorphisms (SNPs). In many types of genomic analyses, great numbers of reads need to be mapped to the reference genome; therefore, selection of the aligner is an essential step in NGS pipelines. Two main algorithms (suffix tries and hash tables) have been introduced for this purpose. Suffix array-based aligners are memory-efficient and work faster than hash-based aligners, but they are less accurate. In contrast, hash table algorithms tend to be slower, but more sensitive. SNP and genotype callers may also be divided into two main different approaches: heuristic and probabilistic methods. A variety of software has been subsequently developed over the past several years. In this paper, we briefly review the current development of NGS data processing algorithms and present the available software.
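The hash-table approach to read mapping can be sketched as a seed-and-verify procedure: index every k-mer of the reference, look up the read's k-mers to find candidate positions, then verify each candidate by counting mismatches. This is a toy illustration with hypothetical function names and parameters, not the algorithm of any specific aligner covered by the review.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read(read, reference, index, k, max_mismatches=1):
    """Return sorted reference positions where the read aligns ungapped.

    Each k-mer of the read is used as a seed; a seed hit at reference
    position pos and read offset off implies a candidate start pos - off,
    which is then verified by counting mismatches over the full read.
    """
    hits = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            start = pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(1 for a, b in zip(read, window) if a != b)
            if mismatches <= max_mismatches:
                hits.add(start)
    return sorted(hits)
```

Shorter seeds find more candidates and so raise sensitivity, at the cost of more verification work, which is one reason hash-based aligners are described above as slower but more sensitive.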
Hickey, Glenn; Blanchette, Mathieu
Probabilistic approaches for sequence alignment are usually based on pair Hidden Markov Models (HMMs) or Stochastic Context Free Grammars (SCFGs). Recent studies have shown a significant correlation between the content of short indels and their flanking regions, which by definition cannot be modelled by the above two approaches. In this work, we present a context-sensitive indel model based on a pair Tree-Adjoining Grammar (TAG), along with accompanying algorithms for efficient alignment and parameter estimation. The increased precision and statistical power of this model is shown on simulated and real genomic data. As the cost of sequencing plummets, the usefulness of comparative analysis is becoming limited by alignment accuracy rather than data availability. Our results will therefore have an impact on any type of downstream comparative genomics analyses that rely on alignments. Fine-grained studies of small functional regions or disease markers, for example, could be significantly improved by our method. The implementation is available at http://www.mcb.mcgill.ca/~blanchem/software.html
Ovacik, Meric A.; Androulakis, Ioannis P.
Pathway-based information has become an important source of information for both establishing evolutionary relationships and understanding the mode of action of a chemical or pharmaceutical among species. Cross-species comparison of pathways can address two broad questions: comparison in order to inform evolutionary relationships and to extrapolate species differences used in a number of different applications including drug and toxicity testing. Cross-species comparison of metabolic pathways is complex as there are multiple features of a pathway that can be modeled and compared. Among the various methods that have been proposed, reaction alignment has emerged as the most successful at predicting phylogenetic relationships based on NCBI taxonomy. We propose an improvement of the reaction alignment method by accounting for sequence similarity in addition to reaction alignment method. Using nine species, including human and some model organisms and test species, we evaluate the standard and improved comparison methods by analyzing glycolysis and citrate cycle pathways conservation. In addition, we demonstrate how organism comparison can be conducted by accounting for the cumulative information retrieved from nine pathways in central metabolism as well as a more complete study involving 36 pathways common in all nine species. Our results indicate that reaction alignment with enzyme sequence similarity results in a more accurate representation of pathway specific cross-species similarities and differences based on NCBI taxonomy.
Yen, Ian E. H.; Lin, Xin; Zhang, Jiong; Ravikumar, Pradeep; Dhillon, Inderjit S.
Multiple Sequence Alignment and Motif Discovery, known as NP-hard problems, are two fundamental tasks in Bioinformatics. Existing approaches to these two problems are based on either local search methods such as Expectation Maximization (EM), Gibbs Sampling or greedy heuristic methods. In this work, we develop a convex relaxation approach to both problems based on the recent concept of atomic norm and develop a new algorithm, termed Greedy Direction Method of Multiplier, for solving the convex relaxation with two convex atomic constraints. Experiments show that our convex relaxation approach produces solutions of higher quality than those standard tools widely-used in Bioinformatics community on the Multiple Sequence Alignment and Motif Discovery problems. PMID:27559428
Chu, Wei; Ghahramani, Zoubin; Podtelezhnikov, Alexei; Wild, David L
In this paper, we develop a segmental semi-Markov model (SSMM) for protein secondary structure prediction which incorporates multiple sequence alignment profiles with the purpose of improving the predictive performance. The segmental model is a generalization of the hidden Markov model where a hidden state generates segments of various length and secondary structure type. A novel parameterized model is proposed for the likelihood function that explicitly represents multiple sequence alignment profiles to capture the segmental conformation. Numerical results on benchmark data sets show that incorporating the profiles results in substantial improvements and the generalization performance is promising. By incorporating the information from long range interactions in beta-sheets, this model is also capable of carrying out inference on contact maps. This is an important advantage of probabilistic generative models over the traditional discriminative approach to protein secondary structure prediction. The Web server of our algorithm and supplementary materials are available at http://public.kgi.edu/-wild/bsm.html.
Hara, Toshihide; Sato, Keiko; Ohya, Masanori
Sequence alignment is one of the most important techniques for analyzing biological systems. Existing alignment methods are nevertheless imperfect, and more accurate ones are needed. In particular, alignment of homologous sequences with low sequence similarity remains unsatisfactory. Most recent methods for aligning protein sequences use an empirically determined measure, usually defined as a combination of two quantities: (1) the sum of substitution scores between two residue segments and (2) the sum of gap penalties in insertion/deletion regions. Such a measure assumes that there is no intersite correlation in the sequences. In this paper, we improve the alignment by taking the correlation of consecutive residues into account. We introduce a new alignment method, called MTRAP, based on a metric defined on compound systems of two sequences. In benchmark tests on PREFAB 4.0 and HOMSTRAD, our pairwise alignment method gives higher accuracy than other methods such as ClustalW2, TCoffee and MAFFT. For sequences with sequence identity below 15% in particular, our method improves the alignment accuracy significantly. Moreover, we show that our algorithm works well together with consistency-based progressive multiple alignment by modifying TCoffee to use our measure. Overall, our method leads to a significant increase in alignment accuracy compared with other methods, and the improvement is especially clear in the low-identity range. The source code is available at our web page, whose address is found in the section "Availability and requirements".
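The conventional measure that the abstract contrasts itself with can be sketched in a few lines of Python. This is not the MTRAP metric; it only illustrates the usual column-by-column score, where each aligned pair is scored independently of its neighbours. The identity-based substitution function and the gap penalty values are hypothetical placeholders.

```python
def alignment_score(row_a, row_b, substitution, gap_open=-10.0, gap_extend=-0.5):
    """Score a pairwise alignment as (1) the sum of substitution scores over
    aligned residue pairs plus (2) affine gap penalties over indel regions."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    score = 0.0
    prev_gap = None  # which row held a gap in the previous column, if any
    for x, y in zip(row_a, row_b):
        if x == "-":
            cur_gap = "a"
        elif y == "-":
            cur_gap = "b"
        else:
            cur_gap = None
        if cur_gap is None:
            score += substitution(x, y)      # depends on this column only
        elif cur_gap == prev_gap:
            score += gap_extend              # continuing an open gap
        else:
            score += gap_open                # starting a new gap
        prev_gap = cur_gap
    return score

def toy_substitution(x, y):
    """Hypothetical stand-in for a real substitution matrix."""
    return 4.0 if x == y else -1.0

example = alignment_score("MKT--LV", "MKSAGLV", toy_substitution)
print(example)  # 4.5
```

Because every substitution term looks at a single column, correlations between consecutive residues cannot influence the score, which is the assumption the abstract sets out to relax.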
Miller, Robert T.; Christoffels, Alan G.; Gopalakrishnan, Chella; Burke, John; Ptitsyn, Andrey A.; Broveak, Tania R.; Hide, Winston A.
The expressed human genome is being sequenced and analyzed by disparate groups producing disparate data. The majority of the identified coding portion is in the form of expressed sequence tags (ESTs). The need to discover exonic representation and expression forms of full-length cDNAs for each human gene is frustrated by the partial and variable quality nature of this data delivery. A highly redundant human EST data set has been processed into integrated and unified expressed transcript indices that consist of hierarchically organized human transcript consensi reflecting gene expression forms and genetic polymorphism within an index class. The expression index and its intermediate outputs include cleaned transcript sequence, expression, and alignment information and a higher fidelity subset, SANIGENE. The STACK_PACK clustering system has been applied to dbEST release 121598 (GenBank version 110). Sixty-four percent of 1,313,103 Homo sapiens ESTs are condensed into 143,885 tissue level multiple sequence clusters; linking through clone-ID annotations produces 68,701 total assemblies, such that 81% of the original input set is captured in a STACK multiple sequence or linked cluster. Indexing of alignments by substituent EST accession allows browsing of the data structure and its cross-links to UniGene. STACK metaclusters consolidate a greater number of ESTs by a factor of 1.86 with respect to the corresponding UniGene build. Fidelity comparison with genome reference sequence AC004106 demonstrates consensus expression clusters that reflect significantly lower spurious repeat sequence content and capture alternate splicing within a whole body index cluster and three STACK v.2.3 tissue-level clusters. Statistics of a staggered release whole body index build of STACK v.2.0 are presented. PMID:10568754
Roozgard, Aminmohammad; Barzigar, Nafise; Wang, Shuang; Jiang, Xiaoqian; Cheng, Samuel
The advance in human genome sequencing technology has significantly reduced the cost of data generation and overwhelms the computing capability of sequence analysis. Efficiency, efficacy, and scalability remain challenging in sequence alignment, which is an important and foundational operation for genome data analysis. In this paper, we propose a two-stage approach to tackle this problem. In the preprocessing step, we match blocks of reference and target sequences based on the similarities between their empirical transition probability distributions using belief propagation. We then conduct a refined match using our recently published sparse-coding belief propagation (SCoBeP) technique. Our experimental results demonstrated robustness in nucleotide sequence alignment, and our results are competitive to those of the SOAP aligner and the BWA algorithm. Moreover, compared to SCoBeP alignment, the proposed technique can handle sequences of much longer lengths. PMID:25983537
Taheri, Javid; Zomaya, Albert Y
Background: Multiple Sequence Alignment (MSA) has always been an active area of research in Bioinformatics. MSA is mainly focused on discovering biologically meaningful relationships among different sequences or proteins in order to investigate the underlying main characteristics/functions. This information is also used to generate phylogenetic trees. Results: This paper presents a novel approach, namely RBT-GA, to solve the MSA problem using a hybrid solution methodology combining the Rubber Band Technique (RBT) and the Genetic Algorithm (GA) metaheuristic. RBT is inspired by the behavior of an elastic Rubber Band (RB) on a plate with several poles, which are analogous to locations in the input sequences that could potentially be biologically related. A GA attempts to mimic the evolutionary processes of life in order to locate optimal solutions in an often very complex landscape. RBT-GA is a population-based optimization algorithm designed to find the optimal alignment for a set of input protein sequences. In this novel technique, each alignment answer is modeled as a chromosome consisting of several poles in the RBT framework. These poles resemble locations in the input sequences that are most likely to be correlated and/or biologically related. A GA-based optimization process improves these chromosomes gradually, yielding a set of mostly optimal answers for the MSA problem. Conclusion: RBT-GA is tested with one of the well-known benchmark suites in this area (BALiBASE 2.0). The obtained results show the superiority of the proposed technique even in the case of formidable sequences. PMID:19594869
Hahn, Lars; Leimeister, Chris-André; Morgenstern, Burkhard
Many algorithms for sequence analysis rely on word matching or word statistics. Often, these approaches can be improved if binary patterns representing match and don’t-care positions are used as a filter, such that only those positions of words are considered that correspond to the match positions of the patterns. The performance of these approaches, however, depends on the underlying patterns. Herein, we show that the overlap complexity of a pattern set that was introduced by Ilie and Ilie is closely related to the variance of the number of matches between two evolutionarily related sequences with respect to this pattern set. We propose a modified hill-climbing algorithm to optimize pattern sets for database searching, read mapping and alignment-free sequence comparison of nucleic-acid sequences; our implementation of this algorithm is called rasbhari. Depending on the application at hand, rasbhari can either minimize the overlap complexity of pattern sets, maximize their sensitivity in database searching or minimize the variance of the number of pattern-based matches in alignment-free sequence comparison. We show that, for database searching, rasbhari generates pattern sets with slightly higher sensitivity than existing approaches. In our Spaced Words approach to alignment-free sequence comparison, pattern sets calculated with rasbhari led to more accurate estimates of phylogenetic distances than the randomly generated pattern sets that we previously used. Finally, we used rasbhari to generate patterns for short read classification with CLARK-S. Here too, the sensitivity of the results could be improved, compared to the default patterns of the program. We integrated rasbhari into Spaced Words; the source code of rasbhari is freely available at http://rasbhari.gobics.de/ PMID:27760124
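The role of match and don't-care positions can be illustrated with a short Python sketch. It shows only how a single binary pattern filters word matches between two sequences; it is not the rasbhari optimization algorithm, and the function names and toy sequences are assumptions made for the example.

```python
from collections import Counter

def spaced_word(seq, start, pattern):
    """Characters of seq under the match ('1') positions of the pattern,
    with the pattern placed at offset `start`; '0' positions are ignored."""
    return "".join(seq[start + k] for k, p in enumerate(pattern) if p == "1")

def spaced_word_matches(s1, s2, pattern):
    """Number of position pairs (i, j) at which the two sequences agree
    on every match position of the pattern."""
    n = len(pattern)
    words1 = Counter(spaced_word(s1, i, pattern) for i in range(len(s1) - n + 1))
    return sum(words1[spaced_word(s2, j, pattern)] for j in range(len(s2) - n + 1))

# The two toy sequences differ at their third position only.
s1, s2 = "ACGTAC", "ACTTAC"
with_dont_care = spaced_word_matches(s1, s2, "1101")  # third position ignored
contiguous = spaced_word_matches(s1, s2, "1111")      # ordinary 4-mer matching
print(with_dont_care, contiguous)  # 1 0
```

The pattern with a don't-care position still finds a match across the substitution, while the contiguous word of the same length finds none. How the match positions of several patterns overlap is what the pattern-set optimization described in the abstract addresses.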
Ndhlovu, Andrew; Hazelhurst, Scott; Durand, Pierre M
Selective pressures at the DNA level shape genes into profiles consisting of patterns of rapidly evolving sites and sites withstanding change. These profiles remain detectable even when protein sequences become extensively diverged. A common task in molecular biology is to infer functional, structural or evolutionary relationships by querying a database using an algorithm. However, problems arise when sequence similarity is low. This study presents an algorithm that uses the evolutionary rate at codon sites, the dN/dS (ω) parameter, coupled to a substitution matrix as an alignment metric for detecting distantly related proteins. The algorithm, called BLOSUM-FIRE, couples a newer and improved version of the original FIRE (Functional Inference using Rates of Evolution) algorithm with an amino acid substitution matrix in a dynamic scoring function. The enigmatic hepatitis B virus X protein was used as a test case for BLOSUM-FIRE and its associated database EvoDB. The evolutionary rate based approach was coupled with a conventional BLOSUM substitution matrix. The two approaches are combined in a dynamic scoring function, which uses the selective pressure to score aligned residues. The dynamic scoring function is based on a coupled additive approach that scores aligned sites based on the level of conservation inferred from the ω values. Evaluation of the accuracy of this new implementation, BLOSUM-FIRE, using MAFFT alignments as reference alignments has shown that it is more accurate than its predecessor FIRE. Comparison of the alignment quality with widely used algorithms (MUSCLE, T-COFFEE, and CLUSTAL Omega) revealed that the BLOSUM-FIRE algorithm performs as well as conventional algorithms. Its main strength lies in that it provides greater potential for aligning divergent sequences and addresses the problem of low specificity inherent in the original FIRE algorithm. The utility of this algorithm is demonstrated using the Hepatitis B virus X (HBx) protein, a protein
Tatusov, R L; Altschul, S F; Koonin, E V
We describe an approach to analyzing protein sequence databases that, starting from a single uncharacterized sequence or group of related sequences, generates blocks of conserved segments. The procedure involves iterative database scans with an evolving position-dependent weight matrix constructed from a coevolving set of aligned conserved segments. For each iteration, the expected distribution of matrix scores under a random model is used to set a cutoff score for the inclusion of a segment in the next iteration. This cutoff may be calculated to allow the chance inclusion of either a fixed number or a fixed proportion of false positive segments. With sufficiently high cutoff scores, the procedure converged for all alignment blocks studied, with varying numbers of iterations required. Different methods for calculating weight matrices from alignment blocks were compared. The most effective of those tested was a logarithm-of-odds, Bayesian-based approach that used prior residue probabilities calculated from a mixture of Dirichlet distributions. The procedure described was used to detect novel conserved motifs of potential biological importance. PMID:7991589
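One scan step of such a procedure might look like the Python sketch below. It builds a log-odds position weight matrix from an alignment block and keeps the segments that score above a cutoff. Two simplifications are assumed and are not the authors' method: a flat pseudocount stands in for the Dirichlet-mixture prior, and the cutoff is supplied by hand rather than derived from the score distribution under a random model. A four-letter alphabet is used only to keep the example small.

```python
import math

def log_odds_matrix(block, alphabet, background, pseudocount=1.0):
    """One dict per alignment column: residue -> log2(p_column / p_background).
    The flat pseudocount is a simple placeholder for a Dirichlet-mixture prior."""
    matrix = []
    for column in zip(*block):
        total = len(column) + pseudocount * len(alphabet)
        matrix.append({
            a: math.log2(((column.count(a) + pseudocount) / total) / background[a])
            for a in alphabet
        })
    return matrix

def score_segment(matrix, segment):
    """Sum of per-position log-odds scores for a segment of matrix width."""
    return sum(column[res] for column, res in zip(matrix, segment))

def scan(matrix, sequence, cutoff):
    """(start, score) for every window scoring at least `cutoff`."""
    width = len(matrix)
    results = []
    for i in range(len(sequence) - width + 1):
        s = score_segment(matrix, sequence[i:i + width])
        if s >= cutoff:
            results.append((i, s))
    return results

# Toy block (hypothetical data) over a 4-letter alphabet, uniform background.
alphabet = "ACGT"
background = {a: 0.25 for a in alphabet}
block = ["ACG", "ACG", "ACT", "AAG"]
matrix = log_odds_matrix(block, alphabet, background)
hits = scan(matrix, "TTACGTT", cutoff=3.0)
print([start for start, _ in hits])  # [2]
```

In the iterative scheme the abstract describes, segments passing the cutoff would be added to the block and the matrix rebuilt, repeating until no new segments are included.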
Sheetlin, Sergey; Park, Yonil; Spouge, John L.
Sequence alignment is an indispensable computational tool in modern molecular biology. The model underlying biological sequence alignment is of interest to physicists because it approximates the statistical mechanics of DNA and protein annealing, while bearing an intimate relationship to models of directed polymers in random media. Recent methods for determining the statistics of random sequence alignments have reduced the computation time to less than 1 s, opening up some interesting possibilities for online computation with biological search engines. Before implementation, however, the methods required an objective technique for computing regression coefficients pertinent to an asymptotic regime. Typically, physicists estimate parameters pertinent to an asymptotic regime subjectively: They eyeball their data; estimate the asymptotic regime where the regression model holds with reasonable accuracy; and then regress data only within the estimated asymptotic regime. Our publicly available computer program arrp replaces the subjective assessment of the asymptotic regime with an objective change-point detection method, increasing confidence in the scientific objectivity of the parameter estimates. Asymptotic regression has potential applications across most of physics.
Bond, Stephen R; Keat, Karl E; Barreira, Sofia N; Baxevanis, Andreas D
The ability to manipulate sequence, alignment, and phylogenetic tree files has become an increasingly important skill in the life sciences, whether to generate summary information or to prepare data for further downstream analysis. The command line can be an extremely powerful environment for interacting with these resources, but only if the user has the appropriate general-purpose tools on hand. BuddySuite is a collection of four independent yet interrelated command-line toolkits that facilitate each step in the workflow of sequence discovery, curation, alignment, and phylogenetic reconstruction. Most common sequence, alignment, and tree file formats are automatically detected and parsed, and over 100 tools have been implemented for manipulating these data. The project has been engineered to easily accommodate the addition of new tools, it is written in the popular programming language Python, and is hosted on the Python Package Index and GitHub to maximize accessibility. Documentation for each BuddySuite tool, including usage examples, is available at http://tiny.cc/buddysuite wiki. All software is open source and freely available through http://research.nhgri.nih.gov/software/BuddySuite.
Loving, Joshua; Hernandez, Yozen; Benson, Gary
Motivation: Mapping of high-throughput sequencing data and other bulk sequence comparison applications have motivated a search for high-efficiency sequence alignment algorithms. The bit-parallel approach represents individual cells in an alignment scoring matrix as bits in computer words and emulates the calculation of scores by a series of logic operations composed of AND, OR, XOR, complement, shift and addition. Bit-parallelism has been successfully applied to the longest common subsequence (LCS) and edit-distance problems, producing fast algorithms in practice. Results: We have developed BitPAl, a bit-parallel algorithm for general, integer-scoring global alignment. Integer-scoring schemes assign integer weights for match, mismatch and insertion/deletion. The BitPAl method uses structural properties in the relationship between adjacent scores in the scoring matrix to construct classes of efficient algorithms, each designed for a particular set of weights. In timed tests, we show that BitPAl runs 7–25 times faster than a standard iterative algorithm. Availability and implementation: Source code is freely available for download at http://lobstah.bu.edu/BitPAl/BitPAl.html. BitPAl is implemented in C and runs on all major operating systems. Supplementary information: Supplementary data are available at Bioinformatics online. PMID:25075119
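The bit-parallel idea is easiest to see on the LCS problem mentioned in the motivation. The Python sketch below uses a well-known bit-vector recurrence for LCS length and checks it against a plain dynamic-programming version. It is not BitPAl, which handles general integer scoring; Python's arbitrary-precision integers stand in for machine words, and the function names are chosen for the example.

```python
def lcs_length_bitparallel(a, b):
    """LCS length using one bit per position of `a`; a whole column of the
    scoring matrix is updated with a few word-level operations."""
    m = len(a)
    mask = (1 << m) - 1
    match = {}                       # character -> bit mask of its positions in a
    for i, ch in enumerate(a):
        match[ch] = match.get(ch, 0) | (1 << i)
    v = mask                         # all ones: no matches recorded yet
    for ch in b:
        mc = match.get(ch, 0)
        u = v & mc
        # Addition propagates carries through runs of ones; together with
        # AND, OR and complement this updates every cell of the column.
        v = ((v + u) | (v & ~mc)) & mask
    return m - bin(v).count("1")     # each zero bit marks one LCS increment

def lcs_length_dp(a, b):
    """Standard iterative algorithm, one cell at a time, for comparison."""
    prev = [0] * (len(a) + 1)
    for ch in b:
        cur = [0]
        for i, x in enumerate(a, start=1):
            cur.append(prev[i - 1] + 1 if x == ch else max(prev[i], cur[i - 1]))
        prev = cur
    return prev[-1]

print(lcs_length_bitparallel("ACGT", "AGT"))   # 3
print(lcs_length_bitparallel("AGCAT", "GAC"))  # 2
```

The inner loop of the plain version touches every cell, whereas the bit-vector version does a constant number of word operations per character of `b` when `a` fits in a machine word, which is where the speedup of this class of methods comes from.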
Lee, Marianne M; Bundschuh, Ralf; Chan, Michael K
A new machine learning algorithm, LESTAT (LEngth and STructure-based sequence Alignment Tool) has been developed for detecting protein homologs having low-sequence identity. LESTAT is an iterative profile-based method that runs without reliance on a predefined library and incorporates several novel features that enhance its ability to identify remote sequences. To overcome the inherent bias associated with a single starting model, LESTAT utilizes three structural homologs to create a profile consisting of structurally conserved positions and block separation distances. Subsequent profiles are refined iteratively using sequence information obtained from previous cycles. Additionally, the refinement process incorporates a "lock-in" feature to retain the high-scoring sequences involved in previous alignments for subsequent model building and an enhancement factor to complement the weighting scheme used to build the position specific scoring matrix. A comparison of the performance of LESTAT against PSI-BLAST for seven systems reveals that LESTAT exhibits increased sensitivity and specificity over PSI-BLAST in six of these systems, based on the number of true homologs detected and the number of families these homologs covered. Notably, many of the hits identified are unique to each method, presumably resulting from the distinct differences in the two approaches. Taken together, these findings suggest that LESTAT is a useful complementary method to PSI-BLAST in the detection of distant homologs.
Borrayo, Ernesto; Mendizabal-Ruiz, E. Gerardo; Vélez-Pérez, Hugo; Romo-Vázquez, Rebeca; Mendizabal, Adriana P.; Morales, J. Alejandro
Genomic signal processing (GSP) refers to the use of digital signal processing (DSP) tools for analyzing genomic data such as DNA sequences. A possible application of GSP that has not been fully explored is the computation of the distance between a pair of sequences. In this work we present GAFD, a novel GSP alignment-free distance computation method. We introduce a DNA sequence-to-signal mapping function based on the employment of doublet values, which increases the number of possible amplitude values for the generated signal. Additionally, we explore the use of three DSP distance metrics as descriptors for categorizing DNA signal fragments. Our results indicate the feasibility of employing GAFD for computing sequence distances and the use of descriptors for characterizing DNA fragments. PMID:25393409
Manavski, Svetlin A; Valle, Giorgio
Searching for similarities in protein and DNA databases has become a routine procedure in Molecular Biology. The Smith-Waterman algorithm has been available for more than 25 years. It is based on a dynamic programming approach that explores all the possible alignments between two sequences; as a result it returns the optimal local alignment. Unfortunately, the computational cost is very high, requiring a number of operations proportional to the product of the length of two sequences. Furthermore, the exponential growth of protein and DNA databases makes the Smith-Waterman algorithm unrealistic for searching similarities in large sets of sequences. For these reasons heuristic approaches such as those implemented in FASTA and BLAST tend to be preferred, allowing faster execution times at the cost of reduced sensitivity. The main motivation of our work is to exploit the huge computational power of commonly available graphic cards, to develop high performance solutions for sequence alignment. In this paper we present what we believe is the fastest solution of the exact Smith-Waterman algorithm running on commodity hardware. It is implemented in the recently released CUDA programming environment by NVidia. CUDA allows direct access to the hardware primitives of the last-generation Graphics Processing Units (GPU) G80. Speeds of more than 3.5 GCUPS (Giga Cell Updates Per Second) are achieved on a workstation running two GeForce 8800 GTX. Exhaustive tests have been done to compare our implementation to SSEARCH and BLAST, running on a 3 GHz Intel Pentium IV processor. Our solution was also compared to a recently published GPU implementation and to a Single Instruction Multiple Data (SIMD) solution. These tests show that our implementation performs from 2 to 30 times faster than any other previous attempt available on commodity hardware. The results show that graphic cards are now sufficiently advanced to be used as efficient hardware accelerators for sequence alignment
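For reference, the dynamic programming recurrence behind Smith-Waterman can be written compactly in Python. This is a plain CPU sketch with a linear gap penalty, not the CUDA implementation the abstract reports; the scoring values are arbitrary defaults chosen for the example.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Optimal local alignment of a and b under a linear gap penalty.
    Returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):     # one "cell update" per (i, j) pair
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Trace back from the best cell until a zero-scoring cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and h[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + step:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

result = smith_waterman("TTTACGTACGTTT", "GGACGTACGGG")
print(result)  # (14, 'ACGTACG', 'ACGTACG')
```

The nested loops perform len(a) × len(b) cell updates, which is the cost the abstract refers to and the unit behind the GCUPS figure. Cells on the same anti-diagonal do not depend on each other, which is one common way such computations are parallelized.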
Howard, James B.; Kechris, Katerina J.; Rees, Douglas C.; Glazer, Alexander N.
Amino acid residues critical for a protein's structure-function are retained by natural selection and these residues are identified by the level of variance in co-aligned homologous protein sequences. The relevant residues in the nitrogen fixation Component 1 α- and β-subunits were identified by the alignment of 95 protein sequences. Proteins were included from species encompassing multiple microbial phyla and diverse ecological niches as well as the nitrogen fixation genotypes, anf, nif, and vnf, which encode proteins associated with cofactors differing at one metal site. After adjusting for differences in sequence length, insertions, and deletions, the remaining >85% of the sequence co-aligned the subunits from the three genotypes. Six Groups, designated Anf, Vnf , and Nif I-IV, were assigned based upon genetic origin, sequence adjustments, and conserved residues. Both subunits subdivided into the same groups. Invariant and single variant residues were identified and were defined as “core” for nitrogenase function. Three species in Group Nif-III, Candidatus Desulforudis audaxviator, Desulfotomaculum kuznetsovii, and Thermodesulfatator indicus, were found to have a seleno-cysteine that replaces one cysteinyl ligand of the 8Fe:7S, P-cluster. Subsets of invariant residues, limited to individual groups, were identified; these unique residues help identify the gene of origin (anf, nif, or vnf) yet should not be considered diagnostic of the metal content of associated cofactors. Fourteen of the 19 residues that compose the cofactor pocket are invariant or single variant; the other five residues are highly variable but do not correlate with the putative metal content of the cofactor. The variable residues are clustered on one side of the cofactor, away from other functional centers in the three dimensional structure. Many of the invariant and single variant residues were not previously recognized as potentially critical and their identification provides the bases
A primary component of next-generation sequencing analysis is to align short reads to a reference genome, with each read aligned independently. However, reads that observe the same non-reference DNA sequence are highly correlated and can be used to better model the true variation in the target genome. A novel short-read micro re-aligner, SRMA, that leverages this correlation to better resolve a consensus of the underlying DNA sequence of the targeted genome is described here. PMID:20932289
Szalkowski, Adam M.; Anisimova, Maria
Tandem repeats (TRs) are often present in proteins with crucial functions, responsible for resistance, pathogenicity and associated with infectious or neurodegenerative diseases. This motivates numerous studies of TRs and their evolution, requiring accurate multiple sequence alignment. TRs may be lost or inserted at any position of a TR region by replication slippage or recombination, but current methods assume fixed unit boundaries, and yet are of high complexity. We present a new global graph-based alignment method that does not restrict TR unit indels by unit boundaries. TR indels are modeled separately and penalized using the phylogeny-aware alignment algorithm. This ensures enhanced accuracy of reconstructed alignments, disentangling TRs and measuring indel events and rates in a biologically meaningful way. Our method detects not only duplication events but also all changes in TR regions owing to recombination, strand slippage and other events inserting or deleting TR units. We evaluate our method by simulation incorporating TR evolution, by either sampling TRs from a profile hidden Markov model or by mimicking strand slippage with duplications. The new method is illustrated on a family of type III effectors, a pathogenicity determinant in agriculturally important bacteria Ralstonia solanacearum. We show that TR indel rate variation contributes to the diversification of this protein family. PMID:23877246
Liu, George E; Matukumalli, Lakshmi K; Sonstegard, Tad S; Shade, Larry L; Van Tassell, Curtis P
Background: Approximately 11 Mb of finished high quality genomic sequences were sampled from cattle, dog and human to estimate genomic divergences and their regional variation among these lineages. Results: Optimal three-way multi-species global sequence alignments for 84 cattle clones or loci (each >50 kb of genomic sequence) were constructed using the human and dog genome assemblies as references. Genomic divergences and substitution rates were examined for each clone and for various sequence classes under different functional constraints. Analysis of these alignments revealed that the overall genomic divergences are relatively constant (0.32–0.37 change/site) for pairwise comparisons among cattle, dog and human; however, substitution rates vary across genomic regions and among different sequence classes. A neutral mutation rate (2.0–2.2 × 10^-9 change/site/year) was derived from ancestral repetitive sequences, whereas the substitution rate in coding sequences (1.1 × 10^-9 change/site/year) was approximately half of the overall rate (1.9–2.0 × 10^-9 change/site/year). Relative rate tests also indicated that cattle have a significantly faster rate of substitution as compared to dog and that this difference is about 6%. Conclusion: This analysis provides a large-scale and unbiased assessment of genomic divergences and regional variation of substitution rates among cattle, dog and human. It is expected that these data will serve as a baseline for future mammalian molecular evolution studies. PMID:16759380
Ye, Weicai; Chen, Ying; Zhang, Yongdong; Xu, Yuesheng
Sequence alignment is a fundamental problem in bioinformatics. BLAST is a routinely used tool for this purpose, with over 118 000 citations in the past two decades. As the size of bio-sequence databases grows exponentially, the computational speed of alignment software must be improved. We develop the heterogeneous BLAST (H-BLAST), a fast parallel search tool for a heterogeneous computer that couples CPUs and GPUs, to accelerate BLASTX and BLASTP-basic tools of NCBI-BLAST. H-BLAST employs a locally decoupled seed-extension algorithm for better performance on GPUs, and offers a performance tuning mechanism for better efficiency among various CPU and GPU combinations. H-BLAST produces alignment results identical to those of NCBI-BLAST, and its computational speed is much faster. Speedups achieved by H-BLAST over sequential NCBI-BLASTP (resp. NCBI-BLASTX) range mostly from 4 to 10 (resp. 5 to 7.2). With 2 CPU threads and 2 GPUs, H-BLAST can be faster than 16-threaded NCBI-BLASTX. Furthermore, H-BLAST is 1.5-4 times faster than GPU-BLAST. H-BLAST is available at https://github.com/Yeyke/H-BLAST.git. Supplementary data are available at Bioinformatics online.
Hadzipasic, Omar; Wrabl, James O; Hilser, Vincent J
An algorithm is presented that returns the optimal pairwise gapped alignment of two sets of signed numerical sequence values. One distinguishing feature of this algorithm is a flexible comparison engine (based on both relative shape and absolute similarity measures) that does not rely on explicit gap penalties. Additionally, an empirical probability model is developed to estimate the significance of the returned alignment with respect to randomized data. The algorithm's utility for biological hypothesis formulation is demonstrated with test cases including database search and pairwise alignment of protein hydropathy. However, the algorithm and probability model could possibly be extended to accommodate other diverse types of protein or nucleic acid data, including positional thermodynamic stability and mRNA translation efficiency. The algorithm requires only numerical values as input and will readily compare data other than protein hydropathy. The tool is therefore expected to complement, rather than replace, existing sequence and structure based tools and may inform medical discovery, as exemplified by proposed similarity between a chlamydial ORFan protein and bacterial colicin pore-forming domain. The source code, documentation, and a basic web-server application are available.
Kinjo, Akira R.
The multiple sequence alignment (MSA) of a protein family provides a wealth of information in terms of the conservation pattern of amino acid residues not only at each alignment site but also between distant sites. In order to statistically model the MSA incorporating both short-range and long-range correlations as well as insertions, I have derived a lattice gas model of the MSA based on the principle of maximum entropy. The partition function, obtained by the transfer matrix method with a mean-field approximation, accounts for all possible alignments with all possible sequences. The model parameters for short-range and long-range interactions were determined by a self-consistent condition and by a Gaussian approximation, respectively. Using this model with and without long-range interactions, I analyzed the globin and V-set domains by increasing the “temperature” and by “mutating” a site. The correlations between residue conservation and various measures of the system’s stability indicate that the long-range interactions make the conservation pattern more specific to the structure, and increasingly stabilize better conserved residues. PMID:27924257
Thankaswamy-Kosalai, Subazini; Sen, Partho; Nookaew, Intawat
The massive data produced since the advent of next-generation sequencing (NGS) technology are widely used for biological research and medical diagnosis. The crucial step in NGS analysis is read alignment or mapping, which is computationally intensive and complex. Mapping bias tends to affect the downstream analysis, including detection of polymorphisms. In order to provide guidelines to the biologist for suitable selection of aligners, we have evaluated and benchmarked 5 different aligners (BWA, Bowtie2, NovoAlign, Smalt and Stampy) and their mapping bias based on characteristics of 5 microbial genomes. Two million simulated read pairs of various sizes (36bp, 50bp, 72bp, 100bp, 125bp, 150bp, 200bp, 250bp and 300bp) were aligned. Specific alignment features such as sensitivity of mapping, percentage of properly paired reads, alignment time and effect of tandem repeats on incorrectly mapped reads were evaluated. BWA showed the fastest alignment, followed by Bowtie2 and Smalt. NovoAlign and Stampy were comparatively slower. Most of the aligners showed high sensitivity when mapping long reads (>100bp). NovoAlign, on the other hand, showed higher sensitivity when mapping both short reads (36bp, 50bp, 72bp) and long reads (>100bp). It also showed higher sensitivity when mapping a complex genome like Plasmodium falciparum. The percentages of properly paired reads aligned by NovoAlign, BWA and Stampy were markedly higher. None of the aligners outperforms the others across the benchmark; however, the aligners perform differently depending on genome characteristics. We expect that the results from this study will help the end user choose an aligner and thus enhance the accuracy of read mapping. Copyright © 2017 Elsevier Inc. All rights reserved.
Indonesia is the 3rd largest cocoa-producing country in the world, with an annual cacao bean production of 572,000 tons. The cacao varieties currently cultivated in Indonesia are inter-hybrids of various clones introduced from the Americas since the 16th century. Among them, “Java cocoa” is a wel...
Ye, Hao; Meehan, Joe; Tong, Weida; Hong, Huixiao
Precision medicine, or personalized medicine, has been proposed as a modernized and promising medical strategy. Genetic variants of patients are the key information for implementation of precision medicine. Next-generation sequencing (NGS) is an emerging technology for deciphering genetic variants. Alignment of raw reads to a reference genome is one of the key steps in NGS data analysis. Many algorithms have been developed for alignment of short read sequences since 2008. Users have to decide which alignment algorithm to use in their studies. Making the right selection involves choosing not only the alignment algorithm but also the set of suitable parameters to be used by the algorithm. Understanding these algorithms helps in selecting the appropriate alignment algorithm for different applications in precision medicine. Here, we review currently available algorithms and their major strategies, such as seed-and-extend and q-gram filter. We also discuss the challenges in current alignment algorithms, including alignment in multiple repeated regions, long-read alignment and alignment facilitated with known genetic variants. PMID:26610555
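The seed-and-extend strategy named in the abstract above can be illustrated with a minimal sketch. It is not taken from the review or from any particular aligner: the function names, the non-overlapping seed layout, and the ungapped mismatch-count verification are all assumptions chosen for brevity.

```python
from collections import defaultdict


def build_seed_index(reference, k):
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def seed_and_extend(read, reference, index, k, max_mismatches):
    """Return sorted reference start positions where the read aligns
    without gaps and with at most max_mismatches mismatches."""
    hits = set()
    # Seed step: look up non-overlapping k-mers of the read in the index.
    for offset in range(0, len(read) - k + 1, k):
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue  # candidate window would fall off the reference
            # Extend step: verify the whole read against the candidate window.
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.add(start)
    return sorted(hits)


if __name__ == "__main__":
    reference = "ACGTACGTTAGCCGATAGGCTTACG"
    index = build_seed_index(reference, k=4)
    print(seed_and_extend("TAGCCGAT", reference, index, k=4, max_mismatches=1))
```

With non-overlapping seeds, a read carrying at most m mismatches is certain to contain one exact seed only if it is long enough to hold m + 1 seeds (the pigeonhole argument), which is one reason the choice of seed length is tied to the tolerated error rate.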
Heideman, Simone G; van Ede, Freek; Nobre, Anna C
Performance improves when participants respond to events that are structured in repeating sequences, suggesting that learning can lead to proactive anticipatory preparation. Whereas most sequence-learning studies have emphasized spatial structure, most sequences also contain a prominent temporal structure. We used MEG to investigate spatial and temporal anticipatory neural dynamics in a modified serial reaction-time (SRT) task. Performance and brain activity were compared between blocks with learned spatial-temporal sequences and blocks with new sequences. After confirming a strong behavioural benefit of spatial-temporal predictability, we show lateralisation of beta oscillations in anticipation of the response associated with the upcoming target location, and show that this also aligns to the expected timing of these forthcoming events. This effect was found both when comparing between repeated (learned) and new (unlearned) sequences, as well as when comparing targets that were expected after short vs. long intervals within the repeated (learned) sequence. Our findings suggest that learning of spatial-temporal structure leads to proactive and dynamic modulation of motor cortical excitability in anticipation of both the location and timing of events that are relevant to guide action. This article is protected by copyright. All rights reserved.
Xu, Liang; Du, Junping; Zhang, Zhenhong
Multiexposure fusion images have a higher dynamic range and reveal more details than a single captured image of a real-world scene. A clear and intuitive feature-based fusion technique for multiexposure image sequences is conceptually proposed. The main idea of the proposed method is to combine three image features [phase congruency (PC), local contrast, and color saturation] to obtain weight maps of the images. Then, the weight maps are further refined using a guided filter which can improve their accuracy. The final fusion result is constructed using the weighted sum of the source image sequence. In addition, for multiexposure image-sequence fusion involving dynamic scenes containing moving objects, ghost artifacts can easily occur if fusion is directly performed. Therefore, an image-alignment method is first used to adjust the input images to correspond to a reference image, after which fusion is performed. Experimental results demonstrate that the proposed method has a superior performance compared to the existing methods.
Hung, Che-Lun; Lin, Yu-Shiang; Lin, Chun-Yuan; Chung, Yeh-Ching; Chung, Yi-Fang
For biological applications, sequence alignment is an important strategy to analyze DNA and protein sequences. Multiple sequence alignment is an essential methodology for studying biological data, with applications such as homology modeling and phylogenetic reconstruction. However, multiple sequence alignment is an NP-hard problem. In the past decades, the progressive approach has been proposed to successfully align multiple sequences by adopting iterative pairwise alignments. Due to the rapid growth of next generation sequencing technologies, a large number of sequences can be produced in a short period of time. When the problem instance is large, progressive alignment becomes time consuming. Parallel computing is a suitable solution for such applications, and the GPU is one of the important architectures for contemporary parallel computing research. Therefore, we propose a GPU version of ClustalW v2.0.11, called CUDA ClustalW v1.0, in this work. The experimental results show that CUDA ClustalW v1.0 can achieve more than 33× speedup in overall execution time compared to ClustalW v2.0.11.
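The pairwise alignment step on which progressive methods rely can be sketched as a plain Needleman-Wunsch dynamic program. This is an illustration only, not the ClustalW or CUDA ClustalW code: the function name and the scoring values (match, mismatch, linear gap penalty) are assumptions, and real tools use substitution matrices and affine gap costs.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

Each cell of the table depends only on its three upper-left neighbours, so cells on the same anti-diagonal are independent of one another. That independence is what makes this kind of computation a natural candidate for GPU execution.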
Wright, Imogen A.; Travers, Simon A.
The challenge presented by high-throughput sequencing necessitates the development of novel tools for accurate alignment of reads to reference sequences. Current approaches focus on using heuristics to map reads quickly to large genomes, rather than generating highly accurate alignments in coding regions. Such approaches are, thus, unsuited for applications such as amplicon-based analysis and the realignment phase of exome sequencing and RNA-seq, where accurate and biologically relevant alignment of coding regions is critical. To facilitate such analyses, we have developed a novel tool, RAMICS, that is tailored to mapping large numbers of sequence reads to short lengths (<10 000 bp) of coding DNA. RAMICS utilizes profile hidden Markov models to discover the open reading frame of each sequence and aligns to the reference sequence in a biologically relevant manner, distinguishing between genuine codon-sized indels and frameshift mutations. This approach facilitates the generation of highly accurate alignments, accounting for the error biases of the sequencing machine used to generate reads, particularly at homopolymer regions. Performance improvements are gained through the use of graphics processing units, which increase the speed of mapping through parallelization. RAMICS substantially outperforms all other mapping approaches tested in terms of alignment quality while maintaining highly competitive speed performance. PMID:24861618
Firnkorn, Daniel; Knaup-Gregori, Petra; Lorenzo Bermejo, Justo; Ganzinger, Matthias
In times of high-throughput DNA sequencing techniques, high-performance analysis of DNA sequences is of great importance. Computer-supported DNA analysis is still a time-consuming task. In this paper we explore the potential of a new in-memory database technology by using SAP's High Performance Analytic Appliance (HANA). We focus on read alignment as one of the first steps in DNA sequence analysis. In particular, we examined the widely used Burrows-Wheeler Aligner (BWA) and implemented stored procedures in both HANA and the free database system MySQL to compare execution time and memory management. To ensure that the results are comparable, MySQL was also run in memory, utilizing its integrated memory engine for database table creation. We implemented stored procedures containing exact and inexact searching of DNA reads within the reference genome GRCh37. Due to technical restrictions in SAP HANA concerning recursion, the inexact matching problem could not be implemented on this platform. Hence, the performance of HANA and MySQL was compared using the execution time of the exact search procedures. Here, HANA was approximately 27 times faster than MySQL, which indicates a high potential within the new in-memory concepts, leading to further developments of DNA analysis procedures in the future.
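For orientation, the exact-search step can be sketched as the backward search over a Burrows-Wheeler transform that underlies BWA-style exact matching. This is a small Python illustration, not the authors' stored procedures: the function names are hypothetical, and the naive suffix-array construction is only suitable for toy inputs, not a genome such as GRCh37.

```python
def build_fm_index(reference):
    """Build a toy FM-index: suffix array, C table and occurrence counts."""
    text = reference + "$"  # "$" is a sentinel that sorts before A, C, G, T
    # Naive suffix array construction (fine for short demo strings only).
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # C[c] = number of characters in the text that are strictly smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, C, occ


def exact_search(read, fm):
    """Return sorted start positions of exact occurrences of read."""
    sa, C, occ = fm
    if not read:
        return []
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    # Process the read right to left, narrowing the row range each step.
    for ch in reversed(read):
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


if __name__ == "__main__":
    fm = build_fm_index("ACGTACGTTAGC")
    print(exact_search("ACGT", fm))
```

The loop in `exact_search` is iterative and needs one step per read base. Inexact matching is usually formulated as a recursive or backtracking search that branches on substitutions and indels at each step, which is consistent with the abstract's remark that recursion limits prevented an inexact implementation on one platform.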
O'Driscoll, Aisling; Belogrudov, Vladislav; Carroll, John; Kropp, Kai; Walsh, Paul; Ghazal, Peter; Sleator, Roy D
The recent exponential growth of genomic databases has resulted in the common task of sequence alignment becoming one of the major bottlenecks in the field of computational biology. It is typical for these large datasets and complex computations to require cost prohibitive High Performance Computing (HPC) to function. As such, parallelised solutions have been proposed but many exhibit scalability limitations and are incapable of effectively processing "Big Data" - the name attributed to datasets that are extremely large, complex and require rapid processing. The Hadoop framework, comprised of distributed storage and a parallelised programming framework known as MapReduce, is specifically designed to work with such datasets but it is not trivial to efficiently redesign and implement bioinformatics algorithms according to this paradigm. The parallelisation strategy of "divide and conquer" for alignment algorithms can be applied to both data sets and input query sequences. However, scalability is still an issue due to memory constraints or large databases, with very large database segmentation leading to additional performance decline. Herein, we present Hadoop Blast (HBlast), a parallelised BLAST algorithm that proposes a flexible method to partition both databases and input query sequences using "virtual partitioning". HBlast presents improved scalability over existing solutions and well balanced computational work load while keeping database segmentation and recompilation to a minimum. Enhanced BLAST search performance on cheap memory constrained hardware has significant implications for in field clinical diagnostic testing; enabling faster and more accurate identification of pathogenic DNA in human blood or tissue samples.
Khan, Mohammad Ibrahim; Kamal, Md Sarwar
Markov chains are very effective for prediction, particularly on long data sets. In DNA sequencing it is important to determine the existence of certain nucleotides based on the previous history of the data set. We applied the Chapman-Kolmogorov equation to accomplish the Markov chain task. The Chapman-Kolmogorov equation is the key to addressing the proper places in the DNA chain, and it is a very powerful tool in mathematics as well as in other prediction-based research. It incorporates the scores of DNA sequences calculated by various techniques. Our research utilizes the fundamentals of the Warshall Algorithm (WA) and Dynamic Programming (DP) to measure the score of DNA segments. The experiments show that the Warshall Algorithm is good for small DNA sequences, whereas Dynamic Programming is good for long DNA sequences. In addition to these findings, it is very important to measure the risk factors of local sequencing during the matching of local sequence alignments, whatever the length.
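For reference, the Chapman-Kolmogorov equation mentioned above can be written out for a time-homogeneous Markov chain. The abstract does not define its state space or notation, so the nucleotide states S = {A, C, G, T} and the symbols below are assumptions: p_ij^(n) is the probability of moving from state i to state j in n steps, and P is the one-step transition matrix.

```latex
% Chapman-Kolmogorov equation, time-homogeneous chain on S = {A, C, G, T}
\[
  p_{ij}^{(m+n)} \;=\; \sum_{k \in S} p_{ik}^{(m)}\, p_{kj}^{(n)},
  \qquad\text{equivalently}\qquad
  P^{(m+n)} \;=\; P^{(m)} P^{(n)} \;=\; P^{\,m+n}.
\]
```

In words, the probability of going from nucleotide i to nucleotide j in m + n steps is obtained by summing over every intermediate nucleotide k reached after the first m steps.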
Larson, S. M.; Davidson, A. R.
The SH3 domain, comprised of approximately 60 residues, is found within a wide variety of proteins, and is a mediator of protein-protein interactions. Due to the large number of SH3 domain sequences and structures in the databases, this domain provides one of the best available systems for the examination of sequence and structural conservation within a protein family. In this study, a large and diverse alignment of SH3 domain sequences was constructed, and the pattern of conservation within this alignment was compared to conserved structural features, as deduced from analysis of eighteen different SH3 domain structures. Seventeen SH3 domain structures solved in the presence of bound peptide were also examined to identify positions that are consistently most important in mediating the peptide-binding function of this domain. Although residues at the two most conserved positions in the alignment are directly involved in peptide binding, residues at most other conserved positions play structural roles, such as stabilizing turns or comprising the hydrophobic core. Surprisingly, several highly conserved side-chain to main-chain hydrogen bonds were observed in the functionally crucial RT-Src loop between residues with little direct involvement in peptide binding. These hydrogen bonds may be important for maintaining this region in the precise conformation necessary for specific peptide recognition. In addition, a previously unrecognized yet highly conserved beta-bulge was identified in the second beta-strand of the domain, which appears to provide a necessary kink in this strand, allowing it to hydrogen bond to both sheets comprising the fold. PMID:11152127
Salama, Rafik A; Stekel, Dov J
Multiple sequence alignments (MSAs) are usually scored under the assumption that the sequences being aligned have evolved by common descent. Consequently, the differences between sequences reflect the impact of insertions, deletions and mutations. However, non-coding DNA binding sequences, such as transcription factor binding sites (TFBSs), are frequently not related by common descent, and so the existing alignment scoring methods are not well suited for aligning such sequences. We present a novel MSA methodology that scores TFBS DNA sequences by including the interdependence of neighboring bases. We introduced two variants supported by different underlying null hypotheses, one statistically and the other thermodynamically generated. We assessed the alignments through their performance in TFBS prediction; both methods show considerable improvements when compared with standard MSA algorithms. Moreover, the thermodynamically generated null hypothesis outperforms the statistical one due to improved stability in the base stacking free energy of the alignment. The thermodynamically generated null hypothesis method can be downloaded from http://sourceforge.net/projects/msa-edna/. Supplementary data are available at Bioinformatics online.
Kwak, Daniel; Kam, Alfred; Becerra, David; Zhou, Qikuan; Hops, Adam; Zarour, Eleyine; Kam, Arthur; Sarmenta, Luis; Blanchette, Mathieu; Waldispühl, Jérôme
Citizen science games such as Galaxy Zoo, Foldit, and Phylo aim to harness the intelligence and processing power generated by crowds of online gamers to solve scientific problems. However, the selection of the data to be analyzed through these games is under the exclusive control of the game designers, and so are the results produced by gamers. Here, we introduce Open-Phylo, a freely accessible crowd-computing platform that enables any scientist to enter our system and use crowds of gamers to assist computer programs in solving one of the most fundamental problems in genomics: the multiple sequence alignment problem. PMID:24148814
Chapman, Michael A.; Donaldson, Ian J.; Gilbert, James; Grafham, Darren; Rogers, Jane; Green, Anthony R.; Göttgens, Berthold
Comparative analysis of genomic sequences is becoming a standard technique for studying gene regulation. However, only a limited number of tools are currently available for the analysis of multiple genomic sequences. An extensive data set for the testing and training of such tools is provided by the SCL gene locus. Here we have expanded the data set to eight vertebrate species by sequencing the dog SCL locus and by annotating the dog and rat SCL loci. To provide a resource for the bioinformatics community, all SCL sequences and functional annotations, comprising a collation of the extensive experimental evidence pertaining to SCL regulation, have been made available via a Web server. A Web interface to new tools specifically designed for the display and analysis of multiple sequence alignments was also implemented. The unique SCL data set and new sequence comparison tools allowed us to perform a rigorous examination of the true benefits of multiple sequence comparisons. We demonstrate that multiple sequence alignments are, overall, superior to pairwise alignments for identification of mammalian regulatory regions. In the search for individual transcription factor binding sites, multiple alignments markedly increase the signal-to-noise ratio compared to pairwise alignments. PMID:14718377
Gentry, M.K.; Doctor, B.P.; Cygler, M.; Schrag, J.D.; Sussman, J.L.
Acetylcholinesterase and butyrylcholinesterase, enzymes with potential as pretreatment drugs for organophosphate toxicity, are members of a larger family of homologous proteins that includes carboxylesterases, cholesterol esterases, lipases, and several nonhydrolytic proteins. A computer-generated alignment of 18 of the proteins, the acetylcholinesterases, butyrylcholinesterases, carboxylesterases, some esterases, and the nonenzymatic proteins has been previously presented. More recently, the three-dimensional structures of two enzymes in this group, acetylcholinesterase from Torpedo californica and lipase from Geotrichum candidum, have been determined. Based on the x-ray structures and the superposition of these two enzymes, it was possible to obtain an improved amino acid sequence alignment of 32 members of this family of proteins. Examination of this alignment reveals that 24 amino acids are invariant in all of the hydrolytic proteins, and an additional 49 are well conserved. Conserved amino acids include those of the active site, the disulfide bridges, the salt bridges, in the core of the proteins, and at the edges of secondary structural elements. Comparison of the three-dimensional structures makes it possible to find a well-defined structural basis for the conservation of many of these amino acids.
Gómez, Antonio; Cedano, Juan; Espadaler, Jordi; Hermoso, Antonio; Piñol, Jaume; Querol, Enrique
The functional annotation of new protein sequences represents a major challenge for genomic science. The best way to suggest the function of a protein from its sequence is by finding a related one for which biological information is available. Current alignment algorithms display a list of protein sequence stretches presenting significant similarity to different protein targets, ordered by their respective mathematical scores. However, statistical and biological significance do not always coincide; therefore, rearranging the program output according to characteristics more biological than the mathematical scoring would help functional annotation. A new method is described that predicts the putative function of a protein by integrating the results from the PSI-BLAST program with a fuzzy logic algorithm. Several protein sequence characteristics were tested for their ability to rearrange a PSI-BLAST profile more in accordance with biological function. Four of them (amino acid content, matched segment length, and hydropathic and flexibility profiles) contributed positively, upon being integrated by a fuzzy logic algorithm into a program, BYPASS, to the accurate prediction of the function of a protein from its sequence.
Diaz, Maria; Ladero, Victor; Redruello, Begoña; Sanchez-Llana, Esther; del Rio, Beatriz; Fernandez, Maria; Martin, Maria Cruz; Alvarez, Miguel A.
The decarboxylation of histidine, carried out mainly by some Gram-positive bacteria, yields the toxic dietary biogenic amine histamine (Ladero et al. 2010 〈10.2174/157340110791233256〉, Linares et al. 2016 〈http://dx.doi.org/10.1016/j.foodchem.2015.11.013〉). The reaction is catalyzed by a pyruvoyl-dependent histidine decarboxylase (Linares et al. 2011 〈10.1080/10408398.2011.582813〉), which is encoded by the gene hdcA. In order to locate conserved regions in the hdcA gene of Gram-positive bacteria, this article provides a nucleotide sequence alignment of all the hdcA sequences from Gram-positive bacteria present in databases. For further utility and discussion, see 〈http://dx.doi.org/10.1016/j.foodcont.2015.11.035〉. PMID:26958625
Pratas, Diogo; Silva, Raquel M.; Pinho, Armando J.; Ferreira, Paulo J.S.G.
Species evolution is indirectly registered in their genomic structure. The emergence and advances in sequencing technology provided a way to access genome information, namely to identify and study evolutionary macro-events, as well as chromosome alterations for clinical purposes. This paper describes a completely alignment-free computational method, based on a blind unsupervised approach, to detect large-scale and small-scale genomic rearrangements between pairs of DNA sequences. To illustrate the power and usefulness of the method we give complete chromosomal information maps for the pairs human-chimpanzee and human-orangutan. The tool by means of which these results were obtained has been made publicly available and is described in detail. PMID:25984837
Background: Covariance models (CMs) are probabilistic models of RNA secondary structure, analogous to profile hidden Markov models of linear sequence. The dynamic programming algorithm for aligning a CM to an RNA sequence of length N is O(N^3) in memory. This is only practical for small RNAs. Results: I describe a divide and conquer variant of the alignment algorithm that is analogous to memory-efficient Myers/Miller dynamic programming algorithms for linear sequence alignment. The new algorithm has an O(N^2 log N) memory complexity, at the expense of a small constant factor in time. Conclusions: Optimal ribosomal RNA structural alignments that previously required up to 150 GB of memory now require less than 270 MB. PMID:12095421
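The linear-sequence analogue cited in the abstract above can be sketched to show where the memory saving of divide and conquer comes from. This is not the covariance-model algorithm itself: it is a minimal illustration of the Hirschberg-style idea that Myers and Miller adapted, assuming a linear gap penalty, and the function names and scoring values are hypothetical.

```python
def nw_last_row(a, b, match=1, mismatch=-1, gap=-2):
    """Last row of the global-alignment DP table for a versus b.

    Only two rows are kept at any time, so memory is O(len(b))
    instead of O(len(a) * len(b)).
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev


def best_split(a, b):
    """Find where an optimal alignment crosses the middle of a.

    Returns (j, score): the first half of a aligns with b[:j], the second
    half with b[j:], and score is the optimal global alignment score.
    """
    mid = len(a) // 2
    forward = nw_last_row(a[:mid], b)
    # Aligning the reversed suffixes gives scores for a[mid:] versus b[j:].
    backward = nw_last_row(a[mid:][::-1], b[::-1])
    m = len(b)
    totals = [forward[j] + backward[m - j] for j in range(m + 1)]
    j = max(range(m + 1), key=totals.__getitem__)
    return j, totals[j]


if __name__ == "__main__":
    print(best_split("ACGT", "AGT"))
```

Applying `best_split` recursively to the two halves recovers a full optimal alignment without ever storing the complete table. The trade is extra recomputation in exchange for less memory, which matches the abstract's description of a small constant-factor cost in time.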
Gudyś, Adam; Deorowicz, Sebastian
Multiple sequence alignment is a crucial task in a number of biological analyses like secondary structure prediction, domain searching, phylogeny, etc. MSAProbs is currently the most accurate alignment algorithm, but its effectiveness is obtained at the expense of computational time. In the paper we present QuickProbs, the variant of MSAProbs customised for graphics processors. We selected the two most time consuming stages of MSAProbs to be redesigned for GPU execution: the posterior matrices calculation and the consistency transformation. Experiments on three popular benchmarks (BAliBASE, PREFAB, OXBench-X) on quad-core PC equipped with high-end graphics card show QuickProbs to be 5.7 to 9.7 times faster than original CPU-parallel MSAProbs. Additional tests performed on several protein families from Pfam database give overall speed-up of 6.7. Compared to other algorithms like MAFFT, MUSCLE, or ClustalW, QuickProbs proved to be much more accurate at similar speed. Additionally we introduce a tuned variant of QuickProbs which is significantly more accurate on sets of distantly related sequences than MSAProbs without exceeding its computation time. The GPU part of QuickProbs was implemented in OpenCL, thus the package is suitable for graphics processors produced by all major vendors. PMID:24586435
Zemali, El-Amine; Boukra, Abdelmadjid
Multiple sequence alignment (MSA) is one of the most challenging problems in bioinformatics; it involves discovering similarity between a set of protein or DNA sequences. This paper introduces a new method for the MSA problem called biogeography-based optimization with multiple populations (BBOMP). It is based on a recent metaheuristic inspired by the mathematics of biogeography, named biogeography-based optimization (BBO). To improve the exploration ability of BBO, we introduce a new concept that allows better exploration of the search space: manipulating multiple populations, each with its own parameters. These parameters are used to build progressive alignments, allowing more diversity. At each iteration, the best solution found is injected into each population. Moreover, to improve solution quality, six operators are defined. These operators are selected with a dynamic probability that changes according to each operator's efficiency. To test the performance of the proposed approach, we considered a set of datasets from BAliBASE 2.0 and compared it with recent algorithms such as GAPAM, MSA-GA, QEAMSA and RBT-GA. The results show that the proposed approach achieves a better average score than the previously cited methods.
Semegni, J Y; Wamalwa, M; Gaujoux, R; Harkins, G W; Gray, A; Martin, D P
Many natural nucleic acid sequences have evolutionarily conserved secondary structures with diverse biological functions. A reliable computational tool for identifying such structures would be very useful in guiding experimental analyses of their biological functions. NASP (Nucleic Acid Structure Predictor) is a program that takes into account thermodynamic stability, Boltzmann base pair probabilities, alignment uncertainty, covarying sites and evolutionary conservation to identify biologically relevant secondary structures within multiple sequence alignments. Unique to NASP is the consideration of all this information together with a recursive permutation-based approach to progressively identify and list the most conserved probable secondary structures that are likely to have the greatest biological relevance. By focusing on identifying only evolutionarily conserved structures, NASP forgoes the prediction of complete nucleotide folds but outperforms various other secondary structure prediction methods in its ability to selectively identify actual base pairings. Downloadable and web-based versions of NASP are freely available at http://web.cbio.uct.ac.za/~yves/nasp_portal.php. Supplementary data are available at Bioinformatics online.
Huo, Hongwei; Xie, Qiaoluan; Shen, Xubang; Stojkovic, Vojislav
This paper presents an original Quantum Genetic algorithm for Multiple sequence ALIGNment (QGMALIGN) that combines a genetic algorithm and a quantum algorithm. A quantum probabilistic coding is designed for representing the multiple sequence alignment. A quantum rotation gate is used as a mutation operator to guide the quantum state evolution. Six genetic operators are designed on the basis of the coding to improve the solution during the evolutionary process. The features of implicit parallelism and state superposition in quantum mechanics and the global search capability of the genetic algorithm are exploited to obtain efficient computation. A set of well-known test cases from BAliBASE 2.0 is used as a reference to evaluate the efficiency of the QGMALIGN optimization. The QGMALIGN results have been compared with those of the most popular methods (CLUSTALX, SAGA, DIALIGN, SB_PIMA, and QGMALIGN). The results show that QGMALIGN performs well on the presented biological data. The addition of genetic operators to the quantum algorithm lowers the overall running time.
Sul, Woo Jun; Cole, James R.; Jesus, Ederson da C.; Wang, Qiong; Farris, Ryan J.; Fish, Jordan A.; Tiedje, James M.
High-throughput sequencing of 16S rRNA genes has increased our understanding of microbial community structure, but now even higher-throughput methods to the Illumina scale allow the creation of much larger datasets with more samples and orders-of-magnitude more sequences that swamp current analytic methods. We developed a method capable of handling these larger datasets on the basis of assignment of sequences into an existing taxonomy using a supervised learning approach (taxonomy-supervised analysis). We compared this method with a commonly used clustering approach based on sequence similarity (taxonomy-unsupervised analysis). We sampled 211 different bacterial communities from various habitats and obtained ∼1.3 million 16S rRNA sequences spanning the V4 hypervariable region by pyrosequencing. Both methodologies gave similar ecological conclusions in that β-diversity measures calculated by using these two types of matrices were significantly correlated to each other, as were the ordination configurations and hierarchical clustering dendrograms. In addition, our taxonomy-supervised analyses were also highly correlated with phylogenetic methods, such as UniFrac. The taxonomy-supervised analysis has the advantages that it is not limited by the exhaustive computation required for the alignment and clustering necessary for the taxonomy-unsupervised analysis, is more tolerant of sequencing errors, and allows comparisons when sequences are from different regions of the 16S rRNA gene. With the tremendous expansion in 16S rRNA data acquisition underway, the taxonomy-supervised approach offers the potential to provide more rapid and extensive community comparisons across habitats and samples. PMID:21873204
Phinney, Eric J.; Mann, Paul; Coffin, Millard F.; Shipley, Thomas H.
Possibilities for the fate of oceanic plateaus at subduction zones range from complete subduction of the plateau beneath the arc to complete plateau-arc accretion and resulting collisional orogenesis. Deep-penetration, multi-channel seismic reflection (MCS) data from the northern flank of the Solomon Islands reveal the sequence stratigraphy, structural style, and age of deformation of an accretionary prism formed during late Neogene (5-0 Ma) convergence between the ~33-km-thick crust of the Ontong Java oceanic plateau and the ~15-km-thick Solomon island arc. Correlation of MCS data with the satellite-derived, free-air gravity field defines the tectonic boundaries and internal structure of the 800-km-long, 140-km-wide accretionary prism. We name this prism the "Malaita accretionary prism" or "MAP" after Malaita, the largest and best-studied island exposure of the accretionary prism in the Solomon Islands. MCS data, gravity data, and stratigraphic correlations to islands and ODP sites on the Ontong Java Plateau (OJP) reveal that the offshore MAP is composed of folded and thrust-faulted sedimentary rocks and upper crystalline crust offscraped from the subducting Ontong Java Plateau (Pacific plate) and transferred to the Solomon arc. With the exception of an upper sequence of Quaternary? island-derived terrigenous sediments, the deformed stratigraphy of the MAP is identical to that of the incoming Ontong Java Plateau in the North Solomon trench. We divide the MAP into four distinct, folded and thrust fault-bounded structural domains interpreted to have formed by diachronous, southeast-to-northwest, and highly oblique entry of the Ontong Java Plateau into a former trench now marked by the Kia-Kaipito-Korigole (KKK) left-lateral strike-slip fault zone along the suture between the Solomon arc and the MAP. The structural style within each of the four structural domains consists of a parallel series of three to four fault propagation folds formed by the
May, Alex C W
It is often possible to identify sequence motifs that characterize a protein family in terms of its fold and/or function from aligned protein sequences. Such motifs can be used to search for new family members. Partitioning of sequence alignments into regions of similar amino acid variability is usually done by hand. Here, I present a completely automatic method for this purpose: one that is guaranteed to produce globally optimal solutions at all levels of partition granularity. The method is used to compare the tempo of sequence diversity across reliable three-dimensional (3D) structure-based alignments of 209 protein families (HOMSTRAD) and that for 69 superfamilies (CAMPASS). (The mean alignment lengths for HOMSTRAD and CAMPASS are very similar.) Surprisingly, the optimal segmentation distributions for the closely related proteins and the distantly related ones are found to be very similar. Also, optimal segmentation identifies an unusual protein superfamily. Finally, protein 3D structure clues from the tempo of sequence diversity across alignments are examined. The method is general, and could be applied to any area of comparative biological sequence and 3D structure analysis where the constraint of the inherent linear organization of the data imposes an ordering on the set of objects to be clustered.
Newkirk, Daniel; Biesinger, Jacob; Chon, Alvin; Yokomori, Kyoko; Xie, Xiaohui
High-throughput sequencing coupled to chromatin immunoprecipitation (ChIP-Seq) is widely used in characterizing genome-wide binding patterns of transcription factors, cofactors, chromatin modifiers, and other DNA binding proteins. A key step in ChIP-Seq data analysis is to map short reads from high-throughput sequencing to a reference genome and identify peak regions enriched with short reads. Although several methods have been proposed for ChIP-Seq analysis, most existing methods only consider reads that can be uniquely placed in the reference genome, and therefore have low power for detecting peaks located within repeat sequences. Here we introduce a probabilistic approach for ChIP-Seq data analysis which utilizes all reads, providing a truly genome-wide view of binding patterns. Reads are modeled using a mixture model corresponding to K enriched regions and a null genomic background. We use maximum likelihood to estimate the locations of the enriched regions, and implement an expectation-maximization (E-M) algorithm, called AREM (aligning reads by expectation maximization), to update the alignment probabilities of each read to different genomic locations. We apply the algorithm to identify genome-wide binding events of two proteins: Rad21, a component of cohesin and a key factor involved in chromatid cohesion, and Srebp-1, a transcription factor important for lipid/cholesterol homeostasis. Using AREM, we were able to identify 19,935 Rad21 peaks and 1,748 Srebp-1 peaks in the mouse genome with high confidence, including 1,517 (7.6%) Rad21 peaks and 227 (13%) Srebp-1 peaks that were missed using only uniquely mapped reads. The open source implementation of our algorithm is available at http://sourceforge.net/projects/arem
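The expectation-maximization idea described above, in which ambiguous reads are shared among candidate regions in proportion to how enriched those regions currently appear, can be illustrated with a toy sketch. This is not the AREM implementation: the function name `em_multiread`, the input layout (one dictionary of candidate regions per read), and the omission of the null genomic background are all simplifying assumptions made for illustration.

```python
def em_multiread(reads, n_regions, iterations=50):
    """Toy EM for sharing multi-mapped reads among candidate regions.

    reads: list of dicts mapping region index -> alignment likelihood (> 0).
    Returns (weights, responsibilities). Hypothetical helper, not AREM itself;
    the null background component of the mixture is left out for brevity.
    """
    weights = [1.0 / n_regions] * n_regions  # start from uniform enrichment
    responsibilities = []
    for _ in range(iterations):
        totals = [0.0] * n_regions
        responsibilities = []
        # E-step: posterior probability that each read came from each region
        for candidates in reads:
            unnorm = {j: weights[j] * lik for j, lik in candidates.items()}
            z = sum(unnorm.values())
            post = {j: v / z for j, v in unnorm.items()}
            responsibilities.append(post)
            for j, v in post.items():
                totals[j] += v
        # M-step: region weight = expected fraction of reads assigned to it
        weights = [t / len(reads) for t in totals]
    return weights, responsibilities


# Three reads unique to region 0, one unique to region 1, two ambiguous reads
reads = [{0: 1.0}] * 3 + [{1: 1.0}] + [{0: 1.0, 1: 1.0}] * 2
weights, resp = em_multiread(reads, n_regions=2)
```

In this example the two ambiguous reads end up assigned mostly to region 0, because the uniquely mapped reads already make that region look more enriched; discarding them, as unique-only pipelines do, would throw that information away.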
Debe, Derek A; Danzer, Joseph F; Goddard, William A; Poleksic, Aleksandar
STRUCTFAST is a novel profile-profile alignment algorithm capable of detecting weak similarities between protein sequences. The increased sensitivity and accuracy of the STRUCTFAST method are achieved through several unique features. First, the algorithm utilizes a novel dynamic programming engine capable of incorporating important information from a structural family directly into the alignment process. Second, the algorithm employs a rigorous analytical formula for profile-profile scoring to overcome the limitations of ad hoc scoring functions that require adjustable parameter training. Third, the algorithm employs Convergent Island Statistics (CIS) to compute the statistical significance of alignment scores independently for each pair of sequences. STRUCTFAST routinely produces alignments that meet or exceed the quality obtained by an expert human homology modeler, as evidenced by its performance in the latest CAFASP4 and CASP6 blind prediction benchmark experiments.
Gallus, S; Kumar, V; Bertelsen, M F; Janke, A; Nilsson, M A
Ruminantia, the ruminating, hoofed mammals (cow, deer, giraffe and allies) are an unranked artiodactylan clade. Around 50-60 million years ago the BovB retrotransposon entered the ancestral ruminantian genome through horizontal gene transfer. A survey genome screen using 454-pyrosequencing of the Java mouse deer (Tragulus javanicus) and the lesser kudu (Tragelaphus imberbis) was done to investigate and to compare the landscape of transposable elements within Ruminantia. The family Tragulidae (mouse deer) is the only representative of Tragulina and phylogenetically important, because it represents the earliest divergence in Ruminantia. The data analyses show that, relative to other ruminantian species, the lesser kudu genome has seen an expansion of BovB Long INterspersed Elements (LINEs) and BovB related Short INterspersed Elements (SINEs) like BOVA2. In comparison the genome of Java mouse deer has fewer BovB elements than other ruminants, especially Bovinae, and has in addition a novel CHR-3 SINE most likely propagated by LINE-1. By contrast the other ruminants have low amounts of CHR SINEs but high numbers of actively propagating BovB-derived and BovB-propagated SINEs. The survey sequencing data suggest that the transposable element landscape in mouse deer (Tragulina) is unique among Ruminantia, suggesting a lineage specific evolutionary trajectory that does not involve BovB mediated retrotransposition. This shows that the genomic landscape of mobile genetic elements can rapidly change in any lineage. Copyright © 2015 Elsevier B.V. All rights reserved.
Liu, Wenning; Lin, Yaomin; Shi, Frank G.
In pigtailing of a single-mode fiber to a semiconductor laser for optical communication applications, the tolerance for displacement of the fiber relative to the laser is extremely tight: a submicron movement can often lead to a significant misalignment and thus a reduction in the power coupled into the fiber. Among the various fiber pigtailing assembly technologies, pulsed laser welding is the method with submicron accuracy and is the most conducive to automation. However, the melting-solidification process during laser welding can often distort the pre-achieved fiber-optic alignment. This Welding-Induced-Alignment-Distortion (WIAD) is a serious concern and significantly affects the yield of single-mode fiber pigtailing to a semiconductor laser. This work presents a method for predicting WIAD as a function of various processing, laser, tooling and materials parameters. More specifically, the degree of WIAD produced by laser welding in a dual-in-line laser diode package is predicted for the first time. An optimal welding sequence is obtained for minimizing WIAD.
Aniba, Mohamed Radhouene; Poch, Olivier; Thompson, Julie D.
The post-genomic era presents many new challenges for the field of bioinformatics. Novel computational approaches are now being developed to handle the large, complex and noisy datasets produced by high throughput technologies. Objective evaluation of these methods is essential (i) to assure high quality, (ii) to identify strong and weak points of the algorithms, (iii) to measure the improvements introduced by new methods and (iv) to enable non-specialists to choose an appropriate tool. Here, we discuss the development of formal benchmarks, designed to represent the current problems encountered in the bioinformatics field. We consider several criteria for building good benchmarks and the advantages to be gained when they are used intelligently. To illustrate these principles, we present a more detailed discussion of benchmarks for multiple alignments of protein sequences. As in many other domains, significant progress has been achieved in the multiple alignment field and the datasets have become progressively more challenging as the existing algorithms have evolved. Finally, we propose directions for future developments that will ensure that the bioinformatics benchmarks correspond to the challenges posed by the high throughput data. PMID:20639539
González-Domínguez, Jorge; Liu, Yongchao; Touriño, Juan; Schmidt, Bertil
MSAProbs is a state-of-the-art protein multiple sequence alignment tool based on hidden Markov models. It can achieve high alignment accuracy at the expense of relatively long runtimes for large-scale input datasets. In this work we present MSAProbs-MPI, a distributed-memory parallel version of the multithreaded MSAProbs tool that is able to reduce runtimes by exploiting the compute capabilities of common multicore CPU clusters. Our performance evaluation on a cluster with 32 nodes (each containing two Intel Haswell processors) shows reductions in execution time of over one order of magnitude for typical input datasets. Furthermore, MSAProbs-MPI using eight nodes is faster than the GPU-accelerated QuickProbs running on a Tesla K20. Another strong point is that MSAProbs-MPI can deal with large datasets for which MSAProbs and QuickProbs might fail due to time and memory constraints, respectively. Source code in C++ and MPI running on Linux systems as well as a reference manual are available at http://msaprobs.sourceforge.net. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved.
Decap, Dries; Fostier, Jan; Reumers, Joke
elPrep is a high-performance tool for preparing sequence alignment/map files for variant calling in sequencing pipelines. It can be used as a replacement for SAMtools and Picard for preparation steps such as filtering, sorting, marking duplicates, reordering contigs, and so on, while producing identical results. What sets elPrep apart is its software architecture that allows executing preparation pipelines by making only a single pass through the data, no matter how many preparation steps are used in the pipeline. elPrep is designed as a multithreaded application that runs entirely in memory, avoids repeated file I/O, and merges the computation of several preparation steps to significantly speed up the execution time. For example, for a preparation pipeline of five steps on a whole-exome BAM file (NA12878), we reduce the execution time from about 1:40 hours, when using a combination of SAMtools and Picard, to about 15 minutes when using elPrep, while utilising the same server resources, here 48 threads and 23GB of RAM. For the same pipeline on whole-genome data (NA12878), elPrep reduces the runtime from 24 hours to less than 5 hours. As a typical clinical study may contain sequencing data for hundreds of patients, elPrep can remove several hundreds of hours of computing time, and thus substantially reduce analysis time and cost. PMID:26182406
Zou, Quan; Hu, Qinghua; Guo, Maozu; Wang, Guohua
Multiple sequence alignment (MSA) is important work, but bottlenecks arise in the massive MSA of homologous DNA or genome sequences. Most of the available state-of-the-art software tools cannot address large-scale datasets, or they run rather slowly. The similarity of homologous DNA sequences is often ignored. Lack of parallelization is still a challenge for MSA research. We developed two software tools to address the DNA MSA problem. The first employed trie trees to accelerate the centre star MSA strategy. The expected time complexity was decreased from quadratic to linear time. To address large-scale data, parallelism was applied using the Hadoop platform. Experiments demonstrated the performance of our proposed methods, including their running time, sum-of-pairs scores and scalability. Moreover, we supplied two massive DNA/RNA MSA datasets for further testing and research. © The Author 2015. Published by Oxford University Press. All rights reserved.
Iwaszczuk, D.; Stilla, U.
Thermal infrared (TIR) imagery of urban areas has become of interest for urban climate investigations and thermal building inspections. Acquiring the imagery from a flying platform such as a UAV or a helicopter and combining the thermal data with 3D building models via texturing provides valuable groundwork for large-area building inspections. However, such thermal textures are useful for further analysis only if they are extracted with correct geometry. This requires good coregistration between the 3D building models and the thermal images, which cannot be achieved by direct georeferencing. Hence, this paper presents a methodology for the alignment of 3D building models and oblique TIR image sequences taken from a flying platform. In a single image, line correspondences between model edges and image line segments are found using an accumulator approach, and based on these correspondences an optimal camera pose is calculated to ensure the best match between the projected model and the image structures. Across the sequence, the linear features are tracked based on visibility prediction. The results of the proposed methodology are presented using a TIR image sequence taken from a helicopter over a densely built-up urban area. The novelty of this work lies in taking into account the uncertainty of the 3D building models and in an innovative tracking strategy based on a priori knowledge from the 3D building model and on visibility checking.
Itskovich, Valeria; Gontcharov, Andrey; Masuda, Yoshiki; Nohno, Tsutomu; Belikov, Sergey; Efremova, Sofia; Meixner, Martin; Janussen, Dorte
Freshwater sponges include six extant families which belong to the suborder Spongillina (Porifera). The taxonomy of freshwater sponges is problematic and their phylogeny and evolution are not well understood. Sequences of the ribosomal internal transcribed spacers (ITS1 and ITS2) of 11 species from the family Lubomirskiidae, 13 species from the family Spongillidae, and 1 species from the family Potamolepidae were obtained to study the phylogenetic relationships between endemic and cosmopolitan freshwater sponges and the evolution of sponges in Lake Baikal. The present study is the first in which ITS1 sequences were successfully aligned using verified secondary structure models and, in combination with ITS2, used to infer relationships between the freshwater sponges. Phylogenetic trees inferred using maximum likelihood, neighbor-joining, and parsimony methods and Bayesian inference revealed that the endemic family Lubomirskiidae was monophyletic. Our results do not support the monophyly of Spongillidae because Lubomirskiidae formed a robust clade with Ephydatia muelleri, and Trochospongilla latouchiana formed a robust clade with the outgroup Echinospongilla brichardi (Potamolepidae). Within the cosmopolitan family Spongillidae the genera Radiospongilla and Eunapius were found to be monophyletic, while Ephydatia muelleri was basal to the family Lubomirskiidae. The genetic distances between Lubomirskiidae species are much lower than those between Spongillidae species, which is indicative of their relatively recent radiation from a common ancestor. These results indicate that rDNA spacer sequences can be useful for studying the phylogenetic relationships of freshwater sponges and for identifying their species.
Lestari, D.; Bustamam, A.; Novianti, T.; Ardaneswari, G.
A DNA sequence can be defined as a succession of letters representing the order of nucleotides within DNA, using the four base codes adenine (A), guanine (G), cytosine (C), and thymine (T). The precise code of a sequence is determined using DNA sequencing methods and technologies, which have been developed since the 1970s and have now become highly advanced, high-throughput technologies. DNA sequencing has greatly accelerated biological and medical research and discovery. However, in some cases DNA sequencing produces ambiguous results, making it difficult to determine whether a given code is A, T, G, or C. To address this problem, in this study we introduce another representation of DNA codes, namely the Quaternion Q = (PA, PT, PG, PC), where PA, PT, PG, PC are the probabilities that the bases A, T, G, C appear in Q, and PA + PT + PG + PC = 1. Using Quaternion representations, we are able to construct an improved scoring matrix for global sequence alignment by applying a dot product method. This scoring matrix produces better, higher-quality match and mismatch scores between two DNA base codes. In the implementation, we applied the Needleman-Wunsch global sequence alignment algorithm in Octave to analyze our target sequence, which contains some ambiguous sequence data. The subject sequences are DNA sequences of Streptococcus pneumoniae families obtained from GenBank, while the target DNA sequence was received from our collaborator's database. We found that the Quaternion representations improve the quality of the sequence alignment score, and we conclude that the target DNA sequence has maximum similarity with Streptococcus pneumoniae.
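To make the dot-product scoring concrete, the following Python sketch (the abstract's own implementation was in Octave and is not reproduced here) represents each base as a probability vector (PA, PT, PG, PC) and plugs a dot-product score into a standard Needleman-Wunsch recurrence. The names `encode`, `substitution_score` and `nw_score`, the rescaling `2 * dot - 1`, the gap penalty of -1, and the uniform vector used for `N` are assumptions chosen for illustration, not values taken from the study.

```python
# Probability vectors in the order (P_A, P_T, P_G, P_C); each sums to 1.
BASE_VECTORS = {
    "A": (1.0, 0.0, 0.0, 0.0),
    "T": (0.0, 1.0, 0.0, 0.0),
    "G": (0.0, 0.0, 1.0, 0.0),
    "C": (0.0, 0.0, 0.0, 1.0),
    "N": (0.25, 0.25, 0.25, 0.25),  # assumed encoding of a fully ambiguous call
}


def encode(sequence):
    """Turn a string of base codes into a list of probability vectors."""
    return [BASE_VECTORS[base] for base in sequence]


def dot(p, q):
    """Dot product: probability that two uncertain bases are the same."""
    return sum(a * b for a, b in zip(p, q))


def substitution_score(p, q):
    """Assumed rescaling: certain match -> +1, certain mismatch -> -1."""
    return 2.0 * dot(p, q) - 1.0


def nw_score(x, y, gap=-1.0):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [i * gap]
        for j in range(1, len(y) + 1):
            cur.append(max(
                prev[j - 1] + substitution_score(x[i - 1], y[j - 1]),  # align
                prev[j] + gap,      # gap in y
                cur[j - 1] + gap,   # gap in x
            ))
        prev = cur
    return prev[-1]


clean = nw_score(encode("GATTACA"), encode("GATTACA"))
ambiguous = nw_score(encode("ACGT"), encode("ACNT"))
```

With this scoring, an ambiguous base contributes a score between a full match and a full mismatch instead of being forced into one of the two, which is the behaviour the probabilistic representation is meant to provide.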
Yadav, Rohit Kumar; Banka, Haider
In bioinformatics, multiple sequence alignment (MSA) is an NP-hard problem. Hence, nature-inspired techniques can better approximate the solution. In the current study, a novel biogeography-based optimization (NBBO) is proposed to solve the MSA problem. Biogeography-based optimization (BBO) is a new paradigm for optimization, but it has some deficiencies in solving complicated problems, such as low population diversity and a slow convergence rate. NBBO is an enhanced version of BBO in which a new migration operation is proposed to overcome the limitations of BBO. The new migration adopts more information from other habitats, maintains population diversity, and preserves exploitation ability. In the performance analysis, the proposed technique and existing techniques such as VDGA, MOMSA, and GAPAM are tested on publicly available benchmark datasets (i.e., BAliBASE). It has been observed that the proposed method is superior to or competitive with the existing techniques. PMID:27812276
Yang, Tao; Jia, Quanzhang; Guo, Hong; Xu, Jianzhong; Bai, Yun; Yang, Kai; Luo, Fei; Zhang, Zehua; Hou, Tianyong
To investigate the effects of genetic factors on idiopathic scoliosis (IS) and genetic modes through a genetic epidemiological survey of IS in Chongqing City, China, and to determine whether SH3GL1, GADD45B, and FGF22 in chromosome 19p13.3 are the pathogenic genes of IS through genetic sequence analysis. 214 nuclear families were investigated to analyse the age incidence, familial aggregation, and heritability. SH3GL1, GADD45B, and FGF22 were chosen as candidate genes for mutation screening in 56 IS patients from the 214 families. Sequence alignment analysis was performed to determine mutations and predict the protein structure. The average age of onset of 10.8 years suggests that IS is an early-onset disease. Incidences of IS in first-, second-, and third-degree relatives and the overall incidence in families (5.68%) were also significantly higher than that of the general population (1.04%). The U test indicated a significant difference, suggesting that IS shows familial aggregation. The heritability of first-degree relatives (77.68 ±10.39%), second-degree relatives (69.89 ±3.14%), and third-degree relatives (62.14 ±11.92%) illustrated that genetic factors play an important role in IS pathogenesis. The incidence in first-degree relatives (10.01%), second-degree relatives (2.55%) and third-degree relatives (1.76%) illustrated that IS is not in simple accord with monogenic Mendelian inheritance but manifests traits of multifactorial hereditary diseases. Sequence alignment of exons of SH3GL1, GADD45B, and FGF22 showed 17 base mutations, of which 16 do not induce an open reading frame (ORF) shift or amino acid changes, whereas one mutation (C→T) in SH3GL1 results in the formation of a termination codon, which alters the protein reading frame. Prediction analysis of the protein sequence showed that the SH3GL1 mutant encoded a truncated protein, thus affecting the protein structure. IS is a multifactorial genetic disease and SH3GL1 may be one of the
Holland, R C G; Down, T A; Pocock, M; Prlić, A; Huen, D; James, K; Foisy, S; Dräger, A; Yates, A; Heuer, M; Schreiber, M J
BioJava is a mature open-source project that provides a framework for processing of biological data. BioJava contains powerful analysis and statistical routines, tools for parsing common file formats and packages for manipulating sequences and 3D structures. It enables rapid bioinformatics application development in the Java programming language. BioJava is an open-source project distributed under the Lesser GPL (LGPL). BioJava can be downloaded from the BioJava website (http://www.biojava.org). BioJava requires Java 1.5 or higher. All queries should be directed to the BioJava mailing lists. Details are available at http://biojava.org/wiki/BioJava:MailingLists.
Miller, Joshua M; Moore, Stephen S; Stothard, Paul; Liao, Xiaoping; Coltman, David W
Whole genome sequences (WGS) have proliferated as sequencing technology continues to improve and costs decline. While many WGS of model or domestic organisms have been produced, a growing number of non-model species are also being sequenced. In the absence of a reference, construction of a genome sequence necessitates de novo assembly, which may be beyond the ability of many labs due to the large volumes of raw sequence data and the extensive bioinformatics required. In contrast, the presence of a reference WGS allows for alignment, which is more tractable than assembly. Recent work has highlighted that the reference need not come from the same species, potentially enabling a wide array of species WGS to be constructed using cross-species alignment. Here we report on the creation of a draft WGS from a single bighorn sheep (Ovis canadensis) using alignment to the closely related domestic sheep (Ovis aries). Two sequencing libraries on SOLiD platforms yielded over 865 million reads, and combined alignment to the domestic sheep reference resulted in a nearly complete sequence (95% coverage of the reference) at an average of 12x read depth (104 SD). From this we discovered over 15 million variants and annotated them relative to the domestic sheep reference. We then conducted an enrichment analysis of those SNPs showing fixed differences between the reference and the sequenced individual and found significant differences in a number of gene ontology (GO) terms, including those associated with reproduction, muscle properties, and bone deposition. Our results demonstrate that cross-species alignment enables the creation of novel WGS for non-model organisms. The bighorn sheep WGS will provide a resource for future resequencing studies or comparative genomics.
Villard, Pierre; Malausa, Thibaut
SP-Designer is an open-source program providing a user-friendly tool for the design of specific PCR primer pairs from a DNA sequence alignment containing sequences from various taxa. SP-Designer selects PCR primer pairs for the amplification of DNA from a target species on the basis of several criteria: (i) primer specificity, as assessed by interspecific sequence polymorphism in the annealing regions, (ii) the biochemical characteristics of the primers and (iii) the intended PCR conditions. SP-Designer generates tables, detailing the primer pair and PCR characteristics, and a FASTA file locating the primer sequences in the original sequence alignment. SP-Designer is Windows-compatible and freely available from http://www2.sophia.inra.fr/urih/sophia_mart/sp_designer/info_sp_designer.php. © 2013 John Wiley & Sons Ltd.
Knight, Rob; Birmingham, Amanda; Yarus, Michael
BayesFold is a Web application that folds an alignment of closely related sequences and evaluates hypotheses about their shared structure. It uses Bayes's Theorem to combine information from several sources, including chemical mapping (if available), thermodynamic folding, and observed sequence variations. Its method provides a rational basis for integrating results, even when these methods conflict. On a gapped alignment of 86 tRNAPhe sequences each 77 bases long, BayesFold takes 31 sec to perform the calculations; the best structure contained 95% of the base pairs in the true structure, and the true structure was ranked second. Notably, similar results come from random samples of only 10 sequences from the alignment (running time 3 sec), suggesting that remarkably few sequences are required for good results. In contrast, folding single sequences with BayesFold produced structures 9.6 bp different, or with the Vienna package, 13.4 bp different, from the true structure. Similar results were obtained for other families of tRNAs. We especially recommend BayesFold for alignments of 3-50 closely related sequences, such as the sequence families frequently found in SELEX. In addition to providing a convenient way to explore the effects of each of the criteria on the plausibility of different structures, BayesFold also makes it easy to produce publication-quality secondary-structure graphics. The Web interface, available at http://bayes.colorado.edu/fold/, includes the flexibility to thread any of the sequences (or the consensus sequence) through any of the structures, including the one judged most probable.
Alinejad-Rokny, Hamid; Ebrahimi, Diako
The human genome encodes a family of editing enzymes known as APOBEC3 (apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like 3). They induce context-dependent G-to-A changes, referred to as "hypermutation", in the genomes of viruses such as HIV, SIV, HBV and endogenous retroviruses. Hypermutation is characterized by aligning affected sequences to a reference sequence. We show that indels (insertions/deletions) in the sequences lead to an incorrect assignment of APOBEC3-targeted and non-targeted sites. This can result in incorrect identification of hypermutated sequences and in erroneous biological inferences based on hypermutation analysis.
Insertions and deletions (indels) account for more nucleotide differences between two related DNA sequences than substitutions do, and thus it is imperative to develop a stochastic evolutionary model that enables us to reliably calculate the probability of the sequence evolution through indel processes. Recently, indel probabilistic models are mostly based on either hidden Markov models (HMMs) or transducer theories, both of which give the indel component of the probability of a given sequence alignment as a product of either probabilities of column-to-column transitions or block-wise contributions along the alignment. However, it is not a priori clear how these models are related with any genuine stochastic evolutionary model, which describes the stochastic evolution of an entire sequence along the time-axis. Moreover, currently none of these models can fully accommodate biologically realistic features, such as overlapping indels, power-law indel-length distributions, and indel rate variation across regions. Here, we theoretically dissect the ab initio calculation of the probability of a given sequence alignment under a genuine stochastic evolutionary model, more specifically, a general continuous-time Markov model of the evolution of an entire sequence via insertions and deletions. Our model is a simple extension of the general "substitution/insertion/deletion (SID) model". Using the operator representation of indels and the technique of time-dependent perturbation theory, we express the ab initio probability as a summation over all alignment-consistent indel histories. Exploiting the equivalence relations between different indel histories, we find a "sufficient and nearly necessary" set of conditions under which the probability can be factorized into the product of an overall factor and the contributions from regions separated by gapless columns of the alignment, thus providing a sort of generalized HMM. The conditions distinguish evolutionary models with
Mansur, Marco A B; Cardozo, Giovana P; Santos, Elaine V; Marins, Mozart
SNUFER is software for the automatic localization of single nucleotide polymorphisms (SNPs) and the generation of tables used to present them. After input of a FASTA file containing the sequences to be analyzed, a multiple sequence alignment is generated using ClustalW run inside SNUFER. The ClustalW output file is then used to generate a table that displays the SNPs detected in the aligned sequences and their degree of similarity. This table can be exported to Microsoft Word, Microsoft Excel or as a single text file, permitting further editing for publication. The software was written in Delphi 7, with FireBird 2.0 for sequence database management. It is freely available for noncommercial use and can be downloaded from http://www.heranza.com.br/bioinformatica2.htm. PMID:19238196
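The step from an alignment to an SNP table can be illustrated with a minimal sketch. It is not SNUFER's code (which the abstract says was written in Delphi): the function name `find_snps`, the toy alignment, and the choice to ignore gap characters when deciding whether a column is variable are all assumptions made for illustration.

```python
def find_snps(aligned):
    """Return (1-based column, {name: base}) for every variable alignment column.

    `aligned` maps sequence names to equal-length aligned strings.
    Assumption: a gap ('-') alone does not make a column a SNP.
    """
    names = list(aligned)
    length = len(aligned[names[0]])
    if any(len(seq) != length for seq in aligned.values()):
        raise ValueError("aligned sequences must have equal length")

    snps = []
    for col in range(length):
        column = {name: aligned[name][col].upper() for name in names}
        bases = {base for base in column.values() if base != "-"}
        if len(bases) > 1:  # more than one distinct nucleotide in this column
            snps.append((col + 1, column))
    return snps


# Hypothetical toy alignment, e.g. as parsed from an aligner's output.
alignment = {
    "seq1": "ATGCCGTA",
    "seq2": "ATGACGTA",
    "seq3": "ATGCCG-A",
}
snp_table = find_snps(alignment)
for position, column in snp_table:
    print(position, column)
```

Each returned row carries the column position and the base observed in every sequence, which is the information a tabular SNP report needs.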
Bacci, G; Bazzicalupo, M; Benedetti, A; Mengoni, A
Next-generation sequencing technologies are extensively used in the field of molecular microbial ecology to describe taxonomic composition and to infer functionality of microbial communities. In particular, the so-called barcode or metagenetic applications that are based on PCR amplicon library sequencing are very popular at present. One of the problems in using the data from these libraries is the analysis of read quality and the removal (trimming) of low-quality segments while retaining sufficient information for subsequent analyses (e.g. taxonomic assignment). Here, we present StreamingTrim, DNA read trimming software written in Java, with which researchers can analyse the quality of DNA sequences in fastq files and search for low-quality zones in a very conservative way. This software has been developed with the aim of providing a tool capable of trimming amplicon library data while retaining as much taxonomic information as possible. It is equipped with a graphical user interface for ease of use. Moreover, from a computational point of view, StreamingTrim reads and analyses sequences one by one from an input fastq file, without keeping anything in memory, which allows the computation to run on a normal desktop PC or even a laptop. Trimmed sequences are saved in an output file, and a statistics summary is displayed that contains the mean and standard deviation of the length and quality of the whole sequence file. Compiled software, a manual and example data sets are available under the BSD-2-Clause License at the GitHub repository at https://github.com/GiBacci/StreamingTrim/.
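The one-record-at-a-time idea can be sketched as follows. This is an illustration only, not StreamingTrim's algorithm (the tool itself is written in Java, and its trimming rule is not given in the abstract). The sketch assumes four-line fastq records, Phred+33 quality encoding, and a simple rule that cuts trailing bases below a threshold; the names `read_fastq`, `trim_3prime` and `min_q` are hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, qualities) lazily, one record at a time."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        # Assumption: Phred+33 encoding.
        yield header, sequence, [ord(char) - 33 for char in quality]


def trim_3prime(sequence, quals, min_q=20):
    """Drop trailing bases whose quality score is below min_q."""
    end = len(quals)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return sequence[:end], quals[:end]


# Two hypothetical reads; 'I' encodes Q40 and '#' encodes Q2 in Phred+33.
example = io.StringIO("@read1\nACGTACGT\n+\nIIIIII##\n@read2\nGGCC\n+\n####\n")
trimmed = []
for header, sequence, quals in read_fastq(example):
    kept_seq, kept_quals = trim_3prime(sequence, quals)
    trimmed.append((header, kept_seq))
print(trimmed)
```

Because `read_fastq` is a generator, only the current record is held in memory, so memory use does not grow with the size of the input file.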
Ackerman, Sharon H; Tillier, Elisabeth R; Gatti, Domenico L
While the conserved positions of a multiple sequence alignment (MSA) are clearly of interest, non-conserved positions can also be important because, for example, destabilizing effects at one position can be compensated by stabilizing effects at another position. Different methods have been developed to recognize the evolutionary relationship between amino acid sites, and to disentangle functional/structural dependencies from historical/phylogenetic ones. We have used two complementary approaches to test the efficacy of these methods. In the first approach, we have used a new program, MSAvolve, for the in silico evolution of MSAs, which records a detailed history of all covarying positions, and builds a global coevolution matrix as the accumulated sum of individual matrices for the positions forced to co-vary, the recombinant coevolution, and the stochastic coevolution. We have simulated over 1600 MSAs for 8 protein families, which reflect sequences of different sizes and proteins with widely different functions. The calculated coevolution matrices were compared with the coevolution matrices obtained for the same evolved MSAs with different coevolution detection methods. In a second approach we have evaluated the capacity of the different methods to predict close contacts in the representative X-ray structures of an additional 150 protein families using only experimental MSAs. Methods based on the identification of global correlations between pairs were found to be generally superior to methods based only on local correlations in their capacity to identify coevolving residues using either simulated or experimental MSAs. However, the significant variability in the performance of different methods with different proteins suggests that the simulation of MSAs that replicate the statistical properties of the experimental MSA can be a valuable tool to identify the coevolution detection method that is most effective in each case.
Wang, Ping; Hu, Lele; Liu, Guiyou; Jiang, Nan; Chen, Xiaoyun; Xu, Jianyong; Zheng, Wen; Li, Li; Tan, Ming; Chen, Zugen; Song, Hui; Cai, Yu-Dong; Chou, Kuo-Chen
Antimicrobial peptides (AMPs) are a class of natural peptides that form part of the innate immune system, and these 'nature's antibiotics' are quite promising for addressing the problem of increasing antibiotic resistance. It is therefore highly desirable to develop an effective computational method for accurately predicting novel AMPs, because it can provide more candidates and useful insights for drug design. In this study, a new method for predicting AMPs was implemented by integrating a sequence alignment method with a feature selection method. The overall jackknife success rate of the new predictor on a newly constructed benchmark dataset was over 80.23%, and the Matthews correlation coefficient was 0.73, indicating good prediction performance. Moreover, an in-depth feature analysis indicates that the results are consistent with the existing knowledge that some amino acids are preferred in AMPs and that these amino acids play an important role in antimicrobial activity. For the convenience of experimental scientists who want to use the prediction method without following the mathematical details, a user-friendly web server is provided at http://amp.biosino.org/.
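For readers unfamiliar with the reported metric, the Matthews correlation coefficient is computed from the four cells of a binary confusion matrix. The sketch below uses the standard formula; the counts are invented for illustration and are not taken from the study.

```python
import math


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when any marginal total is zero.
    return numerator / denominator if denominator else 0.0


# Hypothetical counts (not from the paper): 100 AMPs and 100 non-AMPs.
mcc = matthews_cc(tp=90, tn=85, fp=15, fn=10)
print(round(mcc, 3))
```

The coefficient ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction), which makes it more informative than accuracy alone when the classes are unbalanced.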
Lewis, D F; Watson, E; Lake, B G
The evolution of the cytochrome P450 (CYP) superfamily is described, with particular reference to major events in the development of biological forms during geological time. It is noted that the currently accepted timescale for the elaboration of the P450 phylogenetic tree exhibits close parallels with the evolution of terrestrial biota. Indeed, the present human P450 complement of xenobiotic-metabolizing enzymes may have originated from coevolutionary 'warfare' between plants and animals during the Devonian period about 400 million years ago. A number of key correspondences between the evolution of the P450 system and the course of biological development over time point to a mechanistic molecular biology of evolution that is consistent with a steady increase in atmospheric oxygenation beginning over 2000 million years ago, whereas dietary changes during more recent geological time may provide one possible explanation for certain species differences in metabolism. Alignment between P450 protein sequences within the same family or subfamily, together with across-family comparisons, aids the rationalization of drug metabolism specificities for different P450 isoforms, and can assist in an understanding of genetic polymorphisms in P450-mediated oxidations at the molecular level. Moreover, the variation in P450 regulatory mechanisms and inducibilities between different mammalian species is likely to have important implications for current procedures of chemical safety evaluation, which rely on pure genetic strains of laboratory-bred rodents for the testing of compounds destined for human exposure. Copyright 1998 Elsevier Science B.V. All rights reserved.
Huo, Hong-Wei; Stojkovic, Vojislav; Xie, Qiao-Luan
Quantum parallelism arises from the ability of a quantum memory register to exist in a superposition of base states. Since the number of possible base states is 2^n, where n is the number of qubits in the quantum memory register, one operation on a quantum computer performs what an exponential number of operations on a classical computer performs. The power of quantum algorithms comes from taking advantage of quantum parallelism, and quantum algorithms can be exponentially faster than classical algorithms. Genetic optimization algorithms are stochastic search algorithms used to search large, nonlinear spaces where expert knowledge is lacking or difficult to encode. QGMALIGN, a probabilistic-coding-based quantum-inspired genetic algorithm for multiple sequence alignment, is presented. A quantum rotation gate is used as a mutation operator to guide the evolution of the quantum state. Six genetic operators are designed on the coding basis to improve the solution during the evolutionary process. The experimental results show that QGMALIGN can compete with popular methods such as CLUSTALX and SAGA, and performs well on the presented biological data. Moreover, the addition of genetic operators to the quantum-inspired algorithm lowers the overall running time.
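A quantum rotation gate, as commonly used in quantum-inspired genetic algorithms, rotates the amplitude pair of one probabilistic bit so that the probability of observing 1 moves toward the value found in the best solution. The sketch below shows only that generic update. How QGMALIGN chooses the rotation angle and direction is not stated in the abstract, so the fixed angle and the names `rotate` and `theta` are assumptions.

```python
import math


def rotate(alpha, beta, theta):
    """Apply a 2x2 rotation to the amplitudes (alpha, beta) of one Q-bit.

    alpha**2 is the probability of observing 0, beta**2 of observing 1;
    the rotation preserves alpha**2 + beta**2 == 1.
    """
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return cos_t * alpha - sin_t * beta, sin_t * alpha + cos_t * beta


# Start from an unbiased Q-bit: both outcomes have probability 0.5.
alpha, beta = 1 / math.sqrt(2), 1 / math.sqrt(2)

# Assumed fixed step of 0.1*pi toward the state |1>.
new_alpha, new_beta = rotate(alpha, beta, 0.1 * math.pi)
print(round(new_beta ** 2, 4))  # probability of observing 1 after the update
```

A positive angle raises the probability of 1 for amplitudes in the first quadrant and a negative angle lowers it, so the sign of `theta` plays the role of the search direction.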
Vouzis, Panagiotis D.; Sahinidis, Nikolaos V.
Motivation: The Basic Local Alignment Search Tool (BLAST) is one of the most widely used bioinformatics tools. The widespread impact of BLAST is reflected in over 53 000 citations that this software has received in the past two decades, and the use of the word ‘blast’ as a verb referring to biological sequence comparison. Any improvement in the execution speed of BLAST would be of great importance in the practice of bioinformatics, and facilitate coping with ever increasing sizes of biomolecular databases. Results: Using a general-purpose graphics processing unit (GPU), we have developed GPU-BLAST, an accelerated version of the popular NCBI-BLAST. The implementation is based on the source code of NCBI-BLAST, thus maintaining the same input and output interface while producing identical results. In comparison to the sequential NCBI-BLAST, the speedups achieved by GPU-BLAST range mostly between 3 and 4. Availability: The source code of GPU-BLAST is freely available at http://archimedes.cheme.cmu.edu/biosoftware.html. Supplementary information: Supplementary data are available at Bioinformatics online. PMID:21088027
Cvicek, Vaclav; Goddard, William A.; Abrol, Ravinder
The understanding of G-protein coupled receptors (GPCRs) is undergoing a revolution due to increased information about their signaling and the experimental determination of structures for more than 25 receptors. The availability of at least one receptor structure for each of the GPCR classes, well separated in sequence space, enables an integrated superfamily-wide analysis to identify signatures involving the role of conserved residues, conserved contacts, and downstream signaling in the context of receptor structures. In this study, we align the transmembrane (TM) domains of all experimental GPCR structures to maximize the conserved inter-helical contacts. The resulting superfamily-wide GpcR Sequence-Structure (GRoSS) alignment of the TM domains for all human GPCR sequences is sufficient to generate a phylogenetic tree that correctly distinguishes all different GPCR classes, suggesting that the class-level differences in the GPCR superfamily are encoded at least partly in the TM domains. The inter-helical contacts conserved across all GPCR classes describe the evolutionarily conserved GPCR structural fold. The corresponding structural alignment of the inactive and active conformations, available for a few GPCRs, identifies activation hot-spot residues in the TM domains that get rewired upon activation. Many GPCR mutations, known to alter receptor signaling and cause disease, are located at these conserved contact and activation hot-spot residue positions. The GRoSS alignment places the chemosensory receptor subfamilies for bitter taste (TAS2R) and pheromones (Vomeronasal, VN1R) in the rhodopsin family, known to contain the chemosensory olfactory receptor subfamily. The GRoSS alignment also enables the quantification of the structural variability in the TM regions of experimental structures, useful for homology modeling and structure prediction of receptors. Furthermore, this alignment identifies structurally and functionally important residues in all human GPCRs
Shah, Hurmat Ali; Hasan, Laiq; Ahmad, Nasir
DNA sequence alignment is a cardinal process in computational biology, but it is computationally expensive when performed on traditional computational platforms such as CPUs. Of the many off-the-shelf platforms explored for speeding up the computation, the FPGA stands out as the best candidate because of its performance per dollar spent and performance per watt. These two advantages make the FPGA the most appropriate choice for realizing the aim of personal genomics. Previous implementations of DNA sequence alignment did not take into consideration the price of the device on which the optimization was performed. This paper presents optimizations over a previous FPGA implementation that improve the overall speed-up achieved as well as the price incurred by the optimized platform. The optimizations are: (1) the array of processing elements is made to run on a change in input value rather than on the clock, eliminating the need for tight clock synchronization; (2) the implementation is unrestrained by the size of the sequences to be aligned; (3) the waiting time required for the sequences to load onto the FPGA is reduced to the minimum possible; and (4) an efficient method is devised for storing the output matrix that makes it possible to save the diagonal elements for use in the next pass, in parallel with the computation of the output matrix. Implemented on a Spartan3 FPGA, this implementation achieved a 20-fold performance improvement in terms of cell updates per second (CUPS) over a general-purpose processor (GPP) implementation.
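Arrays of processing elements suit this problem because cells on the same anti-diagonal of the alignment matrix do not depend on one another. The sketch below is a software illustration of that ordering, not the paper's hardware design. It assumes a Smith-Waterman local alignment with a linear gap penalty, and the scoring values and the function name `sw_score_antidiagonal` are chosen only for the example.

```python
def sw_score_antidiagonal(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score, filling the matrix one anti-diagonal at a time.

    Every cell with i + j == d depends only on cells from diagonals d - 1 and
    d - 2, so all cells of one anti-diagonal could be updated in parallel.
    """
    rows, cols = len(a) + 1, len(b) + 1
    scores = [[0] * cols for _ in range(rows)]
    best = 0
    for d in range(2, rows + cols - 1):  # d = i + j
        for i in range(max(1, d - cols + 1), min(rows, d)):
            j = d - i
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            scores[i][j] = max(
                0,
                scores[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                scores[i - 1][j] + gap,               # gap in b
                scores[i][j - 1] + gap,               # gap in a
            )
            best = max(best, scores[i][j])
    return best


print(sw_score_antidiagonal("GATTACA", "ATTAC"))
```

A throughput figure in CUPS is then the number of matrix cells, `len(a) * len(b)`, divided by the time taken to fill them.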
Wei, Jyh-Da; Cheng, Hui-Jun; Lin, Chun-Yuan; Ye, Jin; Yeh, Kuan-Yu
High-end graphics processing units (GPUs), such as NVIDIA Tesla/Fermi/Kepler series cards with thousands of cores per chip, have been widely applied in high-performance computing over the past decade. These desktop GPU cards must be installed in personal computers or servers with desktop CPUs, and the cost and power consumption of constructing a GPU cluster platform are very high. In recent years, NVIDIA released an embedded board, the Jetson Tegra K1 (TK1), which contains 4 ARM Cortex-A15 CPUs and 192 Compute Unified Device Architecture (CUDA) cores (belonging to the Kepler GPU family). The Jetson Tegra K1 has several advantages, such as low cost, low power consumption, and high applicability, and it has been used in several specific applications. In our previous work, a bioinformatics platform with a single TK1 (STK platform) was constructed; by comparing the STK platform with a desktop CPU and GPU, that work showed that Web and mobile services can be implemented on the STK platform with a good cost-performance ratio. In this work, an embedded GPU cluster platform is constructed with multiple TK1s (MTK platform). Complex system installation and setup are necessary first steps. Then, 2 job assignment modes are designed for the MTK platform to provide services for users. Finally, ClustalW v2.0.11 and ClustalWtk are ported to the MTK platform. The experimental results show that 6 TK1s achieve speedup ratios of 5.5 and 4.8 over a single TK1 for ClustalW v2.0.11 and ClustalWtk, respectively. The MTK platform is shown to be useful for multiple sequence alignments. PMID:28835734
Esmaeilpour, Mansour; Naderifar, Vahideh; Shukur, Zarina
Context: Over the last decade, design patterns have been used extensively to generate reusable solutions to frequently encountered problems in software engineering and object-oriented programming. A design pattern is a repeatable software design solution that provides a template for solving various instances of a general problem. Objective: This paper describes a new method for pattern mining that isolates design patterns and the relationships between them, and a related tool, DLA-DNA, for all implemented patterns and all projects used for evaluation. Based on distributed learning automata (DLA) and deoxyribonucleic acid (DNA) sequence alignment, DLA-DNA achieves acceptable precision and recall compared with the other evaluated tools. Method: The proposed method mines structural design patterns in object-oriented source code and extracts the strong and weak relationships between them, enabling analyzers and programmers to determine the dependency rate of each object, component, and other section of the code for parameter passing and modular programming. The proposed model detects design patterns, and the strengths of their relationships, better than the other available tools, namely Pinot, PTIDEJ and DPJF. Results: The results demonstrate that, whether the source code is built in a standard or non-standard way with respect to the design patterns, the results of the proposed method are close to those of DPJF and better than those of Pinot and PTIDEJ. The proposed model was tested on several source code bases and compared with other related models and available tools; on average, its precision and recall are 20% and 9.6% higher than Pinot's, 27% and 31% higher than PTIDEJ's, and 3.3% and 2% higher than DPJF's, respectively. Conclusion: The proposed method is organized in two steps: in the first, elemental design patterns are identified; in the second, they are composed to recognize actual design patterns. PMID:25243670
Detecting similarities between ligand binding sites in the absence of global homology between target proteins has been recognized as one of the critical components of modern drug discovery. Local binding site alignments can be constructed using sequence order-independent techniques; however, to achieve high accuracy, many current algorithms for binding site comparison require high-quality experimental protein structures, preferably in the bound conformational state. This, in turn, complicates proteome-scale applications, where only structure models of varying quality are available for the majority of gene products. To improve the state of the art, we developed eMatchSite, a new method for constructing sequence order-independent alignments of ligand binding sites in protein models. Large-scale benchmarking calculations using adenine-binding pockets in crystal structures demonstrate that eMatchSite generates accurate alignments for almost three times more protein pairs than SOIPPA. More importantly, eMatchSite offers a high tolerance to structural distortions in ligand binding regions in protein models. For example, the percentage of correctly aligned pairs of adenine-binding sites in weakly homologous protein models is only 4–9% lower than for those aligned using crystal structures. This represents a significant improvement over other algorithms; e.g., the performance of eMatchSite in recognizing similar binding sites is 6% and 13% higher than that of SiteEngine using high- and moderate-quality protein models, respectively. Constructing biologically correct alignments using predicted ligand binding sites in protein models opens up the possibility of investigating drug-protein interaction networks for complete proteomes, with prospective systems-level applications in polypharmacology and rational drug repositioning. eMatchSite is freely available to the academic community as a web-server and a stand-alone software distribution at http://www.brylinski.org/ematchsite. PMID
Lentz, Christine A.
The purpose of this mixed method study was to examine the alignment of the written, enacted, and tested curricula of the Ocean City High School science course sequencing and its impact on student achievement. This study also examined the school's ability to predict student scores on the science portion of the High School Proficiency Assessment (HSPA). Data collected for science achievement included the science portion of the Grade Eight Proficiency Assessment (GEPA) as a pretest and the scores for the science portion of the HSPA as a posttest. Data collected for curriculum alignment included an examination of teacher-generated course curriculum maps to determine the alignment with the New Jersey Core Curriculum Content Standards and the HSPA Test Specifications Directory. The quantitative data were treated through a series of paired samples t-tests; Pearson product moment correlation was used to examine relationships between variables; and an ANCOVA analysis and a stepwise regression analysis were also completed. Based on the findings of the data analysis, the following conclusions were drawn: (1) the alignment of the enacted curriculum with the tested and written curricula affected science achievement; (2) GEPA scores are significantly tied to HSPA scores; and (3) GEPA scores and enrollment in the science sequence whose curriculum was aligned with the written and tested curricula met the requirements of a predictor of scores on the HSPA exam. It is expected that educational leadership will use the results of this research to inform practice and drive decision-making with respect to student placement into course sequences. It is hoped that the results will not only increase support for the district's curricula development plan but also add to the overall body of knowledge surrounding science program effectiveness in relation to the No Child Left Behind standards.
Boyce, Richard; Chilana, Parmit; Rose, Timothy M.
PCR amplification using COnsensus DEgenerate Hybrid Oligonucleotide Primers (CODEHOPs) has proven to be highly effective for identifying unknown pathogens and characterizing novel genes. We describe iCODEHOP, a new interactive web application that simplifies the process of designing and selecting CODEHOPs from multiply-aligned protein sequences. iCODEHOP intelligently guides the user through the degenerate primer design process, including uploading sequences, creating a multiple alignment, deriving CODEHOPs and calculating their annealing temperatures. The user can quickly scan over an entire set of degenerate primers designed by the program to assess their relative quality and select individual primers for further analysis. The program displays phylogenetic information for input sequences and allows the user to easily design new primers from selected sequence sub-clades. It also allows the user to bias primer design to favor specific clades or sequences using sequence weights. iCODEHOP is freely available to all interested researchers at https://icodehop.cphi.washington.edu/i-codehop-context/Welcome. PMID:19443442
Nguyen, Tung; Shi, Weisong; Ruden, Douglas
Research in genetics has developed rapidly recently due to the aid of next generation sequencing (NGS). However, massively-parallel NGS produces enormous amounts of data, which leads to storage, compatibility, scalability, and performance issues. The Cloud Computing and MapReduce framework, which utilizes hundreds or thousands of shared computers to map sequencing reads quickly and efficiently to reference genome sequences, appears to be a very promising solution for these issues. Consequently, it has been adopted by many organizations recently, and the initial results are very promising. However, since these are only initial steps toward this trend, the developed software does not provide adequate primary functions like bisulfite, pair-end mapping, etc., in on-site software such as RMAP or BS Seeker. In addition, existing MapReduce-based applications were not designed to process the long reads produced by the most recent second-generation and third-generation NGS instruments and, therefore, are inefficient. Last, it is difficult for a majority of biologists untrained in programming skills to use these tools because most were developed on Linux with a command line interface. To urge the trend of using Cloud technologies in genomics and prepare for advances in second- and third-generation DNA sequencing, we have built a Hadoop MapReduce-based application, CloudAligner, which achieves higher performance, covers most primary features, is more accurate, and has a user-friendly interface. It was also designed to be able to deal with long sequences. The performance gain of CloudAligner over Cloud-based counterparts (35 to 80%) mainly comes from the omission of the reduce phase. In comparison to local-based approaches, the performance gain of CloudAligner is from the partition and parallel processing of the huge reference genome as well as the reads. The source code of CloudAligner is available at http://cloudaligner.sourceforge.net/ and its web version is at http
Pightling, Arthur W.; Petronella, Nicholas; Pagotto, Franco
The wide availability of whole-genome sequencing (WGS) and an abundance of open-source software have made detection of single-nucleotide polymorphisms (SNPs) in bacterial genomes an increasingly accessible and effective tool for comparative analyses. Thus, ensuring that real nucleotide differences between genomes (i.e., true SNPs) are detected at high rates and that the influences of errors (such as false positive SNPs, ambiguously called sites, and gaps) are mitigated is of utmost importance. The choices researchers make regarding the generation and analysis of WGS data can greatly influence the accuracy of short-read sequence alignments and, therefore, the efficacy of such experiments. We studied the effects of some of these choices, including: i) depth of sequencing coverage, ii) choice of reference-guided short-read sequence assembler, iii) choice of reference genome, and iv) whether to perform read-quality filtering and trimming, on our ability to detect true SNPs and on the frequencies of errors. We performed benchmarking experiments, during which we assembled simulated and real Listeria monocytogenes strain 08-5578 short-read sequence datasets of varying quality with four commonly used assemblers (BWA, MOSAIK, Novoalign, and SMALT), using reference genomes of varying genetic distances, and with or without read pre-processing (i.e., quality filtering and trimming). We found that assemblies of at least 50-fold coverage provided the most accurate results. In addition, MOSAIK yielded the fewest errors when reads were aligned to a nearly identical reference genome, while using SMALT to align reads against a reference sequence that is ∼0.82% distant from 08-5578 at the nucleotide level resulted in the detection of the greatest numbers of true SNPs and the fewest errors. Finally, we show that whether read pre-processing improves SNP detection depends upon the choice of reference sequence and assembler. In total, this study demonstrates that researchers should
Smith, Harold E; Yun, Sijung
Whole-genome sequencing is a powerful tool for analyzing genetic variation on a global scale. One particularly useful application is the identification of mutations obtained by classical phenotypic screens in model species. Sequence data from the mutant strain is aligned to the reference genome, and then variants are called to generate a list of candidate alleles. A number of software pipelines for mutation identification have been targeted to C. elegans, with particular emphasis on ease of use, incorporation of mapping strain data, subtraction of background variants, and similar criteria. Although success is predicated upon the sensitive and accurate detection of candidate alleles, relatively little effort has been invested in evaluating the underlying software components that are required for mutation identification. Therefore, we have benchmarked a number of commonly used tools for sequence alignment and variant calling, in all pair-wise combinations, against both simulated and actual datasets. We compared the accuracy of those pipelines for mutation identification in C. elegans, and found that the combination of BBMap for alignment plus FreeBayes for variant calling offers the most robust performance. PMID:28333980
Mehta, P. K.; Heringa, J.; Argos, P.
To improve secondary structure predictions in protein sequences, the information residing in multiple sequence alignments of substituted but structurally related proteins is exploited. A database comprised of 70 protein families and a total of 2,500 sequences, some of which were aligned by tertiary structural superpositions, was used to calculate residue exchange weight matrices within alpha-helical, beta-strand, and coil substructures, respectively. Secondary structure predictions were made based on the observed residue substitutions in local regions of the multiple alignments and the largest possible associated exchange weights in each of the three matrix types. Comparison of the observed and predicted secondary structure on a per-residue basis yielded a mean accuracy of 72.2%. Individual alpha-helix, beta-strand, and coil states were respectively predicted at 66.7, and 75.8% correctness, representing a well-balanced three-state prediction. The accuracy level, verified by cross-validation through jack-knife tests on all protein families, dropped, on average, to only 70.9%, indicating the rigor of the prediction procedure. On the basis of robustness, conceptual clarity, accuracy, and executable efficiency, the method has considerable advantage, especially with its sole reliance on amino acid substitutions within structurally related proteins. PMID:8580842
Zhou, Jie; Zhong, Pianyu; Zhang, Tinghui
Determination of sequence similarity is one of the major steps in computational phylogenetic studies. One of the major tasks of computational biologists is to develop novel mathematical descriptors for similarity analysis. DNA clustering is an important technology that automatically identifies inherent relationships among large-scale DNA sequences. The comparison between the DNA sequences of different species helps determine phylogenetic relationships among species. Alignment-free approaches have continuously gained interest in various sequence analysis applications such as phylogenetic inference and metagenomic classification/clustering, particularly for large-scale sequence datasets. Here, we construct a novel and simple mathematical descriptor based on the characterization of cis sequence complex DNA networks. This new approach is based on a code of three cis nucleotides in a gene that could code for an amino acid. In particular, for each DNA sequence, we will set up a cis sequence complex network that will be used to develop a characterization vector for the analysis of mitochondrial DNA sequence phylogenetic relationships among nine species. The resulting phylogenetic relationships among the nine species were determined to be in agreement with the actual situation. PMID:27746676
Criel, Jo; Tsiporkova, Elena
An application tool for alignment, template matching and visualization of gene expression time series is presented. The core algorithm is based on dynamic time warping techniques used in the speech recognition field. These techniques allow for non-linear (elastic) alignment of temporal sequences of feature vectors and consequently enable detection of similar shapes with different phases. The Java program, examples and a tutorial are available at http://www.psb.ugent.be/cbd/papers/gentxwarper/
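As an illustration of the dynamic time warping idea mentioned above, the following is a minimal sketch and not the tool's actual implementation. It assumes scalar expression values rather than feature vectors, uses absolute difference as the local cost, and the function name `dtw_distance` is hypothetical.

```python
from math import inf


def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping between two numeric series.

    cost[i][j] holds the cheapest cumulative cost of aligning a[:i] with b[:j].
    """
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local distance
            # A step may stretch either series or advance both (elastic alignment).
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


# Two profiles with the same shape but a shifted phase align at zero cost.
profile_a = [0, 1, 2, 1, 0]
profile_b = [0, 0, 1, 2, 1, 0]
print(dtw_distance(profile_a, profile_b))
```

Because the warping path may repeat points of either series, a delayed copy of a profile is matched without penalty, which is the property that lets such a method detect similar shapes with different phases.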
...required to write a parallel Java program. 2. Manta. Manta is a native Java compiler that compiles Java source code to x86 executables with a...competitive goal to be faster than other current Java implementations, such as JavaParty. Although Manta uses a “highly efficient” RMI implementation, it
Ortuño, Francisco M; Valenzuela, Olga; Rojas, Fernando; Pomares, Hector; Florido, Javier P; Urquiza, Jose M; Rojas, Ignacio
Multiple sequence alignments (MSAs) are widely used in bioinformatics as a basis for other tasks such as structure prediction, biological function analysis or phylogenetic modeling. However, current tools usually provide only partially optimal alignments, as each one is focused on specific biological features. Thus, the same set of sequences can produce different alignments, especially when sequences are less similar. Consequently, researchers and biologists do not agree about which is the most suitable way to evaluate MSAs. Recent evaluations tend to use more complex scores that include further biological features. Among them, 3D structures are increasingly being used to evaluate alignments. Because structures are more conserved in proteins than sequences, scores with structural information are better suited to evaluate more distant relationships between sequences. The proposed multiobjective algorithm, based on the non-dominated sorting genetic algorithm, aims to jointly optimize three objectives: STRIKE score, non-gaps percentage and totally conserved columns. It was assessed on the BAliBASE benchmark, with significance evaluated by the Kruskal-Wallis test (P < 0.01). This algorithm also outperforms other aligners, such as ClustalW, Multiple Sequence Alignment Genetic Algorithm (MSA-GA), PRRP, DIALIGN, Hidden Markov Model Training (HMMT), Pattern-Induced Multi-sequence Alignment (PIMA), MULTIALIGN, Sequence Alignment Genetic Algorithm (SAGA), PILEUP, Rubber Band Technique Genetic Algorithm (RBT-GA) and Vertical Decomposition Genetic Algorithm (VDGA), according to the Wilcoxon signed-rank test (P < 0.05), whereas its results are not significantly different from those of 3D-COFFEE (P > 0.05), with the advantage of being able to use fewer structures. Structural information is included within the objective function to evaluate the obtained alignments more accurately. The source code is available at http://www.ugr.es/~fortuno/MOSAStrE/MO-SAStrE.zip.
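Two of the three objectives named above can be computed directly from the alignment, and the sketch below shows one plausible way to do so. It is an illustration under assumptions and not the MO-SAStrE code: the function names are hypothetical, the gap symbol is assumed to be "-", and the STRIKE score is omitted because it needs structural data. The `dominates` helper shows the Pareto comparison on which non-dominated sorting rests, assuming all objectives are maximized.

```python
def non_gap_percentage(alignment, gap="-"):
    """Percentage of alignment cells that are residues rather than gaps."""
    total = sum(len(row) for row in alignment)
    non_gaps = sum(ch != gap for row in alignment for ch in row)
    return 100.0 * non_gaps / total


def totally_conserved_columns(alignment, gap="-"):
    """Number of columns in which every sequence carries the same residue."""
    count = 0
    for column in zip(*alignment):
        if gap not in column and len(set(column)) == 1:
            count += 1
    return count


def dominates(a, b):
    """True if objective vector a is at least as good as b everywhere
    and strictly better somewhere (all objectives maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


# Toy alignment: three equal-length rows, invented for this example.
msa = ["MKT-LV", "MKTALV", "MRT-LV"]
print(non_gap_percentage(msa), totally_conserved_columns(msa))
```

A candidate alignment would then be kept in the first front if no other candidate dominates its objective vector.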
Etherington, Graham J; Ramirez-Gonzalez, Ricardo H; MacLean, Dan
bio-samtools is a Ruby language interface to SAMtools, the highly popular library that provides utilities for manipulating high-throughput sequence alignments in the Sequence Alignment/Map format. Advances in Ruby now allow us to improve the analysis capabilities and increase bio-samtools utility, allowing users to accomplish a large amount of analysis using a very small amount of code. bio-samtools can also be easily developed to include additional SAMtools methods and hence stay current with the latest SAMtools releases. We have added new Ruby classes for the MPileup and Variant Call Format (VCF) data formats emitted by SAMtools and introduced more analysis methods for variant analysis, including alternative allele calculation and allele frequency calling for SNPs. Our new implementation of bio-samtools also ensures that all the functionality of the SAMtools library is now supported and that bio-samtools can be easily extended to include future changes in SAMtools. bio-samtools 2 also provides methods that allow the user to directly produce visualization of alignment data. © The Author 2015. Published by Oxford University Press. All rights reserved.
Notredame, Cedric [Centre for Genomic Regulation]
Cedric Notredame from the Centre for Genomic Regulation gives a presentation on "New Challenges of the Computation of Multiple Sequence Alignments in the High-Throughput Era" at the JGI/Argonne HPC Workshop on January 26, 2010.
Djakpa, Helene; Kulkarni, Aditya; Barrows-Murphy, Scheneque; Miller, Greg; Zhou, Weihong; Cho, Hyejin; Török, Béla; Stieglitz, Kimberly
Phospholipase D enzymes cleave phospholipid substrates generating choline and phosphatidic acid. Phospholipase D from Streptomyces chromofuscus is a non-HKD (histidine, lysine, and aspartic acid) phospholipase D as the enzyme is more similar to members of the diverse family of metallo-phosphodiesterase/phosphatase enzymes than phospholipase D enzymes with active site HKD repeats. A highly efficient library of phospholipase D inhibitors based on 1,3-disubstituted-4-amino-pyrazolopyrimidine core structure was utilized to evaluate the inhibition of purified S. chromofuscus phospholipase D. The molecules exhibited inhibition of phospholipase D activity (IC50) in the nanomolar range with monomeric substrate diC4PC and micromolar range with phospholipid micelles and vesicles. Binding studies with vesicle substrate and phospholipase D strongly indicate that these inhibitors directly block enzyme vesicle binding. Following these compelling results as a starting point, sequence searches and alignments with S. chromofuscus phospholipase D have identified potential new drug targets. Using AutoDock, inhibitors were docked into the enzymes selected from sequence searches and alignments (when 3D co-ordinates were available) and results analyzed to develop next-generation inhibitors for new targets. In vitro enzyme activity assays with several human phosphatases demonstrated that the predictive protocol was accurate. The strategy of combining sequence comparison, docking, and high-throughput screening assays has helped to identify new drug targets and provided some insight into how to make potential inhibitors more specific to desired targets.
Kapp, O. H.; Moens, L.; Vanfleteren, J.; Trotman, C. N.; Suzuki, T.; Vinogradov, S. N.
Seven hundred globin sequences, including 146 nonvertebrate sequences, were aligned on the basis of conservation of secondary structure and the avoidance of gap penalties. Of the 182 positions needed to accommodate all the globin sequences, only 84 are common to all, including the absolutely conserved PheCD1 and HisF8. The mean number of amino acid substitutions per position ranges from 8 to 13 for all globins and 5 to 9 for internal positions. Although the total sequence volumes vary by approximately 2-3%, the variation in volume per position ranges from approximately 13% for the internal to approximately 21% for the surface positions. Plausible correlations exist between amino acid substitution and the variation in volume per position for the 84 common and the internal but not the surface positions. The amino acid substitution matrix derived from the 84 common positions was used to evaluate sequence similarity within the globins and between the globins and phycocyanins C and colicins A, via calculation of pairwise similarity scores. The scores for globin-globin comparisons over the 84 common positions overlap the globin-phycocyanin and globin-colicin scores, with the former being intermediate. For the subset of internal positions, overlap is minimal between the three groups of scores. These results imply a continuum of amino acid sequences able to assume the common three-on-three alpha-helical structure and suggest that the determinants of the latter include sites other than those inaccessible to solvent. PMID:8535255
Yamada, Kazunori D.; Tomii, Kentaro; Katoh, Kazutaka
Motivation: Large multiple sequence alignments (MSAs), consisting of thousands of sequences, are becoming more and more common, due to advances in sequencing technologies. The MAFFT MSA program has several options for building large MSAs, but their performances have not been sufficiently assessed yet, because realistic benchmarking of large MSAs has been difficult. Recently, such assessments have been made possible through the HomFam and ContTest benchmark protein datasets. Along with the development of these datasets, an interesting theory was proposed: chained guide trees increase the accuracy of MSAs of structurally conserved regions. This theory challenges the basis of progressive alignment methods and needs to be examined by being compared with other known methods including computationally intensive ones. Results: We used HomFam, ContTest and OXFam (an extended version of OXBench) to evaluate several methods enabled in MAFFT: (1) a progressive method with approximate guide trees, (2) a progressive method with chained guide trees, (3) a combination of an iterative refinement method and a progressive method and (4) a less approximate progressive method that uses a rigorous guide tree and consistency score. Other programs, Clustal Omega and UPP, available for large MSAs, were also included into the comparison. The effect of method 2 (chained guide trees) was positive in ContTest but negative in HomFam and OXFam. Methods 3 and 4 increased the benchmark scores more consistently than method 2 for the three datasets, suggesting that they are safer to use. Availability and Implementation: http://mafft.cbrc.jp/alignment/software/ Supplementary information: Supplementary data are available at Bioinformatics online. PMID:27378296
Pierri, Ciro Leonardo; Parisi, Giovanni; Porcelli, Vito
The functional characterization of proteins represents a daily challenge for biochemical, medical and computational sciences. Although it must ultimately be proved on the bench, the function of a protein can be successfully predicted by computational approaches that drive further experimental assays. Current methods for comparative modeling allow the construction of accurate 3D models for proteins of unknown structure, provided that a crystal structure of a homologous protein is available. Binding regions can be proposed by using binding site predictors, data inferred from homologous crystal structures, and data provided from a careful interpretation of the multiple sequence alignment of the investigated protein and its homologs. Once the location of a binding site has been proposed, chemical ligands that have a high likelihood of binding can be identified by using ligand docking and structure-based virtual screening of chemical libraries. Most docking algorithms produce a list of ligands sorted by the energy of the lowest-energy docking configuration found for each ligand in the library. This review presents the state of the art of computational approaches to 3D protein comparative modeling and to the study of protein-ligand interactions. Furthermore, a possible combined multistep strategy for protein function prediction, based on multiple sequence alignment, comparative modeling, binding region prediction, and structure-based virtual screening of chemical libraries, is described using suitable examples. As practical examples, Abl-kinase molecular modeling studies, HPV-E6 protein multiple sequence alignment analysis, and some other model docking-based characterization reports are briefly described to highlight the importance of computational approaches in protein function prediction.
Gonçalves, Joana P; Moreau, Yves; Madeira, Sara C
Transcription Factors (TFs) control transcription by binding to specific sites in the promoter regions of the target genes, which can be modelled by structured motifs. In this paper we propose AliBiMotif, a method combining sequence alignment and a biclustering approach based on efficient string matching techniques using suffix trees to unravel approximately conserved sets of blocks (structured motifs) while straightforwardly disregarding non-conserved stretches in-between. The ability to ignore the width of non-conserved regions is a major advantage of the proposed method over other motif finders, as the lengths of the binding sites are usually easier to estimate than the separating distances.
Gillespie, Joseph J; Yoder, Matthew J; Wharton, Robert A
We utilize the secondary structural properties of the 28S rRNA D2-D10 expansion segments to hypothesize a multiple sequence alignment for major lineages of the hymenopteran superfamily Ichneumonoidea (Braconidae, Ichneumonidae). The alignment consists of 290 sequences (originally analyzed in Belshaw and Quicke, Syst Biol 51:450-477, 2002) and provides the first global alignment template for this diverse group of insects. Predicted structures for these expansion segments as well as for over half of the 18S rRNA are given, with highly variable regions characterized and isolated within conserved structures. We demonstrate several pitfalls of optimization alignment and illustrate how these are potentially addressed with structure-based alignments. Our global alignment is presented online at (http://hymenoptera.tamu.edu/rna) with summary statistics, such as basepair frequency tables, along with novel tools for parsing structure-based alignments into input files for most commonly used phylogenetic software. These resources will be valuable for hymenopteran systematists, as well as researchers utilizing rRNA sequences for phylogeny estimation in any taxon. We explore the phylogenetic utility of our structure-based alignment by examining a subset of the data under a variety of optimality criteria using results from Belshaw and Quicke (2002) as a benchmark.
Phinney, Eric J.; Mann, Paul; Coffin, Millard F.; Shipley, Thomas H.
The Ontong Java Plateau (OJP) is the largest and thickest oceanic plateau on Earth and one of the few oceanic plateaus actively converging on an island arc. We present velocity determinations and geologic interpretation of 2000 km of two-dimensional, multi-channel seismic data from the southwestern Ontong Java Plateau, North Solomon Trench, and northern Solomon Islands. We recognize three megasequences, ranging in age from early Cretaceous to Quaternary, on the basis of distinct interval velocities and seismic stratigraphic facies. Megasequence OJ1 is early Cretaceous, upper igneous crust of the OJP and correlates with basalt outcrops dated at 122-125 Ma on the island of Malaita. The top of the overlying megasequence OJ2, a late Cretaceous mudstone unit, had been identified by previous workers as the top of igneous basement. Seismic facies and correlation to distant Deep Sea Drilling Project/Ocean Drilling Program sites indicate that OJ2 was deposited in a moderately low-energy, marine environment near a fluctuating carbonate compensation depth that resulted in multiple periods of dissolution. OJ2 thins south of the Stewart Arch onto the Solomon Islands where it is correlated with the Kwaraae Mudstone Formation. Megasequence OJ3 is late Cretaceous through Quaternary pelagic cover which caps the Ontong Java Plateau; it thickens into the North Solomon Trench, and seismic facies suggest that OJ3 was deposited in a low-energy marine environment. We use seismic facies analysis, sediment thickness, structural observations, and quantitative plate reconstructions of the position of the OJP and Solomon Islands to propose a tectonic, magmatic, and sedimentary history of the southwestern Ontong Java Plateau. Prior to 125 Ma late Jurassic and early Cretaceous oceanic crust formed. From 125 to 122 Ma, the first mantle plume formed igneous crust (OJ1). Between 122 and 92 Ma, marine mudstone (OJ2 and Kwaraae mudstone of Malaita, Solomon Islands) was deposited on Ontong Java
Little attention has been given to failed, poorly-performing, and non-polymorphic expressed sequence tag (EST) simple sequence repeat (SSR) primers. This is due in part to a lack of interest and value in reporting them but also because of the difficulty in addressing the causes of failure on a prime...
Barry, Matthew R.; Osborne, Richard N.
A computer program called "Rational Sequence" generates Unified Modeling Language (UML) sequence diagrams of a target Java program running on a Java virtual machine (JVM). Rational Sequence thereby performs a reverse engineering function that aids in the design documentation of the target Java program. Whereas the construction of sequence diagrams was previously a tedious manual process, Rational Sequence generates UML sequence diagrams automatically from the running Java code.
Salavert Torres, José; Blanquer Espert, Ignacio; Domínguez, Andrés Tomás; Hernández García, Vicente; Medina Castelló, Ignacio; Tárraga Giménez, Joaquín; Dopazo Blázquez, Joaquín
General Purpose Graphic Processing Units (GPGPUs) constitute an inexpensive resource for computing-intensive applications that could exploit an intrinsic fine-grain parallelism. This paper presents the design and implementation in GPGPUs of an exact alignment tool for nucleotide sequences based on the Burrows-Wheeler Transform. We compare this algorithm with state-of-the-art implementations of the same algorithm over standard CPUs, and considering the same conditions in terms of I/O. Excluding disk transfers, the implementation of the algorithm in GPUs shows a speedup larger than 12, when compared to CPU execution. This implementation exploits the parallelism by concurrently searching different sequences on the same reference search tree, maximizing memory locality and ensuring a symmetric access to the data. The paper describes the behavior of the algorithm in GPU, showing a good scalability in the performance, only limited by the size of the GPU inner memory.
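To make the underlying technique concrete, the following is a small CPU-only sketch of exact matching by backward search over a Burrows-Wheeler Transform. It is not the GPGPU implementation described above: the function names `bwt_index` and `exact_match` are hypothetical, the suffix array is built by naive sorting (suitable only for tiny inputs), and "$" is assumed as a terminator that does not occur in the text.

```python
def bwt_index(text):
    """Build the suffix array plus the C and Occ tables used by backward search."""
    text += "$"  # terminator, assumed absent from the input and smallest in order
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive suffix array
    bwt = "".join(text[i - 1] for i in sa)  # i == 0 wraps around to "$"
    alphabet = sorted(set(text))
    # C[c]: number of characters in the text that are strictly smaller than c
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, C, occ


def exact_match(pattern, sa, C, occ):
    """Return the sorted start positions of all exact occurrences of pattern."""
    lo, hi = 0, len(sa)  # half-open interval of matching suffix-array rows
    for ch in reversed(pattern):
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


sa, C, occ = bwt_index("ACGTACGA")
print(exact_match("ACG", sa, C, occ))
```

Each query reads only the shared, read-only index tables, so many reads can be searched independently against the same reference; that independence is what a data-parallel implementation can exploit.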
Chen, Yue; Chen, Wei; Cobb, Melanie H; Zhao, Yingming
We present sequence alignment software, called PTMap, for the accurate identification of full-spectrum protein post-translational modifications (PTMs) and polymorphisms. The software incorporates several features to improve searching speed and accuracy, including peak selection, adjustment of inaccurate mass shifts, and precise localization of PTM sites. PTMap also automates rules, based mainly on unmatched peaks, for manual verification of identified peptides. To evaluate the quality of sequence alignment, we developed a scoring system that takes into account both matched and unmatched peaks in the mass spectrum. Incorporation of these features dramatically increased both accuracy and sensitivity of the peptide- and PTM-identifications. To our knowledge, PTMap is the first algorithm that emphasizes unmatched peaks to eliminate false positives. The superior performance and reliability of PTMap were demonstrated by confident identification of PTMs on 156 peptides from four proteins and validated by MS/MS of the synthetic peptides. Our results demonstrate that PTMap is a powerful algorithm capable of identification of all possible protein PTMs with high confidence.
SeqAPASS is a software application that facilitates rapid and streamlined, yet transparent, comparisons of the similarity of toxicologically-significant molecular targets across species. The present application facilitates analysis of primary amino acid sequence similarity (including ...
Lemey, Philippe; Lott, Martin; Martin, Darren P; Moulton, Vincent
Background: Recombination has a profound impact on the evolution of viruses, but characterizing recombination patterns in molecular sequences remains a challenging endeavor. Despite its importance in molecular evolutionary studies, identifying the sequences that exhibit such patterns has received comparatively little attention in the recombination detection framework. Here, we extend a quartet-mapping based recombination detection method to enable identification of recombinant sequences without prior specification of either query or reference sequences. Through simulations we evaluate different recombinant identification statistics and significance tests. We compare the quartet approach with triplet-based methods that employ additional heuristic tests to identify parental and recombinant sequences. Results: Analysis of phylogenetic simulations reveals that identifying the descendants of relatively old recombination events is a challenging task for all methods available, and that quartet scanning performs relatively well compared to the triplet-based methods. The use of quartet scanning is further demonstrated by analyzing both well-established and putative HIV-1 recombinant strains. In agreement with recent findings, we provide evidence that the presumed circulating recombinant CRF02_AG is a 'pure' lineage, whereas the presumed parental lineage subtype G has a recombinant origin. We also demonstrate HIV-1 intrasubtype recombination, confirm the hybrid origin of SIV in chimpanzees and further disentangle the recombinant history of SIV lineages in a primate immunodeficiency virus data set. Conclusion: Quartet scanning makes a valuable addition to triplet-based methods for identifying recombinant sequences without prior specification of either query or reference sequences. The new method is available in the VisRD v.3.0 package. PMID:19397803
Wolfsheimer, Stefan; Herms, Inke; Rahmann, Sven; Hartmann, Alexander K
Molecular database search tools need statistical models to assess the significance of the resulting hits. In the classical approach one asks how probable it is that a certain score is observed by pure chance. Asymptotic theories for such questions are available for two random i.i.d. sequences. Some effort has been made to include effects of finite sequence lengths and to account for specific compositions of the sequences. In many applications, such as a large-scale database homology search for transmembrane proteins, these models are not the most appropriate ones. Search sensitivity and specificity benefit from position-dependent scoring schemes or use of Hidden Markov Models. Additionally, one may wish to go beyond the assumption that the sequences are i.i.d. Despite their practical importance, the statistical properties of these settings have not been well investigated yet. In this paper, we discuss an efficient and general method to compute the score distribution to any desired accuracy. The general approach may be applied to different sequence models and various similarity measures that satisfy a few weak assumptions. We have access to the low-probability region ("tail") of the distribution where scores are larger than expected by pure chance and therefore relevant for practical applications. Our method uses recent ideas from rare-event simulations, combining Markov chain Monte Carlo simulations with importance sampling and generalized ensembles. We present results for the score statistics of fixed and random queries against random sequences. In a second step, we extend the approach to a model of transmembrane proteins, which can hardly be described as i.i.d. sequences. For this case, we compare the statistical properties of a fixed query model as well as a hidden Markov sequence model in connection with a position-based scoring scheme against the classical approach. The results illustrate that the sensitivity and specificity strongly depend on the
Background: Molecular database search tools need statistical models to assess the significance of the resulting hits. In the classical approach one asks how probable it is that a certain score is observed by pure chance. Asymptotic theories for such questions are available for two random i.i.d. sequences. Some effort has been made to include effects of finite sequence lengths and to account for specific compositions of the sequences. In many applications, such as a large-scale database homology search for transmembrane proteins, these models are not the most appropriate ones. Search sensitivity and specificity benefit from position-dependent scoring schemes or use of Hidden Markov Models. Additionally, one may wish to go beyond the assumption that the sequences are i.i.d. Despite their practical importance, the statistical properties of these settings have not been well investigated yet. Results: In this paper, we discuss an efficient and general method to compute the score distribution to any desired accuracy. The general approach may be applied to different sequence models and various similarity measures that satisfy a few weak assumptions. We have access to the low-probability region ("tail") of the distribution where scores are larger than expected by pure chance and therefore relevant for practical applications. Our method uses recent ideas from rare-event simulations, combining Markov chain Monte Carlo simulations with importance sampling and generalized ensembles. We present results for the score statistics of fixed and random queries against random sequences. In a second step, we extend the approach to a model of transmembrane proteins, which can hardly be described as i.i.d. sequences. For this case, we compare the statistical properties of a fixed query model as well as a hidden Markov sequence model in connection with a position-based scoring scheme against the classical approach. Conclusions: The results illustrate that the sensitivity and specificity
Lee, Aaron Y; Lee, Cecilia S; Van Gelder, Russell N
Next generation sequencing technology has enabled characterization of metagenomics through massively parallel genomic DNA sequencing. The complexity and diversity of environmental samples such as the human gut microflora, combined with the sustained exponential growth in sequencing capacity, have led to the challenge of identifying microbial organisms by DNA sequence. We sought to validate a Scalable Metagenomics Alignment Research Tool (SMART), a novel searching heuristic for shotgun metagenomics sequencing results. After retrieving all genomic DNA sequences from the NCBI GenBank, over 1 × 10^11 base pairs of 3.3 × 10^6 sequences from 9.25 × 10^5 species were indexed using 4 base pair hashtable shards. A MapReduce searching strategy was used to distribute the search workload in a computing cluster environment. In addition, a one base pair permutation algorithm was used to account for single nucleotide polymorphisms and sequencing errors. Simulated datasets used to evaluate Kraken, a similar metagenomics classification tool, were used to measure and compare precision and accuracy. Finally, using the same set of training sequences, we compared Kraken, CLARK, and SMART within the same computing environment. Utilizing 12 computational nodes, we completed the classification of all datasets in under 10 min each using exact matching with an average throughput of over 1.95 × 10^6 reads classified per minute. With permutation matching, we achieved sensitivity greater than 83% and precision greater than 94% with simulated datasets at the species classification level. We demonstrated the application of this technique to conjunctival and gut microbiome metagenomics sequencing results. In our head-to-head comparison, SMART and CLARK had similar accuracy gains over Kraken at the species classification level, but SMART required approximately half the amount of RAM of CLARK. SMART is the first scalable, efficient, and rapid metagenomics classification algorithm
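The indexing and one-base permutation ideas described above can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not SMART's actual implementation: the k-mer length `K`, the choice of the first four bases as the shard key, the species labels, and the function names `build_index`, `one_base_variants`, `lookup` and `classify` are all hypothetical. Exact lookup is tried first, and single-base substitutions are tried only when it finds nothing.

```python
from collections import defaultdict

SHARD_PREFIX = 4  # shard key length in bases (assumed reading of "4 base pair" shards)
K = 8             # k-mer length; hypothetical value chosen for this example


def build_index(references):
    """Map shard prefix -> {k-mer -> set of species labels}."""
    shards = defaultdict(lambda: defaultdict(set))
    for species, seq in references.items():
        for i in range(len(seq) - K + 1):
            kmer = seq[i:i + K]
            shards[kmer[:SHARD_PREFIX]][kmer].add(species)
    return shards


def one_base_variants(kmer):
    """Yield every k-mer that differs from kmer by one substitution."""
    for i, original in enumerate(kmer):
        for base in "ACGT":
            if base != original:
                yield kmer[:i] + base + kmer[i + 1:]


def lookup(shards, kmer):
    """Exact lookup; only the shard selected by the prefix is touched."""
    shard = shards.get(kmer[:SHARD_PREFIX])
    return set(shard.get(kmer, ())) if shard else set()


def classify(shards, kmer):
    """Exact match first, then fall back to one-base substitutions."""
    hits = lookup(shards, kmer)
    if hits:
        return hits
    found = set()
    for variant in one_base_variants(kmer):
        found |= lookup(shards, variant)
    return found


if __name__ == "__main__":
    refs = {"sp_a": "ACGTACGTAA", "sp_b": "TTGACCGTAC"}  # made-up sequences
    index = build_index(refs)
    print(classify(index, "ACGTACGT"))  # exact hit
    print(classify(index, "ACGTACGA"))  # rescued by one substitution
```

Since each lookup touches exactly one shard, shards could be placed on different workers and queried independently, which is the kind of partitioning a MapReduce-style search relies on.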
Piatkowski, Pawel; Jablonska, Jagoda; Zyla, Adriana; Niedzialek, Dorota; Matelska, Dorota; Jankowska, Elzbieta; Walen, Tomasz; Dawson, Wayne K; Bujnicki, Janusz M
RNA has been found to play an ever-increasing role in a variety of biological processes. The function of most non-coding RNA molecules depends on their structure. Comparing and classifying macromolecular 3D s