Sample records for motif discovery algorithm

  1. A generic motif discovery algorithm for sequential data.

    PubMed

    Jensen, Kyle L; Styczynski, Mark P; Rigoutsos, Isidore; Stephanopoulos, Gregory N

    2006-01-01

    Motif discovery in sequential data is a problem of great interest and with many applications. However, previous methods have been unable to combine exhaustive search with complex motif representations and are each typically only applicable to a certain class of problems. Here we present a generic motif discovery algorithm (Gemoda) for sequential data. Gemoda can be applied to any dataset with a sequential character, including both categorical and real-valued data. As we show, Gemoda deterministically discovers motifs that are maximal in composition and length. The algorithm also allows any choice of similarity metric for finding motifs. Finally, Gemoda's output motifs are representation-agnostic: they can be represented using regular expressions, position weight matrices or any number of other models for any type of sequential data. We demonstrate a number of applications of the algorithm, including the discovery of motifs in amino acid sequences, a new solution to the (l,d)-motif problem in DNA sequences and the discovery of conserved protein substructures. Gemoda is freely available at http://web.mit.edu/bamel/gemoda
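
    The (l,d)-motif problem mentioned in the abstract above asks for a string of length l that occurs, with at most d mismatches, in every input sequence. The sketch below is a naive exhaustive enumeration intended only to make the problem statement concrete; it is not Gemoda's method, and the function names (hamming, has_occurrence, ld_motifs) are hypothetical. It is practical only for small l, since it tries all 4^l candidates.

```python
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def has_occurrence(seq, motif, d):
    """True if some window of seq is within d mismatches of motif."""
    l = len(motif)
    return any(hamming(seq[i:i + l], motif) <= d
               for i in range(len(seq) - l + 1))


def ld_motifs(seqs, l, d, alphabet="ACGT"):
    """Brute force: every length-l string found in all sequences with <= d mismatches."""
    candidates = ("".join(chars) for chars in product(alphabet, repeat=l))
    return [m for m in candidates
            if all(has_occurrence(s, m, d) for s in seqs)]


if __name__ == "__main__":
    toy = ["TTACGTAA", "GGACCTCC", "ACGAGGGG"]  # made-up sequences
    print("ACGT" in ld_motifs(toy, l=4, d=1))
```

    With d = 0 the same call reduces to finding exact substrings shared by all sequences.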

  2. BayesMotif: de novo protein sorting motif discovery from impure datasets.

    PubMed

    Hu, Jianjun; Zhang, Fan

    2010-01-18

    Protein sorting is the process by which newly synthesized proteins are transported to their target locations within or outside of the cell. This process is precisely regulated by protein sorting signals in different forms. A major category of sorting signals are amino acid sub-sequences usually located at the N-terminals or C-terminals of protein sequences. Genome-wide experimental identification of protein sorting signals is extremely time-consuming and costly. Effective computational algorithms for de novo discovery of protein sorting signals are needed to improve the understanding of protein sorting mechanisms. We formulated the protein sorting motif discovery problem as a classification problem and proposed a Bayesian classifier based algorithm (BayesMotif) for de novo identification of a common type of protein sorting motif in which a highly conserved anchor is present along with a less conserved motif region. A false positive removal procedure is developed to iteratively remove sequences that are unlikely to contain true motifs so that the algorithm can identify motifs from impure input sequences. Experiments on both implanted motif datasets and real-world datasets showed that the enhanced BayesMotif algorithm can identify anchored sorting motifs from pure or impure protein sequence datasets. They also show that the false positive removal procedure can help to identify true motifs even when only 20% of the input sequences contain true motif instances. We proposed BayesMotif, a novel Bayesian classification based algorithm for de novo discovery of a special category of anchored protein sorting motifs from impure datasets. Compared to conventional motif discovery algorithms such as MEME, our algorithm can find less-conserved motifs with short highly conserved anchors. Our algorithm also has the advantage of easy incorporation of additional meta-sequence features such as hydrophobicity or charge of the motifs, which may help to overcome the limitations of the PWM (position weight matrix) motif model.

  3. Limitations and potentials of current motif discovery algorithms

    PubMed Central

    Hu, Jianjun; Li, Bin; Kihara, Daisuke

    2005-01-01

    Computational methods for de novo identification of gene regulation elements, such as transcription factor binding sites, have proved to be useful for deciphering genetic regulatory networks. However, despite the availability of a large number of algorithms, their strengths and weaknesses are not sufficiently understood. Here, we designed a comprehensive set of performance measures and benchmarked five modern sequence-based motif discovery algorithms using large datasets generated from Escherichia coli RegulonDB. Factors that affect the prediction accuracy, scalability and reliability are characterized. It is revealed that the nucleotide and the binding site level accuracy are very low, while the motif level accuracy is relatively high, which indicates that the algorithms can usually capture at least one correct motif in an input sequence. To exploit diverse predictions from multiple runs of one or more algorithms, a consensus ensemble algorithm has been developed, which achieved 6–45% improvement over the base algorithms by increasing both the sensitivity and specificity. Our study illustrates limitations and potentials of existing sequence-based motif discovery algorithms. Taking advantage of the revealed potentials, several promising directions for further improvements are discussed. Since the sequence-based algorithms are the baseline of most of the modern motif discovery algorithms, this paper suggests substantial improvements would be possible for them. PMID:16284194
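
    The abstract above distinguishes nucleotide-level accuracy from site-level and motif-level accuracy without giving formulas. As a rough illustration, the sketch below computes nucleotide-level sensitivity, TP/(TP+FN), and positive predictive value, TP/(TP+FP), from annotated and predicted site intervals. These are commonly used definitions and are an assumption here; the paper's own measures may differ. The half-open (start, end) interval convention and the function names are also hypothetical.

```python
def positions(sites):
    """Expand half-open (start, end) intervals into a set of nucleotide positions."""
    return {p for start, end in sites for p in range(start, end)}


def nucleotide_scores(true_sites, predicted_sites):
    """Return (sensitivity, positive predictive value) at the nucleotide level."""
    truth = positions(true_sites)
    pred = positions(predicted_sites)
    tp = len(truth & pred)   # nucleotides correctly predicted as site
    fn = len(truth - pred)   # annotated site nucleotides that were missed
    fp = len(pred - truth)   # predicted nucleotides outside any annotated site
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, ppv


if __name__ == "__main__":
    # One annotated site at 10..19, one prediction shifted by five bases.
    print(nucleotide_scores([(10, 20)], [(15, 25)]))
```

    A prediction that overlaps a true site by only a few bases can count as a hit at the site or motif level while still scoring poorly here, which is one way the gap between levels described in the abstract can arise.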

  4. Classification and assessment tools for structural motif discovery algorithms.

    PubMed

    Badr, Ghada; Al-Turaiki, Isra; Mathkour, Hassan

    2013-01-01

    Motif discovery is the problem of finding recurring patterns in biological data. Patterns can be sequential, mainly when discovered in DNA sequences. They can also be structural (e.g. when discovering RNA motifs). Finding common structural patterns helps to gain a better understanding of the mechanism of action (e.g. post-transcriptional regulation). Unlike DNA motifs, which are sequentially conserved, RNA motifs exhibit conservation in structure, which may be common even if the sequences are different. Over the past few years, hundreds of algorithms have been developed to solve the sequential motif discovery problem, while less work has been done for the structural case. In this paper, we survey, classify, and compare different algorithms that solve the structural motif discovery problem, where the underlying sequences may be different. We highlight their strengths and weaknesses. We start by proposing a benchmark dataset and a measurement tool that can be used to evaluate different motif discovery approaches. Then, we proceed by proposing our experimental setup. Finally, results are obtained using the proposed benchmark to compare available tools. To the best of our knowledge, this is the first attempt to compare tools solely designed for structural motif discovery. Results show that the accuracy of discovered motifs is relatively low. The results also suggest a complementary behavior among tools where some tools perform well on simple structures, while other tools are better for complex structures. We have classified and evaluated the performance of available structural motif discovery tools. In addition, we have proposed a benchmark dataset with tools that can be used to evaluate newly developed tools.

  5. A Monte Carlo-based framework enhances the discovery and interpretation of regulatory sequence motifs

    PubMed Central

    2012-01-01

    Background Discovery of functionally significant short, statistically overrepresented subsequence patterns (motifs) in a set of sequences is a challenging problem in bioinformatics. Oftentimes, not all sequences in the set contain a motif. These non-motif-containing sequences complicate the algorithmic discovery of motifs. Filtering the non-motif-containing sequences from the larger set of sequences while simultaneously determining the identity of the motif is, therefore, desirable and a non-trivial problem in motif discovery research. Results We describe MotifCatcher, a framework that extends the sensitivity of existing motif-finding tools by employing random sampling to effectively remove non-motif-containing sequences from the motif search. We developed two implementations of our algorithm, each built around a commonly used motif-finding tool, and applied our algorithm to three diverse chromatin immunoprecipitation (ChIP) data sets. In each case, the motif finder with the MotifCatcher extension demonstrated improved sensitivity over the motif finder alone. Our approach organizes candidate functionally significant discovered motifs into a tree, which allowed us to gain additional insights. In all cases, we were able to support our findings with experimental work from the literature. Conclusions Our framework demonstrates that additional processing at the sequence entry level can significantly improve the performance of existing motif-finding tools. For each biological data set tested, we were able to propose novel biological hypotheses supported by experimental work from the literature. Specifically, in Escherichia coli, we suggested binding site motifs for 6 non-traditional LexA protein binding sites; in Saccharomyces cerevisiae, we hypothesized 2 disparate mechanisms for novel binding sites of the Cse4p protein; and in Halobacterium sp. NRC-1, we discovered subtle differences in a general transcription factor (GTF) binding site motif across several data sets. We suggest that small differences in our discovered motif could confer specificity for one or more homologous GTF proteins. We offer a free implementation of the MotifCatcher software package at http://www.bme.ucdavis.edu/facciotti/resources_data/software/. PMID:23181585

  6. IndeCut evaluates performance of network motif discovery algorithms.

    PubMed

    Ansariola, Mitra; Megraw, Molly; Koslicki, David

    2018-05-01

    Genomic networks represent a complex map of molecular interactions which are descriptive of the biological processes occurring in living cells. Identifying the small over-represented circuitry patterns in these networks helps generate hypotheses about the functional basis of such complex processes. Network motif discovery is a systematic way of achieving this goal. However, a reliable network motif discovery outcome requires generating random background networks which are the result of a uniform and independent graph sampling method. To date, there has been no method to numerically evaluate whether any network motif discovery algorithm performs as intended on realistically sized datasets; thus it was not possible to assess the validity of resulting network motifs. In this work, we present IndeCut, the first method to date that characterizes network motif finding algorithm performance in terms of uniform sampling on realistically sized networks. We demonstrate that it is critical to use IndeCut prior to running any network motif finder for two reasons. First, IndeCut indicates the number of samples needed for a tool to produce an outcome that is both reproducible and accurate. Second, IndeCut allows users to choose the tool that generates samples in the most independent fashion for their network of interest among many available options. The open source software package is available at https://github.com/megrawlab/IndeCut. Contact: megrawm@science.oregonstate.edu or david.koslicki@math.oregonstate.edu. Supplementary data are available at Bioinformatics online.

  7. info-gibbs: a motif discovery algorithm that directly optimizes information content during sampling.

    PubMed

    Defrance, Matthieu; van Helden, Jacques

    2009-10-15

    Discovering cis-regulatory elements in genome sequence remains a challenging issue. Several methods rely on the optimization of some target scoring function. The information content (IC) or relative entropy of the motif has proven to be a good estimator of transcription factor DNA binding affinity. However, these information-based metrics are usually used as a posteriori statistics rather than during the motif search process itself. We introduce here info-gibbs, a Gibbs sampling algorithm that efficiently optimizes the IC or the log-likelihood ratio (LLR) of the motif while keeping computation time low. The method compares well with existing methods like MEME, BioProspector, Gibbs or GAME on both synthetic and biological datasets. Our study shows that motif discovery techniques can be enhanced by directly focusing the search on the motif IC or the motif LLR. http://rsat.ulb.ac.be/rsat/info-gibbs
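
    The information content (IC) referred to in the abstract above is the relative entropy of the motif's column frequencies against a background distribution. The sketch below computes it for a set of aligned sites, which is the quantity a sampler could score at each step. It is not the info-gibbs implementation: the function name, the uniform default background, and the omission of pseudocounts are simplifying assumptions.

```python
import math
from collections import Counter


def information_content(sites, background=None):
    """Sum over columns of sum_b f_b * log2(f_b / q_b), in bits.

    sites: equal-length aligned strings; background: dict base -> probability.
    """
    if background is None:
        background = {b: 0.25 for b in "ACGT"}  # assumed uniform background
    width = len(sites[0])
    total = 0.0
    for col in range(width):
        counts = Counter(site[col] for site in sites)
        for base, n in counts.items():
            f = n / len(sites)              # observed frequency in this column
            total += f * math.log2(f / background[base])
    return total


if __name__ == "__main__":
    print(information_content(["ACGT", "ACGT", "ACGT"]))  # fully conserved sites
    print(information_content(["A", "C", "G", "T"]))      # column matching background
```

    A perfectly conserved DNA column contributes 2 bits under a uniform background, and a column that matches the background contributes none.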

  8. CombiMotif: A new algorithm for network motifs discovery in protein-protein interaction networks

    NASA Astrophysics Data System (ADS)

    Luo, Jiawei; Li, Guanghui; Song, Dan; Liang, Cheng

    2014-12-01

    Discovering motifs in protein-protein interaction networks is becoming a current major challenge in computational biology, since the distribution of the number of network motifs can reveal significant systemic differences among species. However, this task can be computationally expensive because of the involvement of graph isomorphism detection. In this paper, we present a new algorithm (CombiMotif) that incorporates combinatorial techniques to count non-induced occurrences of subgraph topologies in the form of trees. The efficiency of our algorithm is demonstrated by comparing the obtained results with the current state-of-the-art subgraph counting algorithms. We also show major differences between unicellular and multicellular organisms. The datasets and source code of CombiMotif are freely available upon request.
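
    To illustrate what counting non-induced occurrences of a tree topology by combinatorial means can look like, the sketch below handles the simplest tree family, the star with k leaves. For a simple undirected graph the count follows directly from the degree sequence as the sum over nodes of C(deg(v), k), with no isomorphism test. This is only an illustration of the idea and not the CombiMotif algorithm; the function name and the edge-list input format are assumptions.

```python
from collections import defaultdict
from math import comb


def count_stars(edges, k):
    """Count non-induced k-leaf stars in a simple undirected graph.

    edges: iterable of (u, v) pairs, each undirected edge listed once, no self-loops.
    """
    if k < 2:
        raise ValueError("k must be >= 2 (for k = 1 each edge would be counted twice)")
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Each choice of k neighbours around a centre node is one occurrence.
    return sum(comb(d, k) for d in degree.values())


if __name__ == "__main__":
    triangle = [(1, 2), (2, 3), (1, 3)]
    print(count_stars(triangle, 2))  # 3-node paths, counted as non-induced
```

    In the triangle, every 3-node path is counted even though the two end nodes are also connected; an induced count would exclude them.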

  9. FoldMiner and LOCK 2: protein structure comparison and motif discovery on the web.

    PubMed

    Shapiro, Jessica; Brutlag, Douglas

    2004-07-01

    The FoldMiner web server (http://foldminer.stanford.edu/) provides remote access to methods for protein structure alignment and unsupervised motif discovery. FoldMiner is unique among such algorithms in that it improves both the motif definition and the sensitivity of a structural similarity search by combining the search and motif discovery methods and using information from each process to enhance the other. In a typical run, a query structure is aligned to all structures in one of several databases of single domain targets in order to identify its structural neighbors and to discover a motif that is the basis for the similarity among the query and statistically significant targets. This process is fully automated, but options for manual refinement of the results are available as well. The server uses the Chime plugin and customized controls to allow for visualization of the motif and of structural superpositions. In addition, we provide an interface to the LOCK 2 algorithm for rapid alignments of a query structure to smaller numbers of user-specified targets.

  10. Biological network motif detection and evaluation

    PubMed Central

    2011-01-01

    Background Molecular-level biological data can be assembled into system-level data in the form of biological networks. Network motifs are defined as over-represented small connected subgraphs in networks and they have been used for many biological applications. Since network motif discovery involves computationally challenging processes, previous algorithms have focused on computational efficiency. However, we believe that the biological quality of network motifs is also very important. Results We define biological network motifs as biologically significant subgraphs, and traditional network motifs are differentiated as structural network motifs in this paper. We develop five algorithms, namely, EDGEGO-BNM, EDGEBETWEENNESS-BNM, NMF-BNM, NMFGO-BNM and VOLTAGE-BNM, for efficient detection of biological network motifs, and introduce several evaluation measures including motifs included in complex, motifs included in functional module and GO term clustering score in this paper. Experimental results show that EDGEGO-BNM and EDGEBETWEENNESS-BNM perform better than existing algorithms and all of our algorithms are applicable to find structural network motifs as well. Conclusion We provide new approaches to finding network motifs in biological networks. Our algorithms efficiently detect biological network motifs and further improve existing algorithms to find high quality structural network motifs, which would be impossible using existing algorithms. The performances of the algorithms are compared based on our new evaluation measures in biological contexts. We believe that our work offers some guidelines for network motif research in biological networks. PMID:22784624

  11. A private DNA motif finding algorithm.

    PubMed

    Chen, Rui; Peng, Yun; Choi, Byron; Xu, Jianliang; Hu, Haibo

    2014-08-01

    With the increasing availability of genomic sequence data, numerous methods have been proposed for finding DNA motifs. The discovery of DNA motifs serves as a critical step in many biological applications. However, the privacy implication of DNA analysis is normally neglected in the existing methods. In this work, we propose a private DNA motif finding algorithm in which a DNA owner's privacy is protected by a rigorous privacy model, known as ∊-differential privacy. It provides provable privacy guarantees that are independent of adversaries' background knowledge. Our algorithm makes use of the n-gram model and is optimized for processing large-scale DNA sequences. We evaluate the performance of our algorithm over real-life genomic data and demonstrate the promise of integrating privacy into DNA motif finding. Copyright © 2014 Elsevier Inc. All rights reserved.

  12. Efficient exact motif discovery.

    PubMed

    Marschall, Tobias; Rahmann, Sven

    2009-06-15

    The motif discovery problem consists of finding over-represented patterns in a collection of biosequences. It is one of the classical sequence analysis problems, but still has not been satisfactorily solved in an exact and efficient manner. This is partly due to the large number of possibilities of defining the motif search space and the notion of over-representation. Even for well-defined formalizations, the problem is frequently solved in an ad hoc manner with heuristics that do not guarantee to find the best motif. We show how to solve the motif discovery problem (almost) exactly on a practically relevant space of IUPAC generalized string patterns, using the p-value with respect to an i.i.d. model or a Markov model as the measure of over-representation. In particular, (i) we use a highly accurate compound Poisson approximation for the null distribution of the number of motif occurrences. We show how to compute the exact clump size distribution using a recently introduced device called probabilistic arithmetic automaton (PAA). (ii) We define two p-value scores for over-representation, the first one based on the total number of motif occurrences, the second one based on the number of sequences in a collection with at least one occurrence. (iii) We describe an algorithm to discover the optimal pattern with respect to either of the scores. The method exploits monotonicity properties of the compound Poisson approximation and is by orders of magnitude faster than exhaustive enumeration of IUPAC strings (11.8 h compared with an extrapolated runtime of 4.8 years). (iv) We justify the use of the proposed scores for motif discovery by showing our method to outperform other motif discovery algorithms (e.g. MEME, Weeder) on benchmark datasets. We also propose new motifs on Mycobacterium tuberculosis. The method has been implemented in Java. It can be obtained from http://ls11-www.cs.tu-dortmund.de/people/marschal/paa_md/.
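
    The two p-value scores described in the abstract above are built on two raw statistics of an IUPAC pattern: its total number of occurrences, and the number of sequences containing at least one occurrence. The sketch below computes only these two counts by direct scanning. The compound Poisson approximation, the probabilistic arithmetic automaton, and the search over patterns are not reproduced, and the function names are hypothetical.

```python
# Standard IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches_at(seq, start, pattern):
    """True if pattern matches seq beginning at index start."""
    return all(seq[start + j] in IUPAC[code] for j, code in enumerate(pattern))


def occurrence_counts(seqs, pattern):
    """Return (total occurrences, number of sequences with >= 1 occurrence).

    Overlapping occurrences are counted separately.
    """
    total = 0
    sequences_hit = 0
    for seq in seqs:
        n = sum(matches_at(seq, i, pattern)
                for i in range(len(seq) - len(pattern) + 1))
        total += n
        sequences_hit += int(n > 0)
    return total, sequences_hit


if __name__ == "__main__":
    print(occurrence_counts(["ACGTACGA", "TTTT", "AGGA"], "ACGW"))
```

    Scoring a pattern would then require the null distribution of each count, which is where the approximation described in the abstract comes in.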

  13. GrammarViz 3.0: Interactive Discovery of Variable-Length Time Series Patterns

    DOE PAGES

    Senin, Pavel; Lin, Jessica; Wang, Xing; ...

    2018-02-23

    The problems of recurrent and anomalous pattern discovery in time series, e.g., motifs and discords, respectively, have received a lot of attention from researchers in the past decade. However, since the pattern search space is usually intractable, most existing detection algorithms require that the patterns have discriminative characteristics and have their length known in advance and provided as input, which is an unreasonable requirement for many real-world problems. In addition, patterns of similar structure, but of different lengths may co-exist in a time series. In order to address these issues, we have developed algorithms for variable-length time series pattern discovery that are based on symbolic discretization and grammar inference—two techniques whose combination enables the structured reduction of the search space and discovery of the candidate patterns in linear time. In this work, we present GrammarViz 3.0—a software package that provides implementations of proposed algorithms and graphical user interface for interactive variable-length time series pattern discovery. The current version of the software provides an alternative grammar inference algorithm that improves the time series motif discovery workflow, and introduces an experimental procedure for automated discretization parameter selection that builds upon the minimum cardinality maximum cover principle and aids the time series recurrent and anomalous pattern discovery.
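
    The abstract above does not name the symbolic discretization scheme. As an assumed example, the sketch below follows the general shape of SAX-style discretization: z-normalize the series, average it over equal-width segments, and map each segment mean to a letter using equiprobable breakpoints of the standard normal distribution. It is not the GrammarViz implementation, and the function name, the parameters, and the requirement that the series length be divisible by the segment count are simplifications. The grammar inference step that would run on the resulting words is not shown.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def sax_word(series, segments, alphabet_size):
    """Convert a numeric series to a word of `segments` letters ('a', 'b', ...)."""
    if len(series) % segments:
        raise ValueError("series length must be divisible by segments")
    mu, sigma = mean(series), pstdev(series)
    # z-normalize; a constant series maps to all zeros
    z = [(x - mu) / sigma if sigma else 0.0 for x in series]
    size = len(z) // segments
    # piecewise aggregate approximation: one mean per segment
    paa = [mean(z[i * size:(i + 1) * size]) for i in range(segments)]
    # breakpoints that split the standard normal into equiprobable regions
    breakpoints = [NormalDist().inv_cdf(i / alphabet_size)
                   for i in range(1, alphabet_size)]
    return "".join(chr(ord("a") + bisect_right(breakpoints, v)) for v in paa)


if __name__ == "__main__":
    print(sax_word([1, 2, 3, 4, 5, 6, 7, 8], segments=4, alphabet_size=4))
```

    The segment count and alphabet size are the kind of discretization parameters for which the abstract describes an automated selection procedure.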

  14. GrammarViz 3.0: Interactive Discovery of Variable-Length Time Series Patterns

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Senin, Pavel; Lin, Jessica; Wang, Xing

    The problems of recurrent and anomalous pattern discovery in time series, e.g., motifs and discords, respectively, have received a lot of attention from researchers in the past decade. However, since the pattern search space is usually intractable, most existing detection algorithms require that the patterns have discriminative characteristics and have their length known in advance and provided as input, which is an unreasonable requirement for many real-world problems. In addition, patterns of similar structure, but of different lengths may co-exist in a time series. In order to address these issues, we have developed algorithms for variable-length time series pattern discovery that are based on symbolic discretization and grammar inference—two techniques whose combination enables the structured reduction of the search space and discovery of the candidate patterns in linear time. In this work, we present GrammarViz 3.0—a software package that provides implementations of proposed algorithms and graphical user interface for interactive variable-length time series pattern discovery. The current version of the software provides an alternative grammar inference algorithm that improves the time series motif discovery workflow, and introduces an experimental procedure for automated discretization parameter selection that builds upon the minimum cardinality maximum cover principle and aids the time series recurrent and anomalous pattern discovery.

  15. Discovery of phosphorylation motif mixtures in phosphoproteomics data

    PubMed Central

    Ritz, Anna; Shakhnarovich, Gregory; Salomon, Arthur R.; Raphael, Benjamin J.

    2009-01-01

    Motivation: Modification of proteins via phosphorylation is a primary mechanism for signal transduction in cells. Phosphorylation sites on proteins are determined in part through particular patterns, or motifs, present in the amino acid sequence. Results: We describe an algorithm that simultaneously discovers multiple motifs in a set of peptides that were phosphorylated by several different kinases. Such sets of peptides are routinely produced in proteomics experiments. Our motif-finding algorithm uses the principle of minimum description length to determine a mixture of sequence motifs that distinguish a foreground set of phosphopeptides from a background set of unphosphorylated peptides. We show that our algorithm outperforms existing motif-finding algorithms on synthetic datasets consisting of mixtures of known phosphorylation sites. We also derive a motif specificity score that quantifies whether or not the phosphoproteins containing an instance of a motif have a significant number of known interactions. Application of our motif-finding algorithm to recently published human and mouse proteomic studies recovers several known phosphorylation motifs and reveals a number of novel motifs that are enriched for interactions with a particular kinase or phosphatase. Our tools provide a new approach for uncovering the sequence specificities of uncharacterized kinases or phosphatases. Availability: Software is available at http://cs.brown.edu/people/braphael/software.html. Contact: aritz@cs.brown.edu; braphael@cs.brown.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:18996944

  16. The effect of orthology and coregulation on detecting regulatory motifs.

    PubMed

    Storms, Valerie; Claeys, Marleen; Sanchez, Aminael; De Moor, Bart; Verstuyf, Annemieke; Marchal, Kathleen

    2010-02-03

    Computational de novo discovery of transcription factor binding sites is still a challenging problem. The growing number of sequenced genomes allows integrating orthology evidence with coregulation information when searching for motifs. Moreover, the more advanced motif detection algorithms explicitly model the phylogenetic relatedness between the orthologous input sequences and thus should be well adapted towards using orthologous information. In this study, we evaluated the conditions under which complementing coregulation with orthologous information improves motif detection for the class of probabilistic motif detection algorithms with an explicit evolutionary model. We designed datasets (real and synthetic) covering different degrees of coregulation and orthologous information to test how well Phylogibbs and Phylogenetic sampler, as representatives of the motif detection algorithms with evolutionary model performed as compared to MEME, a more classical motif detection algorithm that treats orthologs independently. Under certain conditions detecting motifs in the combined coregulation-orthology space is indeed more efficient than using each space separately, but this is not always the case. Moreover, the difference in success rate between the advanced algorithms and MEME is still marginal. The success rate of motif detection depends on the complex interplay between the added information and the specificities of the applied algorithms. Insights in this relation provide information useful to both developers and users. All benchmark datasets are available at http://homes.esat.kuleuven.be/~kmarchal/Supplementary_Storms_Valerie_PlosONE.

  17. The Effect of Orthology and Coregulation on Detecting Regulatory Motifs

    PubMed Central

    Storms, Valerie; Claeys, Marleen; Sanchez, Aminael; De Moor, Bart; Verstuyf, Annemieke; Marchal, Kathleen

    2010-01-01

    Background Computational de novo discovery of transcription factor binding sites is still a challenging problem. The growing number of sequenced genomes allows integrating orthology evidence with coregulation information when searching for motifs. Moreover, the more advanced motif detection algorithms explicitly model the phylogenetic relatedness between the orthologous input sequences and thus should be well adapted towards using orthologous information. In this study, we evaluated the conditions under which complementing coregulation with orthologous information improves motif detection for the class of probabilistic motif detection algorithms with an explicit evolutionary model. Methodology We designed datasets (real and synthetic) covering different degrees of coregulation and orthologous information to test how well Phylogibbs and Phylogenetic sampler, as representatives of the motif detection algorithms with evolutionary model performed as compared to MEME, a more classical motif detection algorithm that treats orthologs independently. Results and Conclusions Under certain conditions detecting motifs in the combined coregulation-orthology space is indeed more efficient than using each space separately, but this is not always the case. Moreover, the difference in success rate between the advanced algorithms and MEME is still marginal. The success rate of motif detection depends on the complex interplay between the added information and the specificities of the applied algorithms. Insights in this relation provide information useful to both developers and users. All benchmark datasets are available at http://homes.esat.kuleuven.be/~kmarchal/Supplementary_Storms_Valerie_PlosONE. PMID:20140085

  18. Memetic algorithms for de novo motif-finding in biomedical sequences.

    PubMed

    Bi, Chengpeng

    2012-09-01

    The objectives of this study are to design and implement a new memetic algorithm for de novo motif discovery, which is then applied to detect important signals hidden in various biomedical molecular sequences. In this paper, memetic algorithms are developed and tested in de novo motif-finding problems. Several strategies are employed in the algorithm design, not only to explore the multiple sequence local alignment space efficiently, but also to uncover the molecular signals effectively. As a result, there are a number of key features in the implementation of the memetic motif-finding algorithm (MaMotif), including a chromosome replacement operator, a chromosome alteration-aware local search operator, a truncated local search strategy, and a stochastic operation of local search imposed on individual learning. To test the new algorithm, we compare MaMotif with a few other similar algorithms using simulated and experimental data including genomic DNA, primary microRNA sequences (let-7 family), and transmembrane protein sequences. The new memetic motif-finding algorithm is successfully implemented in C++, and exhaustively tested with various simulated and real biological sequences. In the simulation, MaMotif is shown to be the most time-efficient algorithm compared with others; that is, it runs 2 times faster than the expectation maximization (EM) method and 16 times faster than the genetic algorithm-based EM hybrid. In both simulated and experimental testing, results show that the new algorithm compares favorably with, or is superior to, other algorithms. Notably, MaMotif is able to successfully discover the transcription factors' binding sites in the chromatin immunoprecipitation followed by massively parallel sequencing (ChIP-Seq) data, correctly uncover the RNA splicing signals in gene expression, and precisely find the highly conserved helix motif in the transmembrane protein sequences, as well as rightly detect the palindromic segments in the primary microRNA sequences. The memetic motif-finding algorithm is effectively designed and implemented, and its applications demonstrate that it is not only time-efficient, but also exhibits excellent performance when compared with other popular algorithms. Copyright © 2012 Elsevier B.V. All rights reserved.

  19. Promzea: a pipeline for discovery of co-regulatory motifs in maize and other plant species and its application to the anthocyanin and phlobaphene biosynthetic pathways and the Maize Development Atlas.

    PubMed

    Liseron-Monfils, Christophe; Lewis, Tim; Ashlock, Daniel; McNicholas, Paul D; Fauteux, François; Strömvik, Martina; Raizada, Manish N

    2013-03-15

    The discovery of genetic networks and cis-acting DNA motifs underlying their regulation is a major objective of transcriptome studies. The recent release of the maize genome (Zea mays L.) has facilitated in silico searches for regulatory motifs. Several algorithms exist to predict cis-acting elements, but none have been adapted for maize. A benchmark data set was used to evaluate the accuracy of three motif discovery programs: BioProspector, Weeder and MEME. Analysis showed that each motif discovery tool had limited accuracy and appeared to retrieve a distinct set of motifs. Therefore, using the benchmark, statistical filters were optimized to reduce the false discovery ratio, and then remaining motifs from all programs were combined to improve motif prediction. These principles were integrated into a user-friendly pipeline for motif discovery in maize called Promzea, available at http://www.promzea.org and on the Discovery Environment of the iPlant Collaborative website. Promzea was subsequently expanded to include rice and Arabidopsis. Within Promzea, a user enters cDNA sequences or gene IDs; corresponding upstream sequences are retrieved from the maize genome. Predicted motifs are filtered, combined and ranked. Promzea searches the chosen plant genome for genes containing each candidate motif, providing the user with the gene list and corresponding gene annotations. Promzea was validated in silico using a benchmark data set: the Promzea pipeline showed a 22% increase in nucleotide sensitivity compared to the best standalone program tool, Weeder, with equivalent nucleotide specificity. Promzea was also validated by its ability to retrieve the experimentally defined binding sites of transcription factors that regulate the maize anthocyanin and phlobaphene biosynthetic pathways. 
Promzea predicted additional promoter motifs, and genome-wide motif searches by Promzea identified 127 non-anthocyanin/phlobaphene genes that each contained all five predicted promoter motifs in their promoters, perhaps uncovering a broader co-regulated gene network. Promzea was also tested against tissue-specific microarray data from maize. An online tool customized for promoter motif discovery in plants has been generated called Promzea. Promzea was validated in silico by its ability to retrieve benchmark motifs and experimentally defined motifs and was tested using tissue-specific microarray data. Promzea predicted broader networks of gene regulation associated with the historic anthocyanin and phlobaphene biosynthetic pathways. Promzea is a new bioinformatics tool for understanding transcriptional gene regulation in maize and has been expanded to include rice and Arabidopsis.
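
    The nucleotide-level figures quoted above can be illustrated with a small sketch. The abstract does not define its metrics, so the definitions below are an assumption: sensitivity is taken as TP/(TP+FN) and specificity as TN/(TN+FP), counted over individual nucleotide positions. The function name and arguments are hypothetical and not part of Promzea.

```python
def nucleotide_metrics(known, predicted, sequence_length):
    """Nucleotide-level sensitivity and specificity (assumed definitions).

    known, predicted: iterables of 0-based positions covered by true and
    predicted binding sites; sequence_length: total positions evaluated.
    """
    known, predicted = set(known), set(predicted)
    tp = len(known & predicted)          # site positions correctly predicted
    fn = len(known - predicted)          # site positions that were missed
    fp = len(predicted - known)          # predicted positions outside any site
    tn = sequence_length - tp - fn - fp  # background correctly left unpredicted
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity


# A known site at positions 10-19 and a prediction at positions 15-24
sn, sp = nucleotide_metrics(range(10, 20), range(15, 25), 100)
```

    Some motif benchmarks report the positive predictive value TP/(TP+FP) in place of specificity, so a comparison should state which definition is in use.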

  20. SCOPE: a web server for practical de novo motif discovery.

    PubMed

    Carlson, Jonathan M; Chakravarty, Arijit; DeZiel, Charles E; Gross, Robert H

    2007-07-01

    SCOPE is a novel parameter-free method for the de novo identification of potential regulatory motifs in sets of coordinately regulated genes. The SCOPE algorithm combines the output of three component algorithms, each designed to identify a particular class of motifs. Using an ensemble learning approach, SCOPE identifies the best candidate motifs from its component algorithms. In tests on experimentally determined datasets, SCOPE identified motifs with a significantly higher level of accuracy than a number of other web-based motif finders run with their default parameters. Because SCOPE has no adjustable parameters, the web server has an intuitive interface, requiring only a set of gene names or FASTA sequences and a choice of species. The most significant motifs found by SCOPE are displayed graphically on the main results page with a table containing summary statistics for each motif. Detailed motif information, including the sequence logo, PWM, consensus sequence and specific matching sites can be viewed through a single click on a motif. SCOPE's efficient, parameter-free search strategy has enabled the development of a web server that is readily accessible to the practising biologist while providing results that compare favorably with those of other motif finders. The SCOPE web server is at .

  1. A discrete artificial bee colony algorithm for detecting transcription factor binding sites in DNA sequences.

    PubMed

    Karaboga, D; Aslan, S

    2016-04-27

    The great majority of biological sequences share significant similarity with other sequences as a result of evolutionary processes, and identifying these sequence similarities is one of the most challenging problems in bioinformatics. In this paper, we present a discrete artificial bee colony (ABC) algorithm, which is inspired by the intelligent foraging behavior of real honey bees, for the detection of highly conserved residue patterns or motifs within sequences. Experimental studies on three different data sets showed that the proposed discrete model, by adhering to the fundamental scheme of the ABC algorithm, produced competitive or better results than other metaheuristic motif discovery techniques.

  2. Crystal Structure Predictions Using Adaptive Genetic Algorithm and Motif Search methods

    NASA Astrophysics Data System (ADS)

    Ho, K. M.; Wang, C. Z.; Zhao, X.; Wu, S.; Lyu, X.; Zhu, Z.; Nguyen, M. C.; Umemoto, K.; Wentzcovitch, R. M. M.

    2017-12-01

    Material informatics is a new initiative which has attracted a lot of attention in recent scientific research. The basic strategy is to construct comprehensive data sets and use machine learning to solve a wide variety of problems in material design and discovery. In pursuit of this goal, a key element is the quality and completeness of the databases used. Recent advance in the development of crystal structure prediction algorithms has made it a complementary and more efficient approach to explore the structure/phase space in materials using computers. In this talk, we discuss the importance of the structural motifs and motif-networks in crystal structure predictions. Correspondingly, powerful methods are developed to improve the sampling of the low-energy structure landscape.

  3. LDsplit: screening for cis-regulatory motifs stimulating meiotic recombination hotspots by analysis of DNA sequence polymorphisms.

    PubMed

    Yang, Peng; Wu, Min; Guo, Jing; Kwoh, Chee Keong; Przytycka, Teresa M; Zheng, Jie

    2014-02-17

    As a fundamental genomic element, meiotic recombination hotspot plays important roles in life sciences. Thus uncovering its regulatory mechanisms has broad impact on biomedical research. Despite the recent identification of the zinc finger protein PRDM9 and its 13-mer binding motif as major regulators for meiotic recombination hotspots, other regulators remain to be discovered. Existing methods for finding DNA sequence motifs of recombination hotspots often rely on the enrichment of co-localizations between hotspots and short DNA patterns, which ignore the cross-individual variation of recombination rates and sequence polymorphisms in the population. Our objective in this paper is to capture signals encoded in genetic variations for the discovery of recombination-associated DNA motifs. Recently, an algorithm called "LDsplit" has been designed to detect the association between single nucleotide polymorphisms (SNPs) and proximal meiotic recombination hotspots. The association is measured by the difference of population recombination rates at a hotspot between two alleles of a candidate SNP. Here we present an open source software tool of LDsplit, with integrative data visualization for recombination hotspots and their proximal SNPs. Applying LDsplit on SNPs inside an established 7-mer motif bound by PRDM9 we observed that SNP alleles preserving the original motif tend to have higher recombination rates than the opposite alleles that disrupt the motif. Running on SNP windows around hotspots each containing an occurrence of the 7-mer motif, LDsplit is able to guide the established motif finding algorithm of MEME to recover the 7-mer motif. In contrast, without LDsplit the 7-mer motif could not be identified. LDsplit is a software tool for the discovery of cis-regulatory DNA sequence motifs stimulating meiotic recombination hotspots by screening and narrowing down to hotspot associated SNPs. 
It is the first computational method that utilizes the genetic variation of recombination hotspots among individuals, opening a new avenue for motif finding. Tested on an established motif and simulated datasets, LDsplit shows promise to discover novel DNA motifs for meiotic recombination hotspots.

  4. LDsplit: screening for cis-regulatory motifs stimulating meiotic recombination hotspots by analysis of DNA sequence polymorphisms

    PubMed Central

    2014-01-01

    Background As a fundamental genomic element, meiotic recombination hotspot plays important roles in life sciences. Thus uncovering its regulatory mechanisms has broad impact on biomedical research. Despite the recent identification of the zinc finger protein PRDM9 and its 13-mer binding motif as major regulators for meiotic recombination hotspots, other regulators remain to be discovered. Existing methods for finding DNA sequence motifs of recombination hotspots often rely on the enrichment of co-localizations between hotspots and short DNA patterns, which ignore the cross-individual variation of recombination rates and sequence polymorphisms in the population. Our objective in this paper is to capture signals encoded in genetic variations for the discovery of recombination-associated DNA motifs. Results Recently, an algorithm called “LDsplit” has been designed to detect the association between single nucleotide polymorphisms (SNPs) and proximal meiotic recombination hotspots. The association is measured by the difference of population recombination rates at a hotspot between two alleles of a candidate SNP. Here we present an open source software tool of LDsplit, with integrative data visualization for recombination hotspots and their proximal SNPs. Applying LDsplit on SNPs inside an established 7-mer motif bound by PRDM9 we observed that SNP alleles preserving the original motif tend to have higher recombination rates than the opposite alleles that disrupt the motif. Running on SNP windows around hotspots each containing an occurrence of the 7-mer motif, LDsplit is able to guide the established motif finding algorithm of MEME to recover the 7-mer motif. In contrast, without LDsplit the 7-mer motif could not be identified. Conclusions LDsplit is a software tool for the discovery of cis-regulatory DNA sequence motifs stimulating meiotic recombination hotspots by screening and narrowing down to hotspot associated SNPs. 
It is the first computational method that utilizes the genetic variation of recombination hotspots among individuals, opening a new avenue for motif finding. Tested on an established motif and simulated datasets, LDsplit shows promise to discover novel DNA motifs for meiotic recombination hotspots. PMID:24533858

  5. BLSSpeller: exhaustive comparative discovery of conserved cis-regulatory elements.

    PubMed

    De Witte, Dieter; Van de Velde, Jan; Decap, Dries; Van Bel, Michiel; Audenaert, Pieter; Demeester, Piet; Dhoedt, Bart; Vandepoele, Klaas; Fostier, Jan

    2015-12-01

    The accurate discovery and annotation of regulatory elements remains a challenging problem. The growing number of sequenced genomes creates new opportunities for comparative approaches to motif discovery. Putative binding sites are then considered to be functional if they are conserved in orthologous promoter sequences of multiple related species. Existing methods for comparative motif discovery usually rely on pregenerated multiple sequence alignments, which are difficult to obtain for more diverged species such as plants. As a consequence, misaligned regulatory elements often remain undetected. We present a novel algorithm that supports both alignment-free and alignment-based motif discovery in the promoter sequences of related species. Putative motifs are exhaustively enumerated as words over the IUPAC alphabet and screened for conservation using the branch length score. Additionally, a confidence score is established in a genome-wide fashion. In order to take advantage of a cloud computing infrastructure, the MapReduce programming model is adopted. The method is applied to four monocotyledon plant species and it is shown that high-scoring motifs are significantly enriched for open chromatin regions in Oryza sativa and for transcription factor binding sites inferred through protein-binding microarrays in O.sativa and Zea mays. Furthermore, the method is shown to recover experimentally profiled ga2ox1-like KN1 binding sites in Z.mays. BLSSpeller was written in Java. Source code and manual are available at http://bioinformatics.intec.ugent.be/blsspeller Contact: Klaas.Vandepoele@psb.vib-ugent.be or jan.fostier@intec.ugent.be. Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press.
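
    To make "words over the IUPAC alphabet" concrete, the sketch below matches a degenerate word against a DNA sequence using the standard IUPAC nucleotide codes. It covers only the matching step; the branch length score and the MapReduce enumeration described in the abstract are not reproduced, and the function names are hypothetical rather than taken from BLSSpeller.

```python
# Standard IUPAC nucleotide codes and the bases each one stands for
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches(word, site):
    """True if every base of site is allowed by the IUPAC code at that position."""
    return len(word) == len(site) and all(
        base in IUPAC[code] for code, base in zip(word, site)
    )


def occurrences(word, sequence):
    """0-based start positions where the IUPAC word matches the sequence."""
    k = len(word)
    return [
        i for i in range(len(sequence) - k + 1)
        if matches(word, sequence[i:i + k])
    ]


# "Y" stands for C or T, so TGAY matches both TGAC and TGAT
hits = occurrences("TGAY", "ATGACTTGATGG")
```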

  6. BLSSpeller: exhaustive comparative discovery of conserved cis-regulatory elements

    PubMed Central

    De Witte, Dieter; Van de Velde, Jan; Decap, Dries; Van Bel, Michiel; Audenaert, Pieter; Demeester, Piet; Dhoedt, Bart; Vandepoele, Klaas; Fostier, Jan

    2015-01-01

    Motivation: The accurate discovery and annotation of regulatory elements remains a challenging problem. The growing number of sequenced genomes creates new opportunities for comparative approaches to motif discovery. Putative binding sites are then considered to be functional if they are conserved in orthologous promoter sequences of multiple related species. Existing methods for comparative motif discovery usually rely on pregenerated multiple sequence alignments, which are difficult to obtain for more diverged species such as plants. As a consequence, misaligned regulatory elements often remain undetected. Results: We present a novel algorithm that supports both alignment-free and alignment-based motif discovery in the promoter sequences of related species. Putative motifs are exhaustively enumerated as words over the IUPAC alphabet and screened for conservation using the branch length score. Additionally, a confidence score is established in a genome-wide fashion. In order to take advantage of a cloud computing infrastructure, the MapReduce programming model is adopted. The method is applied to four monocotyledon plant species and it is shown that high-scoring motifs are significantly enriched for open chromatin regions in Oryza sativa and for transcription factor binding sites inferred through protein-binding microarrays in O.sativa and Zea mays. Furthermore, the method is shown to recover experimentally profiled ga2ox1-like KN1 binding sites in Z.mays. Availability and implementation: BLSSpeller was written in Java. Source code and manual are available at http://bioinformatics.intec.ugent.be/blsspeller Contact: Klaas.Vandepoele@psb.vib-ugent.be or jan.fostier@intec.ugent.be Supplementary information: Supplementary data are available at Bioinformatics online. PMID:26254488

  7. Simultaneously learning DNA motif along with its position and sequence rank preferences through expectation maximization algorithm.

    PubMed

    Zhang, ZhiZhuo; Chang, Cheng Wei; Hugo, Willy; Cheung, Edwin; Sung, Wing-Kin

    2013-03-01

    Although de novo motifs can be discovered through mining over-represented sequence patterns, this approach misses some real motifs and generates many false positives. To improve accuracy, one solution is to consider some additional binding features (i.e., position preference and sequence rank preference). This information is usually required from the user. This article presents a de novo motif discovery algorithm called SEME (sampling with expectation maximization for motif elicitation), which uses pure probabilistic mixture model to model the motif's binding features and uses expectation maximization (EM) algorithms to simultaneously learn the sequence motif, position, and sequence rank preferences without asking for any prior knowledge from the user. SEME is both efficient and accurate thanks to two important techniques: the variable motif length extension and importance sampling. Using 75 large-scale synthetic datasets, 32 metazoan compendium benchmark datasets, and 164 chromatin immunoprecipitation sequencing (ChIP-Seq) libraries, we demonstrated the superior performance of SEME over existing programs in finding transcription factor (TF) binding sites. SEME is further applied to a more difficult problem of finding the co-regulated TF (coTF) motifs in 15 ChIP-Seq libraries. It identified significantly more correct coTF motifs and, at the same time, predicted coTF motifs with better matching to the known motifs. Finally, we show that the learned position and sequence rank preferences of each coTF reveals potential interaction mechanisms between the primary TF and the coTF within these sites. Some of these findings were further validated by the ChIP-Seq experiments of the coTFs. The application is available online.

  8. Motif discovery and motif finding from genome-mapped DNase footprint data.

    PubMed

    Kulakovskiy, Ivan V; Favorov, Alexander V; Makeev, Vsevolod J

    2009-09-15

    Footprint data is an important source of information on transcription factor recognition motifs. However, a footprinting fragment can contain no sequences similar to known protein recognition sites. Inspection of genome fragments nearby can help to identify missing site positions. Genome fragments containing footprints were supplied to a pipeline that constructed a position weight matrix (PWM) for different motif lengths and selected the optimal PWM. Fragments were aligned with the SeSiMCMC sampler and a new heuristic algorithm, Bigfoot. Footprints with missing hits were found for approximately 50% of factors. Adding only 2 bp on both sides of a footprinting fragment recovered most hits. We automatically constructed motifs for 41 Drosophila factors. New motifs can recognize footprints with a greater sensitivity at the same false positive rate than existing models. Also we discuss possible overfitting of constructed motifs. Software and the collection of regulatory motifs are freely available at http://line.imb.ac.ru/DMMPMM.
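
    As a rough illustration of the PWM construction step, the sketch below builds a log-odds position weight matrix from already aligned sites and scans a fragment for its best hit. The pseudocount, the uniform background and all function names are assumptions made for the example. The alignment itself, which the abstract attributes to SeSiMCMC and Bigfoot, is not reproduced.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log-odds PWM from equal-length aligned sites (uniform background assumed)."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def score(pwm, window):
    """Sum of per-position log-odds for a window as long as the PWM."""
    return sum(col[base] for col, base in zip(pwm, window))


def best_hit(pwm, sequence):
    """0-based start of the highest-scoring window in the sequence."""
    k = len(pwm)
    return max(range(len(sequence) - k + 1),
               key=lambda i: score(pwm, sequence[i:i + k]))


pwm = build_pwm(["TATA", "TATA", "TACA", "TATT"])
start = best_hit(pwm, "GGTATAGG")
```

    Scanning a fragment extended by a few flanking bases on each side only requires passing the longer sequence to best_hit.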

  9. An novel frequent probability pattern mining algorithm based on circuit simulation method in uncertain biological networks.

    PubMed

    He, Jieyue; Wang, Chunyan; Qiu, Kunpu; Zhong, Wei

    2014-01-01

    Motif mining has always been a hot research topic in bioinformatics. Most of current research on biological networks focuses on exact motif mining. However, due to the inevitable experimental error and noisy data, biological network data represented as the probability model could better reflect the authenticity and biological significance, therefore, it is more biological meaningful to discover probability motif in uncertain biological networks. One of the key steps in probability motif mining is frequent pattern discovery which is usually based on the possible world model having a relatively high computational complexity. In this paper, we present a novel method for detecting frequent probability patterns based on circuit simulation in the uncertain biological networks. First, the partition based efficient search is applied to the non-tree like subgraph mining where the probability of occurrence in random networks is small. Then, an algorithm of probability isomorphic based on circuit simulation is proposed. The probability isomorphic combines the analysis of circuit topology structure with related physical properties of voltage in order to evaluate the probability isomorphism between probability subgraphs. The circuit simulation based probability isomorphic can avoid using traditional possible world model. Finally, based on the algorithm of probability subgraph isomorphism, two-step hierarchical clustering method is used to cluster subgraphs, and discover frequent probability patterns from the clusters. The experiment results on data sets of the Protein-Protein Interaction (PPI) networks and the transcriptional regulatory networks of E. coli and S. cerevisiae show that the proposed method can efficiently discover the frequent probability subgraphs. The discovered subgraphs in our study contain all probability motifs reported in the experiments published in other related papers. 
The probability graph isomorphism evaluation algorithm, based on the circuit simulation method, excludes most subgraphs that are not probability isomorphic and reduces the search space of probability-isomorphic subgraphs using the mismatch values in the node voltage set. It is an innovative way to find frequent probability patterns and can be efficiently applied to probability motif discovery problems in further studies.

  10. An novel frequent probability pattern mining algorithm based on circuit simulation method in uncertain biological networks

    PubMed Central

    2014-01-01

    Background Motif mining has always been a hot research topic in bioinformatics. Most of current research on biological networks focuses on exact motif mining. However, due to the inevitable experimental error and noisy data, biological network data represented as the probability model could better reflect the authenticity and biological significance, therefore, it is more biological meaningful to discover probability motif in uncertain biological networks. One of the key steps in probability motif mining is frequent pattern discovery which is usually based on the possible world model having a relatively high computational complexity. Methods In this paper, we present a novel method for detecting frequent probability patterns based on circuit simulation in the uncertain biological networks. First, the partition based efficient search is applied to the non-tree like subgraph mining where the probability of occurrence in random networks is small. Then, an algorithm of probability isomorphic based on circuit simulation is proposed. The probability isomorphic combines the analysis of circuit topology structure with related physical properties of voltage in order to evaluate the probability isomorphism between probability subgraphs. The circuit simulation based probability isomorphic can avoid using traditional possible world model. Finally, based on the algorithm of probability subgraph isomorphism, two-step hierarchical clustering method is used to cluster subgraphs, and discover frequent probability patterns from the clusters. Results The experiment results on data sets of the Protein-Protein Interaction (PPI) networks and the transcriptional regulatory networks of E. coli and S. cerevisiae show that the proposed method can efficiently discover the frequent probability subgraphs. The discovered subgraphs in our study contain all probability motifs reported in the experiments published in other related papers. 
Conclusions The probability graph isomorphism evaluation algorithm, based on the circuit simulation method, excludes most subgraphs that are not probability isomorphic and reduces the search space of probability-isomorphic subgraphs using the mismatch values in the node voltage set. It is an innovative way to find frequent probability patterns and can be efficiently applied to probability motif discovery problems in further studies. PMID:25350277

  11. WebMOTIFS: automated discovery, filtering and scoring of DNA sequence motifs using multiple programs and Bayesian approaches

    PubMed Central

    Romer, Katherine A.; Kayombya, Guy-Richard; Fraenkel, Ernest

    2007-01-01

    WebMOTIFS provides a web interface that facilitates the discovery and analysis of DNA-sequence motifs. Several studies have shown that the accuracy of motif discovery can be significantly improved by using multiple de novo motif discovery programs and using randomized control calculations to identify the most significant motifs or by using Bayesian approaches. WebMOTIFS makes it easy to apply these strategies. Using a single submission form, users can run several motif discovery programs and score, cluster and visualize the results. In addition, the Bayesian motif discovery program THEME can be used to determine the class of transcription factors that is most likely to regulate a set of sequences. Input can be provided as a list of gene or probe identifiers. Used with the default settings, WebMOTIFS accurately identifies biologically relevant motifs from diverse data in several species. WebMOTIFS is freely available at http://fraenkel.mit.edu/webmotifs. PMID:17584794

  12. SamSelect: a sample sequence selection algorithm for quorum planted motif search on large DNA datasets.

    PubMed

    Yu, Qiang; Wei, Dingbang; Huo, Hongwei

    2018-06-18

    Given a set of t n-length DNA sequences, q satisfying 0 < q ≤ 1, and l and d satisfying 0 ≤ d < l < n, the quorum planted motif search (qPMS) finds l-length strings that occur in at least qt input sequences with up to d mismatches and is mainly used to locate transcription factor binding sites in DNA sequences. Existing qPMS algorithms have been able to efficiently process small standard datasets (e.g., t = 20 and n = 600), but they are too time consuming to process large DNA datasets, such as ChIP-seq datasets that contain thousands of sequences or more. We analyze the effects of t and q on the time performance of qPMS algorithms and find that a large t or a small q causes a longer computation time. Based on this information, we improve the time performance of existing qPMS algorithms by selecting a sample sequence set D' with a small t and a large q from the large input dataset D and then executing qPMS algorithms on D'. A sample sequence selection algorithm named SamSelect is proposed. The experimental results on both simulated and real data show (1) that SamSelect can select D' efficiently and (2) that the qPMS algorithms executed on D' can find implanted or real motifs in a significantly shorter time than when executed on D. We improve the ability of existing qPMS algorithms to process large DNA datasets from the perspective of selecting high-quality sample sequence sets so that the qPMS algorithms can find motifs in a short time in the selected sample sequence set D', rather than take an unfeasibly long time to search the original sequence set D. Our motif discovery method is an approximate algorithm.
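
    The qPMS condition stated at the start of this abstract can be checked directly for a single candidate string. The sketch below is a brute-force verifier written only to illustrate the definition. It is not SamSelect or any published qPMS algorithm, and the function names are hypothetical.

```python
import math


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs(motif, sequence, d):
    """True if some window of the sequence is within d mismatches of the motif."""
    l = len(motif)
    return any(
        hamming(motif, sequence[i:i + l]) <= d
        for i in range(len(sequence) - l + 1)
    )


def is_quorum_motif(motif, sequences, d, q):
    """True if the motif occurs, with up to d mismatches, in at least q*t sequences."""
    needed = math.ceil(q * len(sequences))
    return sum(occurs(motif, s, d) for s in sequences) >= needed


seqs = ["ACGTACGT", "TTACGGTT", "GGGGGGGG", "ACCTAAAA"]
found = is_quorum_motif("ACGT", seqs, d=1, q=0.75)  # occurs in 3 of the 4 sequences
```

    Each call scans all t sequences, which is consistent with the abstract's observation that running time grows with t.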

  13. mHealth Visual Discovery Dashboard.

    PubMed

    Fang, Dezhi; Hohman, Fred; Polack, Peter; Sarker, Hillol; Kahng, Minsuk; Sharmin, Moushumi; al'Absi, Mustafa; Chau, Duen Horng

    2017-09-01

    We present Discovery Dashboard, a visual analytics system for exploring large volumes of time series data from mobile medical field studies. Discovery Dashboard offers interactive exploration tools and a data mining motif discovery algorithm to help researchers formulate hypotheses, discover trends and patterns, and ultimately gain a deeper understanding of their data. Discovery Dashboard emphasizes user freedom and flexibility during the data exploration process and enables researchers to do things previously challenging or impossible to do - in the web-browser and in real time. We demonstrate our system visualizing data from a mobile sensor study conducted at the University of Minnesota that included 52 participants who were trying to quit smoking.

  14. mHealth Visual Discovery Dashboard

    PubMed Central

    Fang, Dezhi; Hohman, Fred; Polack, Peter; Sarker, Hillol; Kahng, Minsuk; Sharmin, Moushumi; al'Absi, Mustafa; Chau, Duen Horng

    2018-01-01

    We present Discovery Dashboard, a visual analytics system for exploring large volumes of time series data from mobile medical field studies. Discovery Dashboard offers interactive exploration tools and a data mining motif discovery algorithm to help researchers formulate hypotheses, discover trends and patterns, and ultimately gain a deeper understanding of their data. Discovery Dashboard emphasizes user freedom and flexibility during the data exploration process and enables researchers to do things previously challenging or impossible to do — in the web-browser and in real time. We demonstrate our system visualizing data from a mobile sensor study conducted at the University of Minnesota that included 52 participants who were trying to quit smoking. PMID:29354812

  15. A study on the application of topic models to motif finding algorithms.

    PubMed

    Basha Gutierrez, Josep; Nakai, Kenta

    2016-12-22

    Topic models are statistical algorithms which try to discover the structure of a set of documents according to the abstract topics contained in them. Here we try to apply this approach to the discovery of the structure of the transcription factor binding sites (TFBS) contained in a set of biological sequences, which is a fundamental problem in molecular biology research for the understanding of transcriptional regulation. Here we present two methods that make use of topic models for motif finding. First, we developed an algorithm in which first a set of biological sequences are treated as text documents, and the k-mers contained in them as words, to then build a correlated topic model (CTM) and iteratively reduce its perplexity. We also used the perplexity measurement of CTMs to improve our previous algorithm based on a genetic algorithm and several statistical coefficients. The algorithms were tested with 56 data sets from four different species and compared to 14 other methods by the use of several coefficients both at nucleotide and site level. The results of our first approach showed a performance comparable to the other methods studied, especially at site level and in sensitivity scores, in which it scored better than any of the 14 existing tools. In the case of our previous algorithm, the new approach with the addition of the perplexity measurement clearly outperformed all of the other methods in sensitivity, both at nucleotide and site level, and in overall performance at site level. The statistics obtained show that the performance of a motif finding method based on the use of a CTM is satisfying enough to conclude that the application of topic models is a valid method for developing motif finding algorithms. Moreover, the addition of topic models to a previously developed method dramatically increased its performance, suggesting that this combined algorithm can be a useful tool to successfully predict motifs in different kinds of sets of DNA sequences.
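
    The idea of treating sequences as text documents and k-mers as words can be sketched as a plain bag-of-words construction. This is an illustration under simple assumptions (overlapping k-mers, raw counts); fitting the correlated topic model and the perplexity-based refinement described in the abstract are not shown, and the function names are hypothetical.

```python
from collections import Counter


def kmer_document(sequence, k):
    """Overlapping k-mers of a sequence, read as the words of a document."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def bag_of_kmers(sequences, k):
    """Vocabulary and document-term count matrix (one row per sequence)."""
    counts = [Counter(kmer_document(s, k)) for s in sequences]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix


vocabulary, matrix = bag_of_kmers(["ACGACG", "TTTT"], k=2)
```

    A matrix of this shape is the usual input to a topic-model library.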

  16. Counting of oligomers in sequences generated by markov chains for DNA motif discovery.

    PubMed

    Shan, Gao; Zheng, Wei-Mou

    2009-02-01

    By means of the technique of the imbedded Markov chain, an efficient algorithm is proposed to exactly calculate first, second moments of word counts and the probability for a word to occur at least once in random texts generated by a Markov chain. A generating function is introduced directly from the imbedded Markov chain to derive asymptotic approximations for the problem. Two Z-scores, one based on the number of sequences with hits and the other on the total number of word hits in a set of sequences, are examined for discovery of motifs on a set of promoter sequences extracted from A. thaliana genome. Source code is available at http://www.itp.ac.cn/zheng/oligo.c.

  17. SLIDER: a generic metaheuristic for the discovery of correlated motifs in protein-protein interaction networks.

    PubMed

    Boyen, Peter; Van Dyck, Dries; Neven, Frank; van Ham, Roeland C H J; van Dijk, Aalt D J

    2011-01-01

    Correlated motif mining (cmm) is the problem of finding overrepresented pairs of patterns, called motifs, in sequences of interacting proteins. Algorithmic solutions for cmm thereby provide a computational method for predicting binding sites for protein interaction. In this paper, we adopt a motif-driven approach where the support of candidate motif pairs is evaluated in the network. We experimentally establish the superiority of the Chi-square-based support measure over other support measures. Furthermore, we obtain that cmm is an np-hard problem for a large class of support measures (including Chi-square) and reformulate the search for correlated motifs as a combinatorial optimization problem. We then present the generic metaheuristic slider which uses steepest ascent with a neighborhood function based on sliding motifs and employs the Chi-square-based support measure. We show that slider outperforms existing motif-driven cmm methods and scales to large protein-protein interaction networks. The slider-implementation and the data used in the experiments are available on http://bioinformatics.uhasselt.be.

  18. Assessment of composite motif discovery methods.

    PubMed

    Klepper, Kjetil; Sandve, Geir K; Abul, Osman; Johansen, Jostein; Drablos, Finn

    2008-02-26

    Computational discovery of regulatory elements is an important area of bioinformatics research and more than a hundred motif discovery methods have been published. Traditionally, most of these methods have addressed the problem of single motif discovery - discovering binding motifs for individual transcription factors. In higher organisms, however, transcription factors usually act in combination with nearby bound factors to induce specific regulatory behaviours. Hence, recent focus has shifted from single motifs to the discovery of sets of motifs bound by multiple cooperating transcription factors, so called composite motifs or cis-regulatory modules. Given the large number and diversity of methods available, independent assessment of methods becomes important. Although there have been several benchmark studies of single motif discovery, no similar studies have previously been conducted concerning composite motif discovery. We have developed a benchmarking framework for composite motif discovery and used it to evaluate the performance of eight published module discovery tools. Benchmark datasets were constructed based on real genomic sequences containing experimentally verified regulatory modules, and the module discovery programs were asked to predict both the locations of these modules and to specify the single motifs involved. To aid the programs in their search, we provided position weight matrices corresponding to the binding motifs of the transcription factors involved. In addition, selections of decoy matrices were mixed with the genuine matrices on one dataset to test the response of programs to varying levels of noise. Although some of the methods tested tended to score somewhat better than others overall, there were still large variations between individual datasets and no single method performed consistently better than the rest in all situations. 
    The variation in performance on individual datasets also shows that the new benchmark datasets represent a suitable variety of challenges to most methods for module discovery.

  19. Motif-based analysis of large nucleotide data sets using MEME-ChIP

    PubMed Central

    Ma, Wenxiu; Noble, William S; Bailey, Timothy L

    2014-01-01

    MEME-ChIP is a web-based tool for analyzing motifs in large DNA or RNA data sets. It can analyze peak regions identified by ChIP-seq, cross-linking sites identified by CLIP-seq and related assays, as well as sets of genomic regions selected using other criteria. MEME-ChIP performs de novo motif discovery, motif enrichment analysis, motif location analysis and motif clustering, providing a comprehensive picture of the DNA or RNA motifs that are enriched in the input sequences. MEME-ChIP performs two complementary types of de novo motif discovery: weight matrix–based discovery for high accuracy; and word-based discovery for high sensitivity. Motif enrichment analysis using DNA or RNA motifs from human, mouse, worm, fly and other model organisms provides even greater sensitivity. MEME-ChIP’s interactive HTML output groups and aligns significant motifs to ease interpretation. This protocol takes less than 3 h, and it provides motif discovery approaches that are distinct and complementary to other online methods. PMID:24853928

  20. qPMS9: An Efficient Algorithm for Quorum Planted Motif Search

    NASA Astrophysics Data System (ADS)

    Nicolae, Marius; Rajasekaran, Sanguthevar

    2015-01-01

    Discovering patterns in biological sequences is a crucial problem. For example, the identification of patterns in DNA sequences has resulted in the determination of open reading frames, identification of gene promoter elements, intron/exon splicing sites, and SH RNAs, location of RNA degradation signals, identification of alternative splicing sites, etc. In protein sequences, patterns have led to domain identification, location of protease cleavage sites, identification of signal peptides, protein interactions, determination of protein degradation elements, identification of protein trafficking elements, discovery of short functional motifs, etc. In this paper we focus on the identification of an important class of patterns, namely, motifs. We study the (l, d) motif search problem or Planted Motif Search (PMS). PMS receives as input n strings and two integers l and d. It returns all sequences M of length l that occur in each input string, where each occurrence differs from M in at most d positions. Another formulation is quorum PMS (qPMS), where the motif appears in at least q% of the strings. We introduce qPMS9, a parallel exact qPMS algorithm that offers significant runtime improvements on DNA and protein datasets. qPMS9 solves the challenging DNA (l, d)-instances (28, 12) and (30, 13). The source code is available at https://code.google.com/p/qpms9/.
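    The (l, d) problem statement above can be made concrete with a small brute-force sketch. This is not qPMS9: it simply enumerates every candidate of length l over the alphabet, so its cost grows as |alphabet|^l and it is only usable for very small l. The function names and the `q` parameter (a percentage, with 100 giving plain PMS) are illustrative assumptions.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def occurs(motif, s, d):
    """True if some window of s differs from motif in at most d positions."""
    l = len(motif)
    return any(hamming(motif, s[i:i + l]) <= d for i in range(len(s) - l + 1))


def planted_motifs(strings, l, d, q=100.0, alphabet="ACGT"):
    """Brute-force (l, d) motif search; q is the quorum in percent."""
    hits = []
    for cand in product(alphabet, repeat=l):
        motif = "".join(cand)
        count = sum(occurs(motif, s, d) for s in strings)
        # Integer-friendly quorum test: count / n >= q / 100.
        if count * 100 >= q * len(strings):
            hits.append(motif)
    return hits


strings = ["ACGTAC", "TTCGTA", "GGCGTT"]
print(planted_motifs(strings, l=3, d=0))          # motifs present in every string
print(planted_motifs(strings, l=3, d=0, q=60.0))  # motifs present in at least 60%
```

    Exact solvers avoid this exhaustive enumeration; the sketch is meant only to pin down what counts as a valid answer.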

  1. Non-B DB v2.0: a database of predicted non-B DNA-forming motifs and its associated tools.

    PubMed

    Cer, Regina Z; Donohue, Duncan E; Mudunuri, Uma S; Temiz, Nuri A; Loss, Michael A; Starner, Nathan J; Halusa, Goran N; Volfovsky, Natalia; Yi, Ming; Luke, Brian T; Bacolla, Albino; Collins, Jack R; Stephens, Robert M

    2013-01-01

    The non-B DB, available at http://nonb.abcc.ncifcrf.gov, catalogs predicted non-B DNA-forming sequence motifs, including Z-DNA, G-quadruplex, A-phased repeats, inverted repeats, mirror repeats, direct repeats and their corresponding subsets: cruciforms, triplexes and slipped structures, in several genomes. Version 2.0 of the database revises and re-implements the motif discovery algorithms to better align with accepted definitions and thresholds for motifs, expands the non-B DNA-forming motifs coverage by including short tandem repeats and adds key visualization tools to compare motif locations relative to other genomic annotations. Non-B DB v2.0 extends the ability for comparative genomics by including re-annotation of the five organisms reported in non-B DB v1.0, human, chimpanzee, dog, macaque and mouse, and adds seven additional organisms: orangutan, rat, cow, pig, horse, platypus and Arabidopsis thaliana. Additionally, the non-B DB v2.0 provides an overall improved graphical user interface and faster query performance.

  2. Discovering Sequence Motifs with Arbitrary Insertions and Deletions

    PubMed Central

    Frith, Martin C.; Saunders, Neil F. W.; Kobe, Bostjan; Bailey, Timothy L.

    2008-01-01

    Biology is encoded in molecular sequences: deciphering this encoding remains a grand scientific challenge. Functional regions of DNA, RNA, and protein sequences often exhibit characteristic but subtle motifs; thus, computational discovery of motifs in sequences is a fundamental and much-studied problem. However, most current algorithms do not allow for insertions or deletions (indels) within motifs, and the few that do have other limitations. We present a method, GLAM2 (Gapped Local Alignment of Motifs), for discovering motifs allowing indels in a fully general manner, and a companion method GLAM2SCAN for searching sequence databases using such motifs. glam2 is a generalization of the gapless Gibbs sampling algorithm. It re-discovers variable-width protein motifs from the PROSITE database significantly more accurately than the alternative methods PRATT and SAM-T2K. Furthermore, it usefully refines protein motifs from the ELM database: in some cases, the refined motifs make orders of magnitude fewer overpredictions than the original ELM regular expressions. GLAM2 performs respectably on the BAliBASE multiple alignment benchmark, and may be superior to leading multiple alignment methods for “motif-like” alignments with N- and C-terminal extensions. Finally, we demonstrate the use of GLAM2 to discover protein kinase substrate motifs and a gapped DNA motif for the LIM-only transcriptional regulatory complex: using GLAM2SCAN, we identify promising targets for the latter. GLAM2 is especially promising for short protein motifs, and it should improve our ability to identify the protein cleavage sites, interaction sites, post-translational modification attachment sites, etc., that underlie much of biology. It may be equally useful for arbitrarily gapped motifs in DNA and RNA, although fewer examples of such motifs are known at present. GLAM2 is public domain software, available for download at http://bioinformatics.org.au/glam2. PMID:18437229

  3. Systematic and fully automated identification of protein sequence patterns.

    PubMed

    Hart, R K; Royyuru, A K; Stolovitzky, G; Califano, A

    2000-01-01

    We present an efficient algorithm to systematically and automatically identify patterns in protein sequence families. The procedure is based on the Splash deterministic pattern discovery algorithm and on a framework to assess the statistical significance of patterns. We demonstrate its application to the fully automated discovery of patterns in 974 PROSITE families (the complete subset of PROSITE families which are defined by patterns and contain DR records). Splash generates patterns with better specificity and undiminished sensitivity, or vice versa, in 28% of the families; identical statistics were obtained in 48% of the families, worse statistics in 15%, and mixed behavior in the remaining 9%. In about 75% of the cases, Splash patterns identify sequence sites that overlap more than 50% with the corresponding PROSITE pattern. The procedure is sufficiently rapid to enable its use for daily curation of existing motif and profile databases. Third, our results show that the statistical significance of discovered patterns correlates well with their biological significance. The trypsin subfamily of serine proteases is used to illustrate this method's ability to exhaustively discover all motifs in a family that are statistically and biologically significant. Finally, we discuss applications of sequence patterns to multiple sequence alignment and the training of more sensitive score-based motif models, akin to the procedure used by PSI-BLAST. All results are available at http://www.research.ibm.com/spat/.

  4. A New Algorithm for Identifying Cis-Regulatory Modules Based on Hidden Markov Model

    PubMed Central

    2017-01-01

    The discovery of cis-regulatory modules (CRMs) is the key to understanding mechanisms of transcription regulation. Since CRMs have specific regulatory structures that are the basis for the regulation of gene expression, how to model the regulatory structure of CRMs has a considerable impact on the performance of CRM identification. The paper proposes a CRM discovery algorithm called ComSPS. ComSPS builds a regulatory structure model of CRMs based on HMM by exploring the rules of CRM transcriptional grammar that governs the internal motif site arrangement of CRMs. We test ComSPS on three benchmark datasets and compare it with five existing methods. Experimental results show that ComSPS performs better than them. PMID:28497059

  5. Regulatory sequence analysis tools.

    PubMed

    van Helden, Jacques

    2003-07-01

    The web resource Regulatory Sequence Analysis Tools (RSAT) (http://rsat.ulb.ac.be/rsat) offers a collection of software tools dedicated to the prediction of regulatory sites in non-coding DNA sequences. These tools include sequence retrieval, pattern discovery, pattern matching, genome-scale pattern matching, feature-map drawing, random sequence generation and other utilities. Alternative formats are supported for the representation of regulatory motifs (strings or position-specific scoring matrices) and several algorithms are proposed for pattern discovery. RSAT currently holds >100 fully sequenced genomes and these data are regularly updated from GenBank.

  6. Direct AUC optimization of regulatory motifs.

    PubMed

    Zhu, Lin; Zhang, Hong-Bo; Huang, De-Shuang

    2017-07-15

    The discovery of transcription factor binding site (TFBS) motifs is essential for untangling the complex mechanism of genetic variation under different developmental and environmental conditions. Among the huge amount of computational approaches for de novo identification of TFBS motifs, discriminative motif learning (DML) methods have been proven to be promising for harnessing the discovery power of accumulated huge amount of high-throughput binding data. However, they have to sacrifice accuracy for speed and could fail to fully utilize the information of the input sequences. We propose a novel algorithm called CDAUC for optimizing DML-learned motifs based on the area under the receiver-operating characteristic curve (AUC) criterion, which has been widely used in the literature to evaluate the significance of extracted motifs. We show that when the considered AUC loss function is optimized in a coordinate-wise manner, the cost function of each resultant sub-problem is a piece-wise constant function, whose optimal value can be found exactly and efficiently. Further, a key step of each iteration of CDAUC can be efficiently solved as a computational geometry problem. Experimental results on real world high-throughput datasets illustrate that CDAUC outperforms competing methods for refining DML motifs, while being one order of magnitude faster. Meanwhile, preliminary results also show that CDAUC may also be useful for improving the interpretability of convolutional kernels generated by the emerging deep learning approaches for predicting TF sequences specificities. CDAUC is available at https://drive.google.com/drive/folders/0BxOW5MtIZbJjNFpCeHlBVWJHeW8
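    The AUC criterion mentioned above can be illustrated with a minimal sketch. It computes the AUC as the fraction of (positive, negative) score pairs that are ranked correctly, counting ties as half. This is not the CDAUC optimizer; the function name and the example scores are assumptions made for illustration. Because the value depends only on the ordering of scores, small score changes that do not reorder any pair leave it unchanged, which is consistent with the piece-wise constant behaviour the abstract describes.

```python
def auc(pos_scores, neg_scores):
    """AUC as the probability that a positive outscores a negative (ties = 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical motif scores for bound (positive) and unbound (negative) sequences.
pos = [0.9, 0.8, 0.4]
neg = [0.5, 0.3]
print(auc(pos, neg))

# A small shift that reorders no pair leaves the AUC unchanged.
print(auc([s + 0.01 for s in pos], neg))
```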

  7. De novo discovery of structural motifs in RNA 3D structures through clustering.

    PubMed

    Ge, Ping; Islam, Shahidul; Zhong, Cuncong; Zhang, Shaojie

    2018-05-18

    As functional components in three-dimensional (3D) conformation of an RNA, the RNA structural motifs provide an easy way to associate the molecular architectures with their biological mechanisms. In the past years, many computational tools have been developed to search motif instances by using the existing knowledge of well-studied families. Recently, with the rapidly increasing number of resolved RNA 3D structures, there is an urgent need to discover novel motifs with the newly presented information. In this work, we classify all the loops in non-redundant RNA 3D structures to detect plausible RNA structural motif families by using a clustering pipeline. Compared with other clustering approaches, our method has two benefits: first, the underlying alignment algorithm is tolerant to the variations in 3D structures. Second, sophisticated downstream analysis has been performed to ensure the clusters are valid and easily applied to further research. The final clustering results contain many interesting new variants of known motif families, such as GNAA tetraloop, kink-turn, sarcin-ricin and T-loop. We have also discovered potential novel functional motifs conserved in ribosomal RNA, sgRNA, SRP RNA, riboswitch and ribozyme.

  8. Argo_CUDA: Exhaustive GPU based approach for motif discovery in large DNA datasets.

    PubMed

    Vishnevsky, Oleg V; Bocharnikov, Andrey V; Kolchanov, Nikolay A

    2018-02-01

    The development of chromatin immunoprecipitation sequencing (ChIP-seq) technology has revolutionized the genetic analysis of the basic mechanisms underlying transcription regulation and led to accumulation of information about a huge amount of DNA sequences. There are a lot of web services which are currently available for de novo motif discovery in datasets containing information about DNA/protein binding. An enormous motif diversity makes their finding challenging. In order to avoid the difficulties, researchers use different stochastic approaches. Unfortunately, the efficiency of the motif discovery programs dramatically declines with the query set size increase. This leads to the fact that only a fraction of top "peak" ChIP-Seq segments can be analyzed or the area of analysis should be narrowed. Thus, the motif discovery in massive datasets remains a challenging issue. Argo_Compute Unified Device Architecture (CUDA) web service is designed to process the massive DNA data. It is a program for the detection of degenerate oligonucleotide motifs of fixed length written in 15-letter IUPAC code. Argo_CUDA is a full-exhaustive approach based on the high-performance GPU technologies. Compared with the existing motif discovery web services, Argo_CUDA shows good prediction quality on simulated sets. The analysis of ChIP-Seq sequences revealed the motifs which correspond to known transcription factor binding sites.

  9. Robust and Accurate Anomaly Detection in ECG Artifacts Using Time Series Motif Discovery

    PubMed Central

    Sivaraks, Haemwaan

    2015-01-01

    Electrocardiogram (ECG) anomaly detection is an important technique for detecting dissimilar heartbeats which helps identify abnormal ECGs before the diagnosis process. Currently available ECG anomaly detection methods, ranging from academic research to commercial ECG machines, still suffer from a high false alarm rate because these methods are not able to differentiate ECG artifacts from real ECG signals, especially artifacts that are similar to ECG signals in terms of shape and/or frequency. The problem leads to high vigilance for physicians and misinterpretation risk for nonspecialists. Therefore, this work proposes a novel anomaly detection technique that is highly robust and accurate in the presence of ECG artifacts which can effectively reduce the false alarm rate. Expert knowledge from cardiologists and a motif discovery technique are used in our design. In addition, every step of the algorithm conforms to the interpretation of cardiologists. Our method can be applied to both single-lead ECGs and multilead ECGs. Our experimental results on real ECG datasets are interpreted and evaluated by cardiologists. Our proposed algorithm can mostly achieve 100% of accuracy on detection (AoD), sensitivity, specificity, and positive predictive value with 0% false alarm rate. The results demonstrate that our proposed method is highly accurate and robust to artifacts, compared with competitive anomaly detection methods. PMID:25688284

  10. The nitrogen responsive transcriptome in potato (Solanum tuberosum L.) reveals significant gene regulatory motifs.

    PubMed

    Gálvez, José Héctor; Tai, Helen H; Lagüe, Martin; Zebarth, Bernie J; Strömvik, Martina V

    2016-05-19

    Nitrogen (N) is the most important nutrient for the growth of potato (Solanum tuberosum L.). Foliar gene expression in potato plants with and without N supplementation at 180 kg N ha(-1) was compared at mid-season. Genes with consistent differences in foliar expression due to N supplementation over three cultivars and two developmental time points were examined. In total, thirty genes were found to be over-expressed and nine genes were found to be under-expressed with supplemented N. Functional relationships between over-expressed genes were found. The main metabolic pathway represented among differentially expressed genes was amino acid metabolism. The 1000 bp upstream flanking regions of the differentially expressed genes were analysed and nine overrepresented motifs were found using three motif discovery algorithms (Seeder, Weeder and MEME). These results point to coordinated gene regulation at the transcriptional level controlling steady state potato responses to N sufficiency.

  11. The nitrogen responsive transcriptome in potato (Solanum tuberosum L.) reveals significant gene regulatory motifs

    PubMed Central

    Gálvez, José Héctor; Tai, Helen H.; Lagüe, Martin; Zebarth, Bernie J.; Strömvik, Martina V.

    2016-01-01

    Nitrogen (N) is the most important nutrient for the growth of potato (Solanum tuberosum L.). Foliar gene expression in potato plants with and without N supplementation at 180 kg N ha−1 was compared at mid-season. Genes with consistent differences in foliar expression due to N supplementation over three cultivars and two developmental time points were examined. In total, thirty genes were found to be over-expressed and nine genes were found to be under-expressed with supplemented N. Functional relationships between over-expressed genes were found. The main metabolic pathway represented among differentially expressed genes was amino acid metabolism. The 1000 bp upstream flanking regions of the differentially expressed genes were analysed and nine overrepresented motifs were found using three motif discovery algorithms (Seeder, Weeder and MEME). These results point to coordinated gene regulation at the transcriptional level controlling steady state potato responses to N sufficiency. PMID:27193058

  12. DLocalMotif: a discriminative approach for discovering local motifs in protein sequences.

    PubMed

    Mehdi, Ahmed M; Sehgal, Muhammad Shoaib B; Kobe, Bostjan; Bailey, Timothy L; Bodén, Mikael

    2013-01-01

    Local motifs are patterns of DNA or protein sequences that occur within a sequence interval relative to a biologically defined anchor or landmark. Current protein motif discovery methods do not adequately consider such constraints to identify biologically significant motifs that are only weakly over-represented but spatially confined. Using negatives, i.e. sequences known to not contain a local motif, can further increase the specificity of their discovery. This article introduces the method DLocalMotif that makes use of positional information and negative data for local motif discovery in protein sequences. DLocalMotif combines three scoring functions, measuring degrees of motif over-representation, entropy and spatial confinement, specifically designed to discriminatively exploit the availability of negative data. The method is shown to outperform current methods that use only a subset of these motif characteristics. We apply the method to several biological datasets. The analysis of peroxisomal targeting signals uncovers several novel motifs that occur immediately upstream of the dominant peroxisomal targeting signal-1 signal. The analysis of proline-tyrosine nuclear localization signals uncovers multiple novel motifs that overlap with C2H2 zinc finger domains. We also evaluate the method on classical nuclear localization signals and endoplasmic reticulum retention signals and find that DLocalMotif successfully recovers biologically relevant sequence properties. http://bioinf.scmb.uq.edu.au/dlocalmotif/

  13. DNA motif alignment by evolving a population of Markov chains.

    PubMed

    Bi, Chengpeng

    2009-01-30

    Deciphering cis-regulatory elements or de novo motif-finding in genomes still remains elusive although much algorithmic effort has been expended. The Markov chain Monte Carlo (MCMC) method such as Gibbs motif samplers has been widely employed to solve the de novo motif-finding problem through sequence local alignment. Nonetheless, the MCMC-based motif samplers still suffer from local maxima like EM. Therefore, as a prerequisite for finding good local alignments, these motif algorithms are often independently run a multitude of times, but without information exchange between different chains. Hence it would be worth a new algorithm design enabling such information exchange. This paper presents a novel motif-finding algorithm by evolving a population of Markov chains with information exchange (PMC), each of which is initialized as a random alignment and run by the Metropolis-Hastings sampler (MHS). It is progressively updated through a series of local alignments stochastically sampled. Explicitly, the PMC motif algorithm performs stochastic sampling as specified by a population-based proposal distribution rather than individual ones, and adaptively evolves the population as a whole towards a global maximum. The alignment information exchange is accomplished by taking advantage of the pooled motif site distributions. A distinct method for running multiple independent Markov chains (IMC) without information exchange, or dubbed as the IMC motif algorithm, is also devised to compare with its PMC counterpart. Experimental studies demonstrate that the performance could be improved if pooled information were used to run a population of motif samplers. The new PMC algorithm was able to improve the convergence and outperformed other popular algorithms tested using simulated and biological motif sequences.

  14. Using SCOPE to identify potential regulatory motifs in coregulated genes.

    PubMed

    Martyanov, Viktor; Gross, Robert H

    2011-05-31

    SCOPE is an ensemble motif finder that uses three component algorithms in parallel to identify potential regulatory motifs by over-representation and motif position preference. Each component algorithm is optimized to find a different kind of motif. By taking the best of these three approaches, SCOPE performs better than any single algorithm, even in the presence of noisy data. In this article, we utilize a web version of SCOPE to examine genes that are involved in telomere maintenance. SCOPE has been incorporated into at least two other motif finding programs and has been used in other studies. The three algorithms that comprise SCOPE are BEAM, which finds non-degenerate motifs (ACCGGT), PRISM, which finds degenerate motifs (ASCGWT), and SPACER, which finds longer bipartite motifs (ACCnnnnnnnnGGT). These three algorithms have been optimized to find their corresponding type of motif. Together, they allow SCOPE to perform extremely well. Once a gene set has been analyzed and candidate motifs identified, SCOPE can look for other genes that contain the motif which, when added to the original set, will improve the motif score. This can occur through over-representation or motif position preference. Working with partial gene sets that have biologically verified transcription factor binding sites, SCOPE was able to identify most of the rest of the genes also regulated by the given transcription factor. Output from SCOPE shows candidate motifs, their significance, and other information both as a table and as a graphical motif map. FAQs and video tutorials are available at the SCOPE web site which also includes a "Sample Search" button that allows the user to perform a trial run. SCOPE has a very friendly user interface that enables novice users to access the algorithm's full power without having to become an expert in the bioinformatics of motif finding. As input, SCOPE can take a list of genes, or FASTA sequences. These can be entered in browser text fields, or read from a file. The output from SCOPE contains a list of all identified motifs with their scores, number of occurrences, fraction of genes containing the motif, and the algorithm used to identify the motif. For each motif, result details include a consensus representation of the motif, a sequence logo, a position weight matrix, and a list of instances for every motif occurrence (with exact positions and "strand" indicated). Results are returned in a browser window and also optionally by email. Previous papers describe the SCOPE algorithms in detail.
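    The three motif notations quoted above (ACCGGT, ASCGWT, ACCnnnnnnnnGGT) all use the standard IUPAC nucleotide codes, so a consensus string can be turned into a regular expression for scanning. The sketch below shows one way to do that. It is not part of SCOPE, the function names are hypothetical, and it scans only the given strand.

```python
import re

# Standard 15-letter IUPAC nucleotide code.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def consensus_to_regex(consensus):
    """Translate an IUPAC consensus (case-insensitive) into a regex string."""
    parts = []
    for ch in consensus.upper():
        bases = IUPAC[ch]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


def find_sites(consensus, sequence):
    """Return (start, matched_text) for every site, including overlapping ones."""
    # The lookahead lets the regex engine report overlapping occurrences.
    pattern = re.compile("(?=(" + consensus_to_regex(consensus) + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]


print(consensus_to_regex("ASCGWT"))
print(find_sites("ASCGWT", "TTACCGATGG"))
print(find_sites("ACCnnnnnnnnGGT", "ACCAAAATTTTGGT"))
```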

  15. Identification of Predictive Cis-Regulatory Elements Using a Discriminative Objective Function and a Dynamic Search Space

    PubMed Central

    Karnik, Rahul; Beer, Michael A.

    2015-01-01

    The generation of genomic binding or accessibility data from massively parallel sequencing technologies such as ChIP-seq and DNase-seq continues to accelerate. Yet state-of-the-art computational approaches for the identification of DNA binding motifs often yield motifs of weak predictive power. Here we present a novel computational algorithm called MotifSpec, designed to find predictive motifs, in contrast to over-represented sequence elements. The key distinguishing feature of this algorithm is that it uses a dynamic search space and a learned threshold to find discriminative motifs in combination with the modeling of motifs using a full PWM (position weight matrix) rather than k-mer words or regular expressions. We demonstrate that our approach finds motifs corresponding to known binding specificities in several mammalian ChIP-seq datasets, and that our PWMs classify the ChIP-seq signals with accuracy comparable to, or marginally better than motifs from the best existing algorithms. In other datasets, our algorithm identifies novel motifs where other methods fail. Finally, we apply this algorithm to detect motifs from expression datasets in C. elegans using a dynamic expression similarity metric rather than fixed expression clusters, and find novel predictive motifs. PMID:26465884
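    Scoring a sequence with a full PWM and comparing the best window against a threshold, as described above, can be sketched as follows. This is not MotifSpec: the log-odds construction, the uniform background, the pseudocount and the fixed threshold (which MotifSpec learns) are all assumptions chosen to keep the example short. Sequences are assumed to contain only A, C, G and T and to be at least as long as the motif.

```python
import math


def log_odds_pwm(counts, background=0.25, pseudocount=1.0):
    """Build a log2-odds PWM from per-position base counts (list of dicts)."""
    pwm = []
    for col in counts:
        total = sum(col.get(b, 0) for b in "ACGT") + 4 * pseudocount
        pwm.append({
            b: math.log2(((col.get(b, 0) + pseudocount) / total) / background)
            for b in "ACGT"
        })
    return pwm


def best_score(pwm, seq):
    """Highest PWM score over all windows of seq (seq must be >= motif width)."""
    w = len(pwm)
    return max(
        sum(pwm[j][seq[i + j]] for j in range(w))
        for i in range(len(seq) - w + 1)
    )


def classify(pwm, seq, threshold):
    """Call a sequence positive when its best window reaches the threshold."""
    return best_score(pwm, seq) >= threshold


# Hypothetical counts for a 3-bp motif resembling ACG.
pwm = log_odds_pwm([{"A": 10}, {"C": 10}, {"G": 10}])
print(best_score(pwm, "TTACGTT"), classify(pwm, "TTACGTT", threshold=3.0))
print(best_score(pwm, "TTTTTTT"), classify(pwm, "TTTTTTT", threshold=3.0))
```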

  16. Identification of Predictive Cis-Regulatory Elements Using a Discriminative Objective Function and a Dynamic Search Space.

    PubMed

    Karnik, Rahul; Beer, Michael A

    2015-01-01

    The generation of genomic binding or accessibility data from massively parallel sequencing technologies such as ChIP-seq and DNase-seq continues to accelerate. Yet state-of-the-art computational approaches for the identification of DNA binding motifs often yield motifs of weak predictive power. Here we present a novel computational algorithm called MotifSpec, designed to find predictive motifs, in contrast to over-represented sequence elements. The key distinguishing feature of this algorithm is that it uses a dynamic search space and a learned threshold to find discriminative motifs in combination with the modeling of motifs using a full PWM (position weight matrix) rather than k-mer words or regular expressions. We demonstrate that our approach finds motifs corresponding to known binding specificities in several mammalian ChIP-seq datasets, and that our PWMs classify the ChIP-seq signals with accuracy comparable to, or marginally better than motifs from the best existing algorithms. In other datasets, our algorithm identifies novel motifs where other methods fail. Finally, we apply this algorithm to detect motifs from expression datasets in C. elegans using a dynamic expression similarity metric rather than fixed expression clusters, and find novel predictive motifs.

  17. cWINNOWER Algorithm for Finding Fuzzy DNA Motifs

    NASA Technical Reports Server (NTRS)

    Liang, Shoudan

    2003-01-01

    The cWINNOWER algorithm detects fuzzy motifs in DNA sequences rich in protein-binding signals. A signal is defined as any short nucleotide pattern having up to d mutations differing from a motif of length l. The algorithm finds such motifs if multiple mutated copies of the motif (i.e., the signals) are present in the DNA sequence in sufficient abundance. The cWINNOWER algorithm substantially improves the sensitivity of the winnower method of Pevzner and Sze by imposing a consensus constraint, enabling it to detect much weaker signals. We studied the minimum number of detectable motifs qc as a function of sequence length N for random sequences. We found that qc increases linearly with N for a fast version of the algorithm based on counting three-member sub-cliques. Imposing consensus constraints reduces qc, by a factor of three in this case, which makes the algorithm dramatically more sensitive. Our most sensitive algorithm, which counts four-member sub-cliques, needs a minimum of only 13 signals to detect motifs in a sequence of length N = 12000 for (l,d) = (15,4).

  18. Using RSAT oligo-analysis and dyad-analysis tools to discover regulatory signals in nucleic sequences.

    PubMed

    Defrance, Matthieu; Janky, Rekin's; Sand, Olivier; van Helden, Jacques

    2008-01-01

    This protocol explains how to discover functional signals in genomic sequences by detecting over- or under-represented oligonucleotides (words) or spaced pairs thereof (dyads) with the Regulatory Sequence Analysis Tools (http://rsat.ulb.ac.be/rsat/). Two typical applications are presented: (i) predicting transcription factor-binding motifs in promoters of coregulated genes and (ii) discovering phylogenetic footprints in promoters of orthologous genes. The steps of this protocol include purging genomic sequences to discard redundant fragments, discovering over-represented patterns and assembling them to obtain degenerate motifs, scanning sequences and drawing feature maps. The main strength of the method is its statistical ground: the binomial significance provides an efficient control on the rate of false positives. In contrast with optimization-based pattern discovery algorithms, the method supports the detection of under- as well as over-represented motifs. Computation times vary from seconds (gene clusters) to minutes (whole genomes). The execution of the whole protocol should take approximately 1 h.
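    The idea of scoring word over-representation with a binomial significance can be sketched in a few lines. This is not the RSAT implementation: the uniform background (each k-mer expected with probability 0.25^k), the function names and the example sequences are assumptions, and no correction is made for overlapping occurrences or for multiple testing.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def count_words(sequences, k):
    """Count every k-mer and the total number of k-mer positions scanned."""
    counts = {}
    positions = 0
    for s in sequences:
        for i in range(len(s) - k + 1):
            word = s[i:i + k]
            counts[word] = counts.get(word, 0) + 1
            positions += 1
    return counts, positions


def over_represented(sequences, k, alpha=1e-4):
    """Words whose binomial P-value is at most alpha, most significant first."""
    counts, n = count_words(sequences, k)
    p = 0.25 ** k  # assumed uniform background; real tools use calibrated models
    scored = [(w, c, binomial_upper_tail(c, n, p)) for w, c in counts.items()]
    return sorted((r for r in scored if r[2] <= alpha), key=lambda r: r[2])


# Hypothetical promoter fragments sharing the hexamer TTGACA.
seqs = ["TTGACATTGACATTGACA", "CCTTGACAGG"]
for word, count, pval in over_represented(seqs, k=6):
    print(word, count, pval)
```

    Replacing the upper tail with the lower tail P(X <= k) would flag under-represented words in the same way.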

  19. ChIP-PaM: an algorithm to identify protein-DNA interaction using ChIP-Seq data.

    PubMed

    Wu, Song; Wang, Jianmin; Zhao, Wei; Pounds, Stanley; Cheng, Cheng

    2010-06-03

    ChIP-Seq is a powerful tool for identifying the interaction between genomic regulators and their bound DNAs, especially for locating transcription factor binding sites. However, high cost and high rate of false discovery of transcription factor binding sites identified from ChIP-Seq data significantly limit its application. Here we report a new algorithm, ChIP-PaM, for identifying transcription factor target regions in ChIP-Seq datasets. This algorithm makes full use of a protein-DNA binding pattern by capitalizing on three lines of evidence: 1) the tag count modelling at the peak position, 2) pattern matching of a specific tag count distribution, and 3) motif searching along the genome. A novel data-based two-step eFDR procedure is proposed to integrate the three lines of evidence to determine significantly enriched regions. Our algorithm requires no technical controls and efficiently discriminates falsely enriched regions from regions enriched by true transcription factor (TF) binding on the basis of ChIP-Seq data only. An analysis of real genomic data is presented to demonstrate our method. In a comparison with other existing methods, we found that our algorithm provides more accurate binding site discovery while maintaining comparable statistical power.

  20. Evaluation of phylogenetic footprint discovery for predicting bacterial cis-regulatory elements and revealing their evolution.

    PubMed

    Janky, Rekin's; van Helden, Jacques

    2008-01-23

    The detection of conserved motifs in promoters of orthologous genes (phylogenetic footprints) has become a common strategy to predict cis-acting regulatory elements. Several software tools are routinely used to raise hypotheses about regulation. However, these tools are generally used as black boxes, with default parameters. A systematic evaluation of optimal parameters for a footprint discovery strategy can bring a sizeable improvement to the predictions. We evaluate the performances of a footprint discovery approach based on the detection of over-represented spaced motifs. This method is particularly suitable for (but not restricted to) Bacteria, since such motifs are typically bound by factors containing a Helix-Turn-Helix domain. We evaluated footprint discovery in 368 Escherichia coli K12 genes with annotated sites, under 40 different combinations of parameters (taxonomical level, background model, organism-specific filtering, operon inference). Motifs are assessed both at the levels of correctness and significance. We further report a detailed analysis of 181 bacterial orthologs of the LexA repressor. Distinct motifs are detected at various taxonomical levels, including the 7 previously characterized taxon-specific motifs. In addition, we highlight a significantly stronger conservation of half-motifs in Actinobacteria, relative to Firmicutes, suggesting an intermediate state in specificity switching between the two Gram-positive phyla, and thereby revealing the on-going evolution of LexA auto-regulation. The footprint discovery method proposed here shows excellent results with E. coli and can readily be extended to predict cis-acting regulatory signals and propose testable hypotheses in bacterial genomes for which nothing is known about regulation.

  1. Motif discovery with data mining in 3D protein structure databases: discovery, validation and prediction of the U-shape zinc binding ("Huf-Zinc") motif.

    PubMed

    Maurer-Stroh, Sebastian; Gao, He; Han, Hao; Baeten, Lies; Schymkowitz, Joost; Rousseau, Frederic; Zhang, Louxin; Eisenhaber, Frank

    2013-02-01

    Data mining in protein databases, derivatives from more fundamental protein 3D structure and sequence databases, has considerable unearthed potential for the discovery of sequence motif--structural motif--function relationships as the finding of the U-shape (Huf-Zinc) motif, originally a small student's project, exemplifies. The metal ion zinc is critically involved in universal biological processes, ranging from protein-DNA complexes and transcription regulation to enzymatic catalysis and metabolic pathways. Proteins have evolved a series of motifs to specifically recognize and bind zinc ions. Many of these, so called zinc fingers, are structurally independent globular domains with discontinuous binding motifs made up of residues mostly far apart in sequence. Through a systematic approach starting from the BRIX structure fragment database, we discovered that there exists another predictable subset of zinc-binding motifs that not only have a conserved continuous sequence pattern but also share a characteristic local conformation, despite being included in totally different overall folds. While this does not allow general prediction of all Zn binding motifs, a HMM-based web server, Huf-Zinc, is available for prediction of these novel, as well as conventional, zinc finger motifs in protein sequences. The Huf-Zinc webserver can be freely accessed through this URL (http://mendel.bii.a-star.edu.sg/METHODS/hufzinc/).

  2. De Novo Regulatory Motif Discovery Identifies Significant Motifs in Promoters of Five Classes of Plant Dehydrin Genes.

    PubMed

    Zolotarov, Yevgen; Strömvik, Martina

    2015-01-01

    Plants accumulate dehydrins in response to osmotic stresses. Dehydrins are divided into five different classes, which are thought to be regulated in different manners. To better understand differences in transcriptional regulation of the five dehydrin classes, de novo motif discovery was performed on 350 dehydrin promoter sequences from a total of 51 plant genomes. Overrepresented motifs were identified in the promoters of five dehydrin classes. The Kn dehydrin promoters contain motifs linked with meristem specific expression, as well as motifs linked with cold/dehydration and abscisic acid response. KS dehydrin promoters contain a motif with a GATA core. SKn and YnSKn dehydrin promoters contain motifs that match elements connected with cold/dehydration, abscisic acid and light response. YnKn dehydrin promoters contain motifs that match abscisic acid and light response elements, but not cold/dehydration response elements. Conserved promoter motifs are present in the dehydrin classes and across different plant lineages, indicating that dehydrin gene regulation is likely also conserved.

  3. The BaMM web server for de-novo motif discovery and regulatory sequence analysis.

    PubMed

    Kiesel, Anja; Roth, Christian; Ge, Wanwan; Wess, Maximilian; Meier, Markus; Söding, Johannes

    2018-05-28

    The BaMM web server offers four tools: (i) de-novo discovery of enriched motifs in a set of nucleotide sequences, (ii) scanning a set of nucleotide sequences with motifs to find motif occurrences, (iii) searching with an input motif for similar motifs in our BaMM database with motifs for >1000 transcription factors, trained from the GTRD ChIP-seq database and (iv) browsing and keyword searching the motif database. In contrast to most other servers, we represent sequence motifs not by position weight matrices (PWMs) but by Bayesian Markov Models (BaMMs) of order 4, which we showed previously to perform substantially better in ROC analyses than PWMs or first order models. To address the inadequacy of P- and E-values as measures of motif quality, we introduce the AvRec score, the average recall over the TP-to-FP ratio between 1 and 100. The BaMM server is freely accessible without registration at https://bammmotif.mpibpc.mpg.de.

  4. cWINNOWER algorithm for finding fuzzy DNA motifs

    NASA Technical Reports Server (NTRS)

    Liang, S.; Samanta, M. P.; Biegel, B. A.

    2004-01-01

    The cWINNOWER algorithm detects fuzzy motifs in DNA sequences rich in protein-binding signals. A signal is defined as any short nucleotide pattern having up to d mutations differing from a motif of length l. The algorithm finds such motifs if a clique consisting of a sufficiently large number of mutated copies of the motif (i.e., the signals) is present in the DNA sequence. The cWINNOWER algorithm substantially improves the sensitivity of the winnower method of Pevzner and Sze by imposing a consensus constraint, enabling it to detect much weaker signals. We studied the minimum detectable clique size qc as a function of sequence length N for random sequences. We found that qc increases linearly with N for a fast version of the algorithm based on counting three-member sub-cliques. Imposing consensus constraints reduces qc by a factor of three in this case, which makes the algorithm dramatically more sensitive. Our most sensitive algorithm, which counts four-member sub-cliques, needs a minimum of only 13 signals to detect motifs in a sequence of length N = 12,000 for (l, d) = (15, 4). Copyright Imperial College Press.
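
    The definition of a signal used here (an l-mer within d mutations of the motif) implies that two signals of the same motif differ in at most 2d positions. The sketch below builds only that pairwise graph; it is not the cWINNOWER algorithm, and the function names and the non-overlap rule are assumptions made for illustration.

```python
from itertools import combinations

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))

def lmers(seq, l):
    """All (start, l-mer) pairs of a sequence."""
    return [(i, seq[i:i + l]) for i in range(len(seq) - l + 1)]

def signal_graph(seq, l, d):
    """Edges (i, j) between non-overlapping l-mers at Hamming distance <= 2d.

    Two copies of one motif, each carrying at most d mutations, always
    satisfy this bound, so a motif with q signals shows up as a clique
    of size q in this graph (together with many spurious edges).
    """
    return [
        (i, j)
        for (i, a), (j, b) in combinations(lmers(seq, l), 2)
        if j - i >= l and hamming(a, b) <= 2 * d
    ]
```

    Clique-based methods then look for large cliques in this graph; the pairwise comparison alone is quadratic in the sequence length.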

  5. Efficient sequential and parallel algorithms for finding edit distance based motifs.

    PubMed

    Pal, Soumitra; Xiao, Peng; Rajasekaran, Sanguthevar

    2016-08-18

    Motif search is an important step in extracting meaningful patterns from biological data. The general problem of motif search is intractable and there is a pressing need to develop efficient, exact and approximation algorithms to solve this problem. In this paper, we present several novel, exact, sequential and parallel algorithms for solving the (l,d) Edit-distance-based Motif Search (EMS) problem: given two integers l,d and n biological strings, find all strings of length l that appear in each input string with at most d errors of types substitution, insertion and deletion. One popular technique to solve the problem is to explore for each input string the set of all possible l-mers that belong to the d-neighborhood of any substring of the input string and output those which are common for all input strings. We introduce a novel and provably efficient neighborhood exploration technique. We show that it is enough to consider the candidates in the neighborhood which are at a distance exactly d. We compactly represent these candidate motifs using wildcard characters and efficiently explore them with very few repetitions. Our sequential algorithm uses a trie based data structure to efficiently store and sort the candidate motifs. Our parallel algorithm in a multi-core shared memory setting uses arrays for storing and a novel modification of radix-sort for sorting the candidate motifs. The algorithms for EMS are customarily evaluated on several challenging instances such as (8,1), (12,2), (16,3), (20,4), and so on. The best previously known algorithm, EMS1, is sequential and solves instances up to (16,3) in an estimated 3 days. Our sequential algorithms are more than 20 times faster on (16,3). On other hard instances such as (9,2), (11,3), (13,4), our algorithms are much faster. Our parallel algorithm has more than 600% scaling performance while using 16 threads. Our algorithms have pushed up the state-of-the-art of EMS solvers and we believe that the techniques introduced in this paper are also applicable to other motif search problems such as Planted Motif Search (PMS) and Simple Motif Search (SMS).
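
    The EMS problem statement in this abstract can be made concrete with a brute-force sketch. It is a reference implementation of the definition only, exponential in l, and not the neighborhood-exploration algorithm of the paper; the function names and the default DNA alphabet are assumptions.

```python
from itertools import product

def min_substring_edit_distance(pattern, text):
    """Smallest edit distance between `pattern` and any substring of `text`.

    Standard approximate-matching dynamic programme: the row for the empty
    pattern prefix is all zeros, so a match may start at any text position.
    """
    prev = [0] * (len(text) + 1)
    for i, pc in enumerate(pattern, 1):
        cur = [i]  # pattern prefix of length i against an empty substring
        for j, tc in enumerate(text, 1):
            cur.append(min(prev[j] + 1,                # pattern symbol unmatched
                           cur[j - 1] + 1,             # text symbol unmatched
                           prev[j - 1] + (pc != tc)))  # match or substitution
        prev = cur
    return min(prev)  # a match may end at any text position

def ems_brute_force(strings, l, d, alphabet="ACGT"):
    """All length-l strings occurring in every input string with <= d edits."""
    motifs = []
    for candidate in product(alphabet, repeat=l):
        motif = "".join(candidate)
        if all(min_substring_edit_distance(motif, s) <= d for s in strings):
            motifs.append(motif)
    return motifs
```

    Enumerating all 4^l candidates is only feasible for very small l, which is the cost that neighborhood-based solvers are designed to avoid.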

  6. Space-related pharma-motifs for fast search of protein binding motifs and polypharmacological targets

    PubMed Central

    2012-01-01

    Background: Discovering a compound that inhibits multiple proteins (i.e. polypharmacological targets) is a new paradigm for complex diseases (e.g. cancers and diabetes). In general, the polypharmacological proteins often share similar local binding environments and motifs. With the exponential growth in the number of protein structures, finding similar structural binding motifs (pharma-motifs) is an urgent task for drug discovery (e.g. side effects and new uses for old drugs) and the study of protein functions. Results: We have developed a Space-Related Pharmamotifs (called SRPmotif) method to recognize the binding motifs by searching against a protein structure database. SRPmotif is able to recognize conserved binding environments containing spatially discontinuous pharma-motifs, which are often short conserved peptides with specific physico-chemical properties for protein functions. Among 356 pharma-motifs, 56.5% of interacting residues are highly conserved. Experimental results indicate that 81.1% and 92.7% of polypharmacological targets of each protein-ligand complex are annotated with the same biological process (BP) and molecular function (MF) terms, respectively, based on Gene Ontology (GO). Our experimental results show that the identified pharma-motifs often consist of key residues in functional (active) sites and play key roles in protein functions. SRPmotif is available at http://gemdock.life.nctu.edu.tw/SRP/. Conclusions: SRPmotif is able to identify similar pharma-interfaces and pharma-motifs sharing similar binding environments for polypharmacological targets by rapidly searching against the protein structure database. Pharma-motifs describe the conservation of binding environments for drug discovery and protein functions. Additionally, these pharma-motifs provide clues for discovering new sequence-based motifs to predict protein functions from protein sequence databases. We believe that SRPmotif is useful for elucidating protein functions and drug discovery. PMID:23281852

  7. Space-related pharma-motifs for fast search of protein binding motifs and polypharmacological targets.

    PubMed

    Chiu, Yi-Yuan; Lin, Chun-Yu; Lin, Chih-Ta; Hsu, Kai-Cheng; Chang, Li-Zen; Yang, Jinn-Moon

    2012-01-01

    Discovering a compound that inhibits multiple proteins (i.e. polypharmacological targets) is a new paradigm for complex diseases (e.g. cancers and diabetes). In general, the polypharmacological proteins often share similar local binding environments and motifs. With the exponential growth in the number of protein structures, finding similar structural binding motifs (pharma-motifs) is an urgent task for drug discovery (e.g. side effects and new uses for old drugs) and the study of protein functions. We have developed a Space-Related Pharmamotifs (called SRPmotif) method to recognize the binding motifs by searching against a protein structure database. SRPmotif is able to recognize conserved binding environments containing spatially discontinuous pharma-motifs, which are often short conserved peptides with specific physico-chemical properties for protein functions. Among 356 pharma-motifs, 56.5% of interacting residues are highly conserved. Experimental results indicate that 81.1% and 92.7% of polypharmacological targets of each protein-ligand complex are annotated with the same biological process (BP) and molecular function (MF) terms, respectively, based on Gene Ontology (GO). Our experimental results show that the identified pharma-motifs often consist of key residues in functional (active) sites and play key roles in protein functions. SRPmotif is available at http://gemdock.life.nctu.edu.tw/SRP/. SRPmotif is able to identify similar pharma-interfaces and pharma-motifs sharing similar binding environments for polypharmacological targets by rapidly searching against the protein structure database. Pharma-motifs describe the conservation of binding environments for drug discovery and protein functions. Additionally, these pharma-motifs provide clues for discovering new sequence-based motifs to predict protein functions from protein sequence databases. We believe that SRPmotif is useful for elucidating protein functions and drug discovery.

  8. Sequence- and Interactome-Based Prediction of Viral Protein Hotspots Targeting Host Proteins: A Case Study for HIV Nef

    PubMed Central

    Sarmady, Mahdi; Dampier, William; Tozeren, Aydin

    2011-01-01

    Virus proteins alter protein pathways of the host toward the synthesis of viral particles by breaking and making edges via binding to host proteins. In this study, we developed a computational approach to predict viral sequence hotspots for binding to host proteins based on sequences of viral and host proteins and literature-curated virus-host protein interactome data. We use a motif discovery algorithm repeatedly on collections of sequences of viral proteins and immediate binding partners of their host targets and choose only those motifs that are conserved on viral sequences and highly statistically enriched among binding partners of virus protein targeted host proteins. Our results match experimental data on binding sites of Nef to host proteins such as MAPK1, VAV1, LCK, HCK, HLA-A, CD4, FYN, and GNB2L1 with high statistical significance but is a poor predictor of Nef binding sites on highly flexible, hoop-like regions. Predicted hotspots recapture CD8 cell epitopes of HIV Nef highlighting their importance in modulating virus-host interactions. Host proteins potentially targeted or outcompeted by Nef appear crowding the T cell receptor, natural killer cell mediated cytotoxicity, and neurotrophin signaling pathways. Scanning of HIV Nef motifs on multiple alignments of hepatitis C protein NS5A produces results consistent with literature, indicating the potential value of the hotspot discovery in advancing our understanding of virus-host crosstalk. PMID:21738584

  9. Seed storage protein gene promoters contain conserved DNA motifs in Brassicaceae, Fabaceae and Poaceae

    PubMed Central

    Fauteux, François; Strömvik, Martina V

    2009-01-01

    Background: Accurate computational identification of cis-regulatory motifs is difficult, particularly in eukaryotic promoters, which typically contain multiple short and degenerate DNA sequences bound by several interacting factors. Enrichment in combinations of rare motifs in the promoter sequence of functionally or evolutionarily related genes among several species is an indicator of conserved transcriptional regulatory mechanisms. This provides a basis for the computational identification of cis-regulatory motifs. Results: We have used a discriminative seeding DNA motif discovery algorithm for an in-depth analysis of 54 seed storage protein (SSP) gene promoters from three plant families, namely Brassicaceae (mustards), Fabaceae (legumes) and Poaceae (grasses) using backgrounds based on complete sets of promoters from a representative species in each family, namely Arabidopsis (Arabidopsis thaliana (L.) Heynh.), soybean (Glycine max (L.) Merr.) and rice (Oryza sativa L.) respectively. We have identified three conserved motifs (two RY-like and one ACGT-like) in Brassicaceae and Fabaceae SSP gene promoters that are similar to experimentally characterized seed-specific cis-regulatory elements. Fabaceae SSP gene promoter sequences are also enriched in a novel, seed-specific E2Fb-like motif. Conserved motifs identified in Poaceae SSP gene promoters include a GCN4-like motif, two prolamin-box-like motifs and an Skn-1-like motif. Evidence of the presence of a variant of the TATA-box is found in the SSP gene promoters from the three plant families. Motifs discovered in SSP gene promoters were used to score whole-genome sets of promoters from Arabidopsis, soybean and rice. The highest-scoring promoters are associated with genes coding for different subunits or precursors of seed storage proteins. Conclusion: Seed storage protein gene promoter motifs are conserved in diverse species, and different plant families are characterized by a distinct combination of conserved motifs. The majority of discovered motifs match experimentally characterized cis-regulatory elements. These results provide a good starting point for further experimental analysis of plant seed-specific promoters and our methodology can be used to unravel more transcriptional regulatory mechanisms in plants and other eukaryotes. PMID:19843335

  10. Discriminative motif discovery via simulated evolution and random under-sampling.

    PubMed

    Song, Tao; Gu, Hong

    2014-01-01

    Conserved motifs in biological sequences are closely related to their structure and functions. Recently, discriminative motif discovery methods have attracted more and more attention. However, little attention has been devoted to the data imbalance problem, which is one of the main reasons affecting the performance of the discriminative models. In this article, a simulated evolution method is applied to solve the multi-class imbalance problem at the stage of data preprocessing, and at the stage of Hidden Markov Models (HMMs) training, a random under-sampling method is introduced for the imbalance between the positive and negative datasets. It is shown that, in the task of discovering targeting motifs of nine subcellular compartments, the motifs found by our method are more conserved than the methods without considering data imbalance problem and recover the most known targeting motifs from Minimotif Miner and InterPro. Meanwhile, we use the found motifs to predict protein subcellular localization and achieve higher prediction precision and recall for the minority classes.

  11. De-novo discovery of differentially abundant transcription factor binding sites including their positional preference.

    PubMed

    Keilwagen, Jens; Grau, Jan; Paponov, Ivan A; Posch, Stefan; Strickert, Marc; Grosse, Ivo

    2011-02-10

    Transcription factors are a main component of gene regulation as they activate or repress gene expression by binding to specific binding sites in promoters. The de-novo discovery of transcription factor binding sites in target regions obtained by wet-lab experiments is a challenging problem in computational biology, which has not been fully solved yet. Here, we present a de-novo motif discovery tool called Dispom for finding differentially abundant transcription factor binding sites that models existing positional preferences of binding sites and adjusts the length of the motif in the learning process. Evaluating Dispom, we find that its prediction performance is superior to existing tools for de-novo motif discovery for 18 benchmark data sets with planted binding sites, and for a metazoan compendium based on experimental data from micro-array, ChIP-chip, ChIP-DSL, and DamID as well as Gene Ontology data. Finally, we apply Dispom to find binding sites differentially abundant in promoters of auxin-responsive genes extracted from Arabidopsis thaliana microarray data, and we find a motif that can be interpreted as a refined auxin responsive element predominately positioned in the 250-bp region upstream of the transcription start site. Using an independent data set of auxin-responsive genes, we find in genome-wide predictions that the refined motif is more specific for auxin-responsive genes than the canonical auxin-responsive element. In general, Dispom can be used to find differentially abundant motifs in sequences of any origin. However, the positional distribution learned by Dispom is especially beneficial if all sequences are aligned to some anchor point like the transcription start site in case of promoter sequences. We demonstrate that the combination of searching for differentially abundant motifs and inferring a position distribution from the data is beneficial for de-novo motif discovery. Hence, we make the tool freely available as a component of the open-source Java framework Jstacs and as a stand-alone application at http://www.jstacs.de/index.php/Dispom.

  12. Motif finding in DNA sequences based on skipping nonconserved positions in background Markov chains.

    PubMed

    Zhao, Xiaoyan; Sze, Sing-Hoi

    2011-05-01

    One strategy to identify transcription factor binding sites is through motif finding in upstream DNA sequences of potentially co-regulated genes. Despite extensive efforts, none of the existing algorithms perform very well. We consider a string representation that allows arbitrary ignored positions within the nonconserved portion of single motifs, and use O(2^l) Markov chains to model the background distributions of motifs of length l while skipping these positions within each Markov chain. By focusing initially on positions that have fixed nucleotides to define core occurrences, we develop an algorithm to identify motifs of moderate lengths. We compare the performance of our algorithm to other motif finding algorithms on a few benchmark data sets, and show that significant improvement in accuracy can be obtained when the sites are sufficiently conserved within a given sample, while comparable performance is obtained when the site conservation rate is low. A software program (PosMotif) and detailed results are available online at http://faculty.cse.tamu.edu/shsze/posmotif.

  13. miRvestigator: web application to identify miRNAs responsible for co-regulated gene expression patterns discovered through transcriptome profiling.

    PubMed

    Plaisier, Christopher L; Bare, J Christopher; Baliga, Nitin S

    2011-07-01

    Transcriptome profiling studies have produced staggering numbers of gene co-expression signatures for a variety of biological systems. A significant fraction of these signatures will be partially or fully explained by miRNA-mediated targeted transcript degradation. miRvestigator takes as input lists of co-expressed genes from Caenorhabditis elegans, Drosophila melanogaster, G. gallus, Homo sapiens, Mus musculus or Rattus norvegicus and identifies the specific miRNAs that are likely to bind to 3' un-translated region (UTR) sequences to mediate the observed co-regulation. The novelty of our approach is the miRvestigator hidden Markov model (HMM) algorithm which systematically computes a similarity P-value for each unique miRNA seed sequence from the miRNA database miRBase to an overrepresented sequence motif identified within the 3'-UTR of the query genes. We have made this miRNA discovery tool accessible to the community by integrating our HMM algorithm with a proven algorithm for de novo discovery of miRNA seed sequences and wrapping these algorithms into a user-friendly interface. Additionally, the miRvestigator web server also produces a list of putative miRNA binding sites within 3'-UTRs of the query transcripts to facilitate the design of validation experiments. The miRvestigator is freely available at http://mirvestigator.systemsbiology.net.

  14. Comprehensive human transcription factor binding site map for combinatory binding motifs discovery.

    PubMed

    Müller-Molina, Arnoldo J; Schöler, Hans R; Araúzo-Bravo, Marcos J

    2012-01-01

    To know the map between transcription factors (TFs) and their binding sites is essential to reverse engineer the regulation process. Only about 10%-20% of the transcription factor binding motifs (TFBMs) have been reported. This lack of data hinders understanding gene regulation. To address this drawback, we propose a computational method that exploits never used TF properties to discover the missing TFBMs and their sites in all human gene promoters. The method starts by predicting a dictionary of regulatory "DNA words." From this dictionary, it distills 4098 novel predictions. To disclose the crosstalk between motifs, an additional algorithm extracts TF combinatorial binding patterns creating a collection of TF regulatory syntactic rules. Using these rules, we narrowed down a list of 504 novel motifs that appear frequently in syntax patterns. We tested the predictions against 509 known motifs confirming that our system can reliably predict ab initio motifs with an accuracy of 81%, far higher than previous approaches. We found that on average, 90% of the discovered combinatorial binding patterns target at least 10 genes, suggesting that to control in an independent manner smaller gene sets, supplementary regulatory mechanisms are required. Additionally, we discovered that the new TFBMs and their combinatorial patterns convey biological meaning, targeting TFs and genes related to developmental functions. Thus, among all the possible available targets in the genome, the TFs tend to regulate other TFs and genes involved in developmental functions. We provide a comprehensive resource for regulation analysis that includes a dictionary of "DNA words," newly predicted motifs and their corresponding combinatorial patterns. Combinatorial patterns are a useful filter to discover TFBMs that play a major role in orchestrating other factors and thus, are likely to lock/unlock cellular functional clusters.

  15. Comprehensive Human Transcription Factor Binding Site Map for Combinatory Binding Motifs Discovery

    PubMed Central

    Müller-Molina, Arnoldo J.; Schöler, Hans R.; Araúzo-Bravo, Marcos J.

    2012-01-01

    To know the map between transcription factors (TFs) and their binding sites is essential to reverse engineer the regulation process. Only about 10%–20% of the transcription factor binding motifs (TFBMs) have been reported. This lack of data hinders understanding gene regulation. To address this drawback, we propose a computational method that exploits never used TF properties to discover the missing TFBMs and their sites in all human gene promoters. The method starts by predicting a dictionary of regulatory “DNA words.” From this dictionary, it distills 4098 novel predictions. To disclose the crosstalk between motifs, an additional algorithm extracts TF combinatorial binding patterns creating a collection of TF regulatory syntactic rules. Using these rules, we narrowed down a list of 504 novel motifs that appear frequently in syntax patterns. We tested the predictions against 509 known motifs confirming that our system can reliably predict ab initio motifs with an accuracy of 81%—far higher than previous approaches. We found that on average, 90% of the discovered combinatorial binding patterns target at least 10 genes, suggesting that to control in an independent manner smaller gene sets, supplementary regulatory mechanisms are required. Additionally, we discovered that the new TFBMs and their combinatorial patterns convey biological meaning, targeting TFs and genes related to developmental functions. Thus, among all the possible available targets in the genome, the TFs tend to regulate other TFs and genes involved in developmental functions. We provide a comprehensive resource for regulation analysis that includes a dictionary of “DNA words,” newly predicted motifs and their corresponding combinatorial patterns. Combinatorial patterns are a useful filter to discover TFBMs that play a major role in orchestrating other factors and thus, are likely to lock/unlock cellular functional clusters. PMID:23209563

  16. Detection of core-periphery structure in networks based on 3-tuple motifs

    NASA Astrophysics Data System (ADS)

    Ma, Chuang; Xiang, Bing-Bing; Chen, Han-Shuang; Small, Michael; Zhang, Hai-Feng

    2018-05-01

    Detecting mesoscale structure, such as community structure, is of vital importance for analyzing complex networks. Recently, a new mesoscale structure, core-periphery (CP) structure, has been identified in many real-world systems. In this paper, we propose an effective algorithm for detecting CP structure based on a 3-tuple motif. In this algorithm, we first define a 3-tuple motif in terms of the patterns of edges as well as the property of nodes, and then a motif adjacency matrix is constructed based on the 3-tuple motif. Finally, the problem is converted to find a cluster that minimizes the smallest motif conductance. Our algorithm works well in different CP structures: including single or multiple CP structure, and local or global CP structures. Results on the synthetic and the empirical networks validate the high performance of our method.
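
    The two ingredients named in this abstract, a motif adjacency matrix and motif conductance, can be sketched as follows. The abstract does not define the 3-tuple motif, so the sketch substitutes a plain triangle as the motif; that choice, the function names and the toy graph are all assumptions, and the paper's own motif also depends on node properties.

```python
from itertools import combinations

def triangle_motif_adjacency(n, edges):
    """W[i][j] = number of triangles containing both node i and node j.

    Assumption: the motif is an undirected triangle, used here only as a
    stand-in for the paper's 3-tuple motif.
    """
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    W = [[0] * n for _ in range(n)]
    for u, v, w in combinations(range(n), 3):
        if v in adj[u] and w in adj[u] and w in adj[v]:
            for a, b in ((u, v), (u, w), (v, w)):
                W[a][b] += 1
                W[b][a] += 1
    return W

def conductance(W, cluster):
    """Weighted cut between `cluster` and the rest, over the smaller volume."""
    n = len(W)
    inside = set(cluster)
    outside = [i for i in range(n) if i not in inside]
    cut = sum(W[i][j] for i in inside for j in outside)
    vol_in = sum(W[i][j] for i in inside for j in range(n))
    vol_out = sum(W[i][j] for i in outside for j in range(n))
    denom = min(vol_in, vol_out)
    return cut / denom if denom else float("inf")
```

    For two triangles joined by a single bridge edge, the bridge lies in no triangle and gets weight zero, so either triangle is a cluster with motif conductance 0.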

  17. A relational extension of the notion of motifs: application to the common 3D protein substructures searching problem.

    PubMed

    Pisanti, Nadia; Soldano, Henry; Carpentier, Mathilde; Pothier, Joel

    2009-12-01

    The geometrical configurations of atoms in protein structures can be viewed as approximate relations among them. Then, finding similar common substructures within a set of protein structures belongs to a new class of problems that generalizes that of finding repeated motifs. The novelty lies in the addition of constraints on the motifs in terms of relations that must hold between pairs of positions of the motifs. We will hence denote them as relational motifs. For this class of problems, we present an algorithm that is a suitable extension of the KMR paradigm and, in particular, of the KMRC as it uses a degenerate alphabet. Our algorithm contains several improvements that become especially useful when, as is required for relational motifs, the inference is made by partially overlapping shorter motifs, rather than concatenating them. The efficiency, correctness and completeness of the algorithm are ensured by several non-trivial properties that are proven in this paper. The algorithm has been applied in the important field of protein common 3D substructure searching. The methods implemented have been tested on several examples of protein families such as serine proteases, globins and cytochromes P450. The detected motifs have been compared to those found by multiple structural alignment methods.

  18. Regulatory elements of Caenorhabditis elegans ribosomal protein genes

    PubMed Central

    2012-01-01

    Background: Ribosomal protein genes (RPGs) are essential, tightly regulated, and highly expressed during embryonic development and cell growth. Even though their protein sequences are strongly conserved, their mechanism of regulation is not conserved across yeast, Drosophila, and vertebrates. A recent investigation of genomic sequences conserved across both nematode species and associated with different gene groups indicated the existence of several elements in the upstream regions of C. elegans RPGs, providing a new insight regarding the regulation of these genes in C. elegans. Results: In this study, we performed an in-depth examination of C. elegans RPG regulation and found nine highly conserved motifs in the upstream regions of C. elegans RPGs using the motif discovery algorithm DME. Four motifs were partially similar to transcription factor binding sites from C. elegans, Drosophila, yeast, and human. One pair of these motifs was found to co-occur in the upstream regions of 250 transcripts including 22 RPGs. The distance between the two motifs displayed a complex frequency pattern that was related to their relative orientation. We tested the impact of three of these motifs on the expression of rpl-2 using a series of reporter gene constructs and showed that all three motifs are necessary to maintain the high natural expression level of this gene. One of the motifs was similar to the binding site of an orthologue of POP-1, and we showed that RNAi knockdown of pop-1 impacts the expression of rpl-2. We further determined the transcription start site of rpl-2 by 5’ RACE and found that the motifs lie 40–90 bases upstream of the start site. We also found evidence that a noncoding RNA, contained within the outron of rpl-2, is co-transcribed with rpl-2 and cleaved during trans-splicing. Conclusions: Our results indicate that C. elegans RPGs are regulated by a complex novel series of regulatory elements that is evolutionarily distinct from those of all other species examined up until now. PMID:22928635

  19. G-quadruplex prediction in E. coli genome reveals a conserved putative G-quadruplex-Hairpin-Duplex switch.

    PubMed

    Kaplan, Oktay I; Berber, Burak; Hekim, Nezih; Doluca, Osman

    2016-11-02

    Many studies show that short non-coding sequences are widely conserved among regulatory elements. More and more conserved sequences are being discovered since the development of next generation sequencing technology. A common approach to identify conserved sequences with regulatory roles relies on topological changes such as hairpin formation at the DNA or RNA level. G-quadruplexes, non-canonical nucleic acid topologies with little established biological roles, are increasingly considered for conserved regulatory element discovery. Since the tertiary structure of G-quadruplexes is strongly dependent on the loop sequence which is disregarded by the generally accepted algorithm, we hypothesized that G-quadruplexes with similar topology and, indirectly, similar interaction patterns, can be determined using phylogenetic clustering based on differences in the loop sequences. Phylogenetic analysis of 52 G-quadruplex forming sequences in the Escherichia coli genome revealed two conserved G-quadruplex motifs with a potential regulatory role. Further analysis revealed that both motifs tend to form hairpins and G quadruplexes, as supported by circular dichroism studies. The phylogenetic analysis as described in this work can greatly improve the discovery of functional G-quadruplex structures and may explain unknown regulatory patterns. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
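
    As an illustration of the kind of pattern scan that treats loop sequences only as spacers, the following is a minimal sketch. The abstract does not spell out the "generally accepted algorithm"; the pattern assumed here (four runs of at least three guanines separated by three loops of 1 to 7 nucleotides) and the names G4_RE and find_g4_candidates are illustrative assumptions, not the authors' code. The sketch also returns the loop sequences, since those are what the abstract proposes to compare.

```python
import re

# Assumed pattern (not taken from the abstract): four G-runs of length >= 3
# separated by three loops of 1-7 nucleotides. The loops are captured so that
# they can be compared between candidates afterwards.
G4_RE = re.compile(
    r"G{3,}([ACGT]{1,7})G{3,}([ACGT]{1,7})G{3,}([ACGT]{1,7})G{3,}"
)


def find_g4_candidates(sequence):
    """Return (start, end, loops) for each non-overlapping pattern match.

    Only the given strand is scanned; the reverse complement would need a
    second pass with a C-based pattern.
    """
    hits = []
    for match in G4_RE.finditer(sequence.upper()):
        hits.append((match.start(), match.end(), match.groups()))
    return hits


if __name__ == "__main__":
    for hit in find_g4_candidates("ttgggagggttgggcagggtt"):
        print(hit)
```

    The captured loop tuples could then be fed to any sequence-distance or clustering routine; which distance the authors used is not stated in the abstract.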

  20. QuateXelero: An Accelerated Exact Network Motif Detection Algorithm

    PubMed Central

    Khakabimamaghani, Sahand; Sharafuddin, Iman; Dichter, Norbert; Koch, Ina; Masoudi-Nejad, Ali

    2013-01-01

    Finding motifs in biological, social, technological, and other types of networks has become a widespread method to gain more knowledge about these networks’ structure and function. However, this task is very computationally demanding, because it is closely associated with graph isomorphism, which is a problem in NP (not yet known to be in P or to be NP-complete). Accordingly, this research endeavors to reduce the number of calls to the NAUTY isomorphism detection method, which is the most time-consuming step in many existing algorithms. The work provides an extremely fast motif detection algorithm called QuateXelero, which has a Quaternary Tree data structure at its heart. The proposed algorithm is based on the well-known ESU (FANMOD) motif detection algorithm. The results of experiments on some standard model networks confirm the overall superiority of the proposed algorithm, namely QuateXelero, compared with two of the fastest existing algorithms, G-Tries and Kavosh. QuateXelero is especially fast in constructing the central data structure of the algorithm from scratch based on the input network. PMID:23874498

  1. Extracting DNA words based on the sequence features: non-uniform distribution and integrity.

    PubMed

    Li, Zhi; Cao, Hongyan; Cui, Yuehua; Zhang, Yanbo

    2016-01-25

    DNA sequence can be viewed as an unknown language with words as its functional units. Given that most sequence alignment algorithms such as the motif discovery algorithms depend on the quality of background information about sequences, it is necessary to develop an ab initio algorithm for extracting the "words" based only on the DNA sequences. We considered that non-uniform distribution and integrity were two important features of a word, based on which we developed an ab initio algorithm to extract "DNA words" that have potential functional meaning. A Kolmogorov-Smirnov test was used to test whether DNA sequences are uniformly distributed, and the integrity was judged by the sequence and position alignment. Two random base sequences were adopted as negative controls, and an English book was used as a positive control to verify our algorithm. We applied our algorithm to the genomes of Saccharomyces cerevisiae and 10 strains of Escherichia coli to show the utility of the methods. The results provide strong evidence that the algorithm is a promising tool for building a DNA dictionary ab initio. Our method provides a fast way for large scale screening of important DNA elements and offers potential insights into the understanding of a genome.
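
    The uniformity check mentioned above can be sketched as a one-sample Kolmogorov-Smirnov statistic on the positions of a candidate word. This is a minimal sketch under stated assumptions: the helper names word_positions and ks_uniform_statistic are hypothetical, positions are scaled by the sequence length, and the abstract does not say how the authors normalise positions or choose a significance threshold.

```python
def word_positions(sequence, word):
    """Start positions of all (possibly overlapping) occurrences of word."""
    positions = []
    start = sequence.find(word)
    while start != -1:
        positions.append(start)
        start = sequence.find(word, start + 1)
    return positions


def ks_uniform_statistic(positions, length):
    """One-sample KS statistic D of positions against Uniform(0, length).

    A large D means the occurrences are far from evenly spread, which is the
    'non-uniform distribution' feature the abstract refers to.
    """
    xs = sorted(p / length for p in positions)
    n = len(xs)
    if n == 0:
        raise ValueError("no positions given")
    d = 0.0
    for i, x in enumerate(xs):
        # Compare the empirical CDF just after and just before x with the
        # uniform CDF, which equals x on [0, 1].
        d = max(d, (i + 1) / n - x, x - i / n)
    return d


if __name__ == "__main__":
    seq = "ACGT" * 5 + "GATTACA" + "ACGT" * 5
    pos = word_positions(seq, "GATTACA")
    print(pos, ks_uniform_statistic(pos, len(seq)))
```

    Turning D into a p-value would require the Kolmogorov distribution (available, for example, in statistical libraries); that step is left out here to keep the sketch dependency-free.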

  2. RNA motif search with data-driven element ordering.

    PubMed

    Rampášek, Ladislav; Jimenez, Randi M; Lupták, Andrej; Vinař, Tomáš; Brejová, Broňa

    2016-05-18

    In this paper, we study the problem of RNA motif search in long genomic sequences. This approach uses a combination of sequence and structure constraints to uncover new distant homologs of known functional RNAs. The problem is NP-hard and is traditionally solved by backtracking algorithms. We have designed a new algorithm for RNA motif search and implemented a new motif search tool RNArobo. The tool enhances the RNAbob descriptor language, allowing insertions in helices, which enables better characterization of ribozymes and aptamers. A typical RNA motif consists of multiple elements and the running time of the algorithm is highly dependent on their ordering. By approaching the element ordering problem in a principled way, we demonstrate more than 100-fold speedup of the search for complex motifs compared to previously published tools. We have developed a new method for RNA motif search that allows for a significant speedup of the search of complex motifs that include pseudoknots. Such speed improvements are crucial at a time when the rate of DNA sequencing outpaces growth in computing. RNArobo is available at http://compbio.fmph.uniba.sk/rnarobo .

  3. Symmetry compression method for discovering network motifs.

    PubMed

    Wang, Jianxin; Huang, Yuannan; Wu, Fang-Xiang; Pan, Yi

    2012-01-01

    Discovering network motifs could provide a significant insight into systems biology. Interestingly, many biological networks have been found to have a high degree of symmetry (automorphism), which is inherent in biological network topologies. The symmetry due to the large number of basic symmetric subgraphs (BSSs) causes a certain redundant calculation in discovering network motifs. Therefore, we compress all basic symmetric subgraphs before extracting compressed subgraphs and propose an efficient decompression algorithm to decompress all compressed subgraphs without loss of any information. In contrast to previous approaches, the novel Symmetry Compression method for Motif Detection, named as SCMD, eliminates most redundant calculations caused by widespread symmetry of biological networks. We use SCMD to improve three notable exact algorithms and two efficient sampling algorithms. Results of all exact algorithms with SCMD are the same as those of the original algorithms, since SCMD is a lossless method. The sampling results show that the use of SCMD almost does not affect the quality of sampling results. For highly symmetric networks, we find that SCMD used in both exact and sampling algorithms can help get a remarkable speedup. Furthermore, SCMD enables us to find larger motifs in biological networks with notable symmetry than previously possible.

  4. RSAT matrix-clustering: dynamic exploration and redundancy reduction of transcription factor binding motif collections

    PubMed Central

    Jaeger, Sébastien; Thieffry, Denis

    2017-01-01

    Transcription factor (TF) databases contain multitudes of binding motifs (TFBMs) from various sources, from which non-redundant collections are derived by manual curation. The advent of high-throughput methods stimulated the production of novel collections with increasing numbers of motifs. Meta-databases, built by merging these collections, contain redundant versions, because available tools are not suited to automatically identify and explore biologically relevant clusters among thousands of motifs. Motif discovery from genome-scale data sets (e.g. ChIP-seq) also produces redundant motifs, hampering the interpretation of results. We present matrix-clustering, a versatile tool that clusters similar TFBMs into multiple trees, and automatically creates non-redundant TFBM collections. A feature unique to matrix-clustering is its dynamic visualisation of aligned TFBMs, and its capability to simultaneously treat multiple collections from various sources. We demonstrate that matrix-clustering considerably simplifies the interpretation of combined results from multiple motif discovery tools, and highlights biologically relevant variations of similar motifs. We also ran a large-scale application to cluster ∼11 000 motifs from 24 entire databases, showing that matrix-clustering correctly groups motifs belonging to the same TF families, and drastically reduced motif redundancy. matrix-clustering is integrated within the RSAT suite (http://rsat.eu/), accessible through a user-friendly web interface or command-line for its integration in pipelines. PMID:28591841

  5. Clustering and Candidate Motif Detection in Exosomal miRNAs by Application of Machine Learning Algorithms.

    PubMed

    Gaur, Pallavi; Chaturvedi, Anoop

    2017-07-22

    The clustering pattern and motifs give immense information about any biological data. An application of machine learning algorithms for clustering and candidate motif detection in miRNAs derived from exosomes is depicted in this paper. Recent progress in the field of exosome research and more particularly regarding exosomal miRNAs has led much bioinformatic-based research to come into existence. The information on clustering pattern and candidate motifs in miRNAs of exosomal origin would help in analyzing existing, as well as newly discovered miRNAs within exosomes. Along with obtaining clustering pattern and candidate motifs in exosomal miRNAs, this work also elaborates the usefulness of the machine learning algorithms that can be efficiently used and executed on various programming languages/platforms. Data were clustered and sequence candidate motifs were detected successfully. The results were compared and validated with some available web tools such as 'BLASTN' and 'MEME suite'. The machine learning algorithms for aforementioned objectives were applied successfully. This work elaborated utility of machine learning algorithms and language platforms to achieve the tasks of clustering and candidate motif detection in exosomal miRNAs. With the information on mentioned objectives, deeper insight would be gained for analyses of newly discovered miRNAs in exosomes which are considered to be circulating biomarkers. In addition, the execution of machine learning algorithms on various language platforms gives more flexibility to users to try multiple iterations according to their requirements. This approach can be applied to other biological data-mining tasks as well.

  6. STEME: A Robust, Accurate Motif Finder for Large Data Sets

    PubMed Central

    Reid, John E.; Wernisch, Lorenz

    2014-01-01

    Motif finding is a difficult problem that has been studied for over 20 years. Some older popular motif finders are not suitable for analysis of the large data sets generated by next-generation sequencing. We recently published an efficient approximation (STEME) to the EM algorithm that is at the core of many motif finders such as MEME. This approximation allows the EM algorithm to be applied to large data sets. In this work we describe several efficient extensions to STEME that are based on the MEME algorithm. Together with the original STEME EM approximation, these extensions make STEME a fully-fledged motif finder with similar properties to MEME. We discuss the difficulty of objectively comparing motif finders. We show that STEME performs comparably to existing prominent discriminative motif finders, DREME and Trawler, on 13 sets of transcription factor binding data in mouse ES cells. We demonstrate the ability of STEME to find long degenerate motifs which these discriminative motif finders do not find. As part of our method, we extend an earlier method due to Nagarajan et al. for the efficient calculation of motif E-values. STEME's source code is available under an open source license and STEME is available via a web interface. PMID:24625410

  7. PSSMSearch: a server for modeling, visualization, proteome-wide discovery and annotation of protein motif specificity determinants.

    PubMed

    Krystkowiak, Izabella; Manguy, Jean; Davey, Norman E

    2018-06-05

    There is a pressing need for in silico tools that can aid in the identification of the complete repertoire of protein binding (SLiMs, MoRFs, miniMotifs) and modification (moiety attachment/removal, isomerization, cleavage) motifs. We have created PSSMSearch, an interactive web-based tool for rapid statistical modeling, visualization, discovery and annotation of protein motif specificity determinants to discover novel motifs in a proteome-wide manner. PSSMSearch analyses proteomes for regions with significant similarity to a motif specificity determinant model built from a set of aligned motif-containing peptides. Multiple scoring methods are available to build a position-specific scoring matrix (PSSM) describing the motif specificity determinant model. This model can then be modified by a user to add prior knowledge of specificity determinants through an interactive PSSM heatmap. PSSMSearch includes a statistical framework to calculate the significance of specificity determinant model matches against a proteome of interest. PSSMSearch also includes the SLiMSearch framework's annotation, motif functional analysis and filtering tools to highlight relevant discriminatory information. Additional tools to annotate statistically significant shared keywords and GO terms, or experimental evidence of interaction with a motif-recognizing protein have been added. Finally, PSSM-based conservation metrics have been created for taxonomic range analyses. The PSSMSearch web server is available at http://slim.ucd.ie/pssmsearch/.
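
    To make the idea of a position-specific scoring matrix built from aligned peptides concrete, here is a minimal sketch. It assumes a plain log-odds scheme with a pseudocount, which is only one of many possible scoring methods and is not claimed to be one that PSSMSearch uses; build_pssm and score are hypothetical names, and the toy four-letter alphabet stands in for the 20 amino acids.

```python
from collections import Counter
from math import log2


def build_pssm(peptides, background, pseudocount=1.0):
    """Log-odds PSSM (one dict per column) from equal-length aligned peptides.

    background maps each residue to its background frequency.
    """
    width = len(peptides[0])
    if any(len(p) != width for p in peptides):
        raise ValueError("peptides must be aligned to the same length")
    alphabet = sorted(background)
    total = len(peptides) + pseudocount * len(alphabet)
    pssm = []
    for i in range(width):
        column = Counter(p[i] for p in peptides)
        pssm.append({
            a: log2(((column[a] + pseudocount) / total) / background[a])
            for a in alphabet
        })
    return pssm


def score(pssm, peptide):
    """Sum of per-position log-odds scores for a peptide of matching width."""
    return sum(col[res] for col, res in zip(pssm, peptide))


if __name__ == "__main__":
    bg = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # toy alphabet
    matrix = build_pssm(["AC", "AG", "AT", "AC"], bg)
    print(score(matrix, "AC"), score(matrix, "CA"))
```

    Scanning a proteome would then amount to sliding a window of the matrix width over each protein and scoring every window; assessing the significance of those scores is a separate step that the sketch does not cover.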

  8. RSAT matrix-clustering: dynamic exploration and redundancy reduction of transcription factor binding motif collections.

    PubMed

    Castro-Mondragon, Jaime Abraham; Jaeger, Sébastien; Thieffry, Denis; Thomas-Chollier, Morgane; van Helden, Jacques

    2017-07-27

    Transcription factor (TF) databases contain multitudes of binding motifs (TFBMs) from various sources, from which non-redundant collections are derived by manual curation. The advent of high-throughput methods stimulated the production of novel collections with increasing numbers of motifs. Meta-databases, built by merging these collections, contain redundant versions, because available tools are not suited to automatically identify and explore biologically relevant clusters among thousands of motifs. Motif discovery from genome-scale data sets (e.g. ChIP-seq) also produces redundant motifs, hampering the interpretation of results. We present matrix-clustering, a versatile tool that clusters similar TFBMs into multiple trees, and automatically creates non-redundant TFBM collections. A feature unique to matrix-clustering is its dynamic visualisation of aligned TFBMs, and its capability to simultaneously treat multiple collections from various sources. We demonstrate that matrix-clustering considerably simplifies the interpretation of combined results from multiple motif discovery tools, and highlights biologically relevant variations of similar motifs. We also ran a large-scale application to cluster ∼11 000 motifs from 24 entire databases, showing that matrix-clustering correctly groups motifs belonging to the same TF families, and drastically reduced motif redundancy. matrix-clustering is integrated within the RSAT suite (http://rsat.eu/), accessible through a user-friendly web interface or command-line for its integration in pipelines. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  9. Informative priors based on transcription factor structural class improve de novo motif discovery.

    PubMed

    Narlikar, Leelavati; Gordân, Raluca; Ohler, Uwe; Hartemink, Alexander J

    2006-07-15

    An important problem in molecular biology is to identify the locations at which a transcription factor (TF) binds to DNA, given a set of DNA sequences believed to be bound by that TF. In previous work, we showed that information in the DNA sequence of a binding site is sufficient to predict the structural class of the TF that binds it. In particular, this suggests that we can predict which locations in any DNA sequence are more likely to be bound by certain classes of TFs than others. Here, we argue that traditional methods for de novo motif finding can be significantly improved by adopting an informative prior probability that a TF binding site occurs at each sequence location. To demonstrate the utility of such an approach, we present priority, a powerful new de novo motif finding algorithm. Using data from TRANSFAC, we train three classifiers to recognize binding sites of basic leucine zipper, forkhead, and basic helix loop helix TFs. These classifiers are used to equip priority with three class-specific priors, in addition to a default prior to handle TFs of other classes. We apply priority and a number of popular motif finding programs to sets of yeast intergenic regions that are reported by ChIP-chip to be bound by particular TFs. priority identifies motifs the other methods fail to identify, and correctly predicts the structural class of the TF recognizing the identified binding sites. Supplementary material and code can be found at http://www.cs.duke.edu/~amink/.

  10. Chiral Alkyl Halides: Underexplored Motifs in Medicine

    PubMed Central

    Gál, Bálint; Bucher, Cyril; Burns, Noah Z.

    2016-01-01

    While alkyl halides are valuable intermediates in synthetic organic chemistry, their use as bioactive motifs in drug discovery and medicinal chemistry is rare in comparison. This is likely attributable to the common misconception that these compounds are merely non-specific alkylators in biological systems. A number of chlorinated compounds in the pharmaceutical and food industries, as well as a growing number of halogenated marine natural products showing unique bioactivity, illustrate the role that chiral alkyl halides can play in drug discovery. Through a series of case studies, we demonstrate in this review that these motifs can indeed be stable under physiological conditions, and that halogenation can enhance bioactivity through both steric and electronic effects. Our hope is that, by placing such compounds in the minds of the chemical community, they may gain more traction in drug discovery and inspire more synthetic chemists to develop methods for selective halogenation. PMID:27827902

  11. Searching for statistically significant regulatory modules.

    PubMed

    Bailey, Timothy L; Noble, William Stafford

    2003-10-01

    The regulatory machinery controlling gene expression is complex, frequently requiring multiple, simultaneous DNA-protein interactions. The rate at which a gene is transcribed may depend upon the presence or absence of a collection of transcription factors bound to the DNA near the gene. Locating transcription factor binding sites in genomic DNA is difficult because the individual sites are small and tend to occur frequently by chance. True binding sites may be identified by their tendency to occur in clusters, sometimes known as regulatory modules. We describe an algorithm for detecting occurrences of regulatory modules in genomic DNA. The algorithm, called mcast, takes as input a DNA database and a collection of binding site motifs that are known to operate in concert. mcast uses a motif-based hidden Markov model with several novel features. The model incorporates motif-specific p-values, thereby allowing scores from motifs of different widths and specificities to be compared directly. The p-value scoring also allows mcast to only accept motif occurrences with significance below a user-specified threshold, while still assigning better scores to motif occurrences with lower p-values. mcast can search long DNA sequences, modeling length distributions between motifs within a regulatory module, but ignoring length distributions between modules. The algorithm produces a list of predicted regulatory modules, ranked by E-value. We validate the algorithm using simulated data as well as real data sets from fruitfly and human. http://meme.sdsc.edu/MCAST/paper

  12. Detecting DNA regulatory motifs by incorporating positional trends in information content

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kechris, Katherina J.; van Zwet, Erik; Bickel, Peter J.

    2004-05-04

    On the basis of the observation that conserved positions in transcription factor binding sites are often clustered together, we propose a simple extension to the model-based motif discovery methods. We assign position-specific prior distributions to the frequency parameters of the model, penalizing deviations from a specified conservation profile. Examples with both simulated and real data show that this extension helps discover motifs as the data become noisier or when there is a competing false motif.

  13. TrawlerWeb: an online de novo motif discovery tool for next-generation sequencing datasets.

    PubMed

    Dang, Louis T; Tondl, Markus; Chiu, Man Ho H; Revote, Jerico; Paten, Benedict; Tano, Vincent; Tokolyi, Alex; Besse, Florence; Quaife-Ryan, Greg; Cumming, Helen; Drvodelic, Mark J; Eichenlaub, Michael P; Hallab, Jeannette C; Stolper, Julian S; Rossello, Fernando J; Bogoyevitch, Marie A; Jans, David A; Nim, Hieu T; Porrello, Enzo R; Hudson, James E; Ramialison, Mirana

    2018-04-05

    A strong focus of the post-genomic era is mining of the non-coding regulatory genome in order to unravel the function of regulatory elements that coordinate gene expression (Nat 489:57-74, 2012; Nat 507:462-70, 2014; Nat 507:455-61, 2014; Nat 518:317-30, 2015). Whole-genome approaches based on next-generation sequencing (NGS) have provided insight into the genomic location of regulatory elements throughout different cell types, organs and organisms. These technologies are now widespread and commonly used in laboratories from various fields of research. This highlights the need for fast and user-friendly software tools dedicated to extracting cis-regulatory information contained in these regulatory regions; for instance transcription factor binding site (TFBS) composition. Ideally, such tools should not require prior programming knowledge to ensure they are accessible for all users. We present TrawlerWeb, a web-based version of the Trawler_standalone tool (Nat Methods 4:563-5, 2007; Nat Protoc 5:323-34, 2010), to allow for the identification of enriched motifs in DNA sequences obtained from next-generation sequencing experiments in order to predict their TFBS composition. TrawlerWeb is designed for online queries with standard options common to web-based motif discovery tools. In addition, TrawlerWeb provides three unique new features: 1) TrawlerWeb allows the input of BED files directly generated from NGS experiments, 2) it automatically generates an input-matched biologically relevant background, and 3) it displays resulting conservation scores for each instance of the motif found in the input sequences, which assists the researcher in prioritising the motifs to validate experimentally. Finally, to date, this web-based version of Trawler_standalone remains the fastest online de novo motif discovery tool compared to other popular web-based software, while generating predictions with high accuracy. 
    TrawlerWeb provides users with a fast, simple and easy-to-use web interface for de novo motif discovery. This will assist in rapidly analysing NGS datasets that are now being routinely generated. TrawlerWeb is freely available and accessible at: http://trawler.erc.monash.edu.au.

  14. GPS-ARM: Computational Analysis of the APC/C Recognition Motif by Predicting D-Boxes and KEN-Boxes

    PubMed Central

    Ren, Jian; Cao, Jun; Zhou, Yanhong; Yang, Qing; Xue, Yu

    2012-01-01

    Anaphase-promoting complex/cyclosome (APC/C), an E3 ubiquitin ligase incorporated with Cdh1 and/or Cdc20, recognizes and interacts with specific substrates, and faithfully orchestrates the proper cell cycle events by targeting proteins for proteasomal degradation. Experimental identification of APC/C substrates is largely dependent on the discovery of APC/C recognition motifs, e.g., the D-box and KEN-box. Although a number of either stringent or loosely defined motifs have been proposed, these motif patterns are only of limited use due to their insufficient powers of prediction. We report the development of a novel GPS-ARM software package which is useful for the prediction of D-boxes and KEN-boxes in proteins. Using experimentally identified D-boxes and KEN-boxes as the training data sets, a previously developed GPS (Group-based Prediction System) algorithm was adopted. By extensive evaluation and comparison, the GPS-ARM performance was found to be much better than the one using simple motifs. With this powerful tool, we predicted 4,841 potential D-boxes in 3,832 proteins and 1,632 potential KEN-boxes in 1,403 proteins from H. sapiens, while further statistical analysis suggested that both the D-box and KEN-box proteins are involved in a broad spectrum of biological processes beyond the cell cycle. In addition, with the co-localization information, we predicted hundreds of mitosis-specific APC/C substrates with high confidence. As the first computational tool for the prediction of APC/C-mediated degradation, GPS-ARM provides useful information for further experimental investigations. The GPS-ARM is freely accessible for academic researchers at: http://arm.biocuckoo.org. PMID:22479614

  15. MOCCS: Clarifying DNA-binding motif ambiguity using ChIP-Seq data.

    PubMed

    Ozaki, Haruka; Iwasaki, Wataru

    2016-08-01

    As a key mechanism of gene regulation, transcription factors (TFs) bind to DNA by recognizing specific short sequence patterns that are called DNA-binding motifs. A single TF can accept ambiguity within its DNA-binding motifs, which comprise both canonical (typical) and non-canonical motifs. Clarification of such DNA-binding motif ambiguity is crucial for revealing gene regulatory networks and evaluating mutations in cis-regulatory elements. Although chromatin immunoprecipitation sequencing (ChIP-seq) now provides abundant data on the genomic sequences to which a given TF binds, existing motif discovery methods are unable to directly answer whether a given TF can bind to a specific DNA-binding motif. Here, we report a method for clarifying the DNA-binding motif ambiguity, MOCCS. Given ChIP-Seq data of any TF, MOCCS comprehensively analyzes and describes every k-mer to which that TF binds. Analysis of simulated datasets revealed that MOCCS is applicable to various ChIP-Seq datasets, requiring only a few minutes per dataset. Application to the ENCODE ChIP-Seq datasets proved that MOCCS directly evaluates whether a given TF binds to each DNA-binding motif, even if known position weight matrix models do not provide sufficient information on DNA-binding motif ambiguity. Furthermore, users are not required to provide numerous parameters or background genomic sequence models that are typically unavailable. MOCCS is implemented in Perl and R and is freely available via https://github.com/yuifu/moccs. By complementing existing motif-discovery software, MOCCS will contribute to the basic understanding of how the genome controls diverse cellular processes via DNA-protein interactions. Copyright © 2016 Elsevier Ltd. All rights reserved.

  16. kpLogo: positional k-mer analysis reveals hidden specificity in biological sequences

    PubMed Central

    2017-01-01

    Motifs of only 1–4 letters can play important roles when present at key locations within macromolecules. Because existing motif-discovery tools typically miss these position-specific short motifs, we developed kpLogo, a probability-based logo tool for integrated detection and visualization of position-specific ultra-short motifs from a set of aligned sequences. kpLogo also overcomes the limitations of conventional motif-visualization tools in handling positional interdependencies and utilizing ranked or weighted sequences increasingly available from high-throughput assays. kpLogo can be found at http://kplogo.wi.mit.edu/. PMID:28460012
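
    A position-specific k-mer analysis of aligned sequences can be sketched as follows. This is an illustrative sketch only: the abstract does not describe kpLogo's statistics, so the binomial upper-tail test against the k-mer's overall frequency is an assumption, as are the names positional_kmer_counts, binomial_upper_tail and enriched_positions.

```python
from collections import Counter
from math import comb


def positional_kmer_counts(seqs, k):
    """Count each k-mer at each start position across aligned sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[(i, s[i:i + k])] += 1
    return counts


def binomial_upper_tail(x, n, p):
    """P(X >= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(x, n + 1))


def enriched_positions(seqs, k, alpha=0.01):
    """Return (position, k-mer, count, p-value) tuples with p-value < alpha.

    Assumes equal-length sequences. The background probability of a k-mer is
    its frequency over all positions and sequences (an assumed null model).
    """
    n = len(seqs)
    width = len(seqs[0]) - k + 1
    counts = positional_kmer_counts(seqs, k)
    totals = Counter()
    for (_, kmer), c in counts.items():
        totals[kmer] += c
    results = []
    for (pos, kmer), c in counts.items():
        p_bg = totals[kmer] / (n * width)
        pval = binomial_upper_tail(c, n, p_bg)
        if pval < alpha:
            results.append((pos, kmer, c, pval))
    return sorted(results, key=lambda r: r[3])


if __name__ == "__main__":
    aligned = ["ACGTAC", "CAGATC", "TTGCAT", "ATGTCA",
               "CCGATA", "TAGCCT", "ACGACT", "CTGTTA"]
    for row in enriched_positions(aligned, k=1):
        print(row)
```

    No multiple-testing correction is applied in this sketch; with many positions and k-mers, a correction would normally be needed before interpreting the p-values.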

  17. ProMotE: an efficient algorithm for counting independent motifs in uncertain network topologies.

    PubMed

    Ren, Yuanfang; Sarkar, Aisharjya; Kahveci, Tamer

    2018-06-26

    Identifying motifs in biological networks is essential in uncovering key functions served by these networks. Finding non-overlapping motif instances is however a computationally challenging task. The fact that biological interactions are uncertain events further complicates the problem, as it makes the existence of an embedding of a given motif an uncertain event as well. In this paper, we develop a novel method, ProMotE (Probabilistic Motif Embedding), to count non-overlapping embeddings of a given motif in probabilistic networks. We utilize a polynomial model to capture the uncertainty. We develop three strategies to scale our algorithm to large networks. Our experiments demonstrate that our method scales to large networks in practical time with high accuracy where existing methods fail. Moreover, our experiments on cancer and degenerative disease networks show that our method helps in uncovering key functional characteristics of biological networks.

  18. A novel swarm intelligence algorithm for finding DNA motifs.

    PubMed

    Lei, Chengwei; Ruan, Jianhua

    2009-01-01

    Discovering DNA motifs from co-expressed or co-regulated genes is an important step towards deciphering complex gene regulatory networks and understanding gene functions. Despite significant improvement in the last decade, it still remains one of the most challenging problems in computational molecular biology. In this work, we propose a novel motif finding algorithm that finds consensus patterns using a population-based stochastic optimisation technique called Particle Swarm Optimisation (PSO), which has been shown to be effective in optimising difficult multidimensional problems in continuous domains. We propose to use a word dissimilarity graph to remap the neighborhood structure of the solution space of DNA motifs, and propose a modification of the naive PSO algorithm to accommodate discrete variables. In order to improve efficiency, we also propose several strategies for escaping from local optima and for automatically determining the termination criteria. Experimental results on simulated challenge problems show that our method is both more efficient and more accurate than several existing algorithms. Applications to several sets of real promoter sequences also show that our approach is able to detect known transcription factor binding sites, and outperforms two of the most popular existing algorithms.

  19. Rapid motif compliance scoring with match weight sets.

    PubMed

    Venezia, D; O'Hara, P J

    1993-02-01

    Most current implementations of motif matching in biological sequences have sacrificed the generality of weight matrix scoring for shorter runtimes. The program MOTIF incorporates a weight matrix and a rapid, backtracking tree-search algorithm to score motif compliance with greatly enhanced performance while placing no constraints on the motif. In addition, any positions within a motif can be marked as 'inviolate', thereby requiring an exact match. MOTIF allows a choice of regular expression formats and can use both motif and sequence libraries as either targets or queries. Nucleic acid sequences can optionally be translated by MOTIF in any frame(s) and used against peptide motifs.
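
    Weight matrix scoring with 'inviolate' positions can be illustrated with a simple linear scan. This sketch is not the backtracking tree search that the abstract attributes to MOTIF; it only shows the scoring rule under assumed data structures (a list of per-position weight dictionaries and a dictionary of required residues), and the function name scan is hypothetical.

```python
def scan(sequence, weights, inviolate=None, threshold=0.0):
    """Score every window of the sequence against a weight matrix.

    weights   : list of dicts, one per motif position, residue -> weight
                (residues missing from a dict score 0.0).
    inviolate : dict motif position -> residue that must match exactly.
    Returns (start, window, score) for windows scoring >= threshold.
    """
    inviolate = inviolate or {}
    width = len(weights)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Reject the window outright if any inviolate position mismatches.
        if any(window[pos] != res for pos, res in inviolate.items()):
            continue
        total = sum(col.get(ch, 0.0) for col, ch in zip(weights, window))
        if total >= threshold:
            hits.append((start, window, total))
    return hits


if __name__ == "__main__":
    matrix = [{"A": 2, "C": -1}, {"C": 2, "G": 1}, {"G": 2}]
    print(scan("ACGAGG", matrix, threshold=4))
    print(scan("ACGAGG", matrix, inviolate={1: "C"}, threshold=4))
```

    Checking the inviolate positions before summing the weights lets mismatching windows be discarded cheaply, which is the same pruning idea a tree search would exploit more aggressively.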

  20. A Bioinformatics Approach for Detecting Repetitive Nested Motifs using Pattern Matching.

    PubMed

    Romero, José R; Carballido, Jessica A; Garbus, Ingrid; Echenique, Viviana C; Ponzoni, Ignacio

    2016-01-01

    The identification of nested motifs in genomic sequences is a complex computational problem. The detection of these patterns is important to allow the discovery of transposable element (TE) insertions, incomplete reverse transcripts, deletions, and/or mutations. In this study, a de novo strategy for detecting patterns that represent nested motifs was designed based on exhaustive searches for pairs of motifs and combinatorial pattern analysis. These patterns can be grouped into three categories: motifs within other motifs, motifs flanked by other motifs, and motifs of large size. The methodology used in this study, applied to genomic sequences from the plant species Aegilops tauschii and Oryza sativa, revealed that it is possible to identify putative nested TEs by detecting these three types of patterns. The results were validated through BLAST alignments, which revealed the efficacy and usefulness of the new method, which is called Mamushka.

  1. BEAM web server: a tool for structural RNA motif discovery.

    PubMed

    Pietrosanto, Marco; Adinolfi, Marta; Casula, Riccardo; Ausiello, Gabriele; Ferrè, Fabrizio; Helmer-Citterich, Manuela

    2018-03-15

    RNA structural motif finding is a relevant problem that becomes computationally hard when working on high-throughput data (e.g. eCLIP, PAR-CLIP), often represented by thousands of RNA molecules. Currently, the BEAM server is the only web tool capable of handling tens of thousands of RNAs as input, with a motif discovery procedure that is limited only by current secondary structure prediction accuracy. The recently developed method BEAM (BEAr Motifs finder) can analyze tens of thousands of RNA molecules and identify RNA secondary structure motifs together with a measure of their statistical significance. BEAM is extremely fast thanks to the BEAR encoding, which transforms each RNA secondary structure into a string of characters. BEAM also exploits the evolutionary knowledge contained in a substitution matrix of secondary structure elements, extracted from the RFAM database of families of homologous RNAs. The BEAM web server has been designed to streamline data pre-processing by automatically handling folding and encoding of RNA sequences, giving users a choice of folding program. The server provides an intuitive and informative results page with the list of secondary structure motifs identified, the logo of each motif, its significance, a graphic representation and information about its position in the RNA molecules sharing it. The web server is freely available at http://beam.uniroma2.it/ and is implemented in NodeJS and Python, with all major browsers supported. Contact: marco.pietrosanto@uniroma2.it. Supplementary data are available at Bioinformatics online.

  2. DynaMIT: the dynamic motif integration toolkit

    PubMed Central

    Dassi, Erik; Quattrone, Alessandro

    2016-01-01

    De-novo motif search is a frequently applied bioinformatics procedure to identify and prioritize recurrent elements in sequence sets for biological investigation, such as the ones derived from high-throughput differential expression experiments. Several algorithms have been developed to perform motif search, employing widely different approaches and often giving divergent results. In order to maximize the power of these investigations and ultimately be able to draft solid biological hypotheses, there is a need to apply multiple tools to the same sequences and merge the obtained results. However, motif reporting formats and statistical evaluation methods currently make such an integration task difficult to perform and mostly restricted to specific scenarios. We thus introduce here the Dynamic Motif Integration Toolkit (DynaMIT), an extremely flexible platform that allows users to identify motifs with multiple algorithms, integrate them by means of a user-selected strategy and visualize the results in several ways; furthermore, the platform is user-extendible in all its aspects. DynaMIT is freely available at http://cibioltg.bitbucket.org. PMID:26253738

  3. Discovery and validation of information theory-based transcription factor and cofactor binding site motifs.

    PubMed

    Lu, Ruipeng; Mucaki, Eliseos J; Rogan, Peter K

    2017-03-17

    Data from ChIP-seq experiments can derive the genome-wide binding specificities of transcription factors (TFs) and other regulatory proteins. We analyzed 765 ENCODE ChIP-seq peak datasets of 207 human TFs with a novel motif discovery pipeline based on recursive, thresholded entropy minimization. This approach, while obviating the need to compensate for skewed nucleotide composition, distinguishes true binding motifs from noise, quantifies the strengths of individual binding sites based on computed affinity and detects adjacent cofactor binding sites that coordinate with the targets of primary, immunoprecipitated TFs. We obtained contiguous and bipartite information theory-based position weight matrices (iPWMs) for 93 sequence-specific TFs, discovered 23 cofactor motifs for 127 TFs and revealed six high-confidence novel motifs. The reliability and accuracy of these iPWMs were determined via four independent validation methods, including the detection of experimentally proven binding sites, explanation of effects of characterized SNPs, comparison with previously published motifs and statistical analyses. We also predict previously unreported TF coregulatory interactions (e.g. TF complexes). These iPWMs constitute a powerful tool for predicting the effects of sequence variants in known binding sites, performing mutation analysis on regulatory SNPs and predicting previously unrecognized binding sites and target genes. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
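
    As a rough illustration of what an information theory-based position weight matrix measures, the sketch below computes per-column information content and an individual-site information score in bits. It assumes a uniform background (2 bits maximum per DNA position) and applies no small-sample correction. It does not reproduce the recursive, thresholded entropy minimization pipeline of the abstract, and all function names are hypothetical.

```python
# Illustrative sketch only: uniform background, no small-sample correction.
import math
from collections import Counter
from typing import Dict, List

BASES = "ACGT"


def frequency_matrix(sites: List[str], pseudocount: float = 0.5) -> List[Dict[str, float]]:
    """Per-position base frequencies from aligned, equal-length binding sites."""
    width = len(sites[0])
    total = len(sites) + pseudocount * len(BASES)
    matrix = []
    for col in range(width):
        counts = Counter(site[col] for site in sites)
        matrix.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return matrix


def column_information(freqs: Dict[str, float]) -> float:
    """Information content of one column in bits: 2 - H, where H is Shannon entropy."""
    return 2.0 + sum(p * math.log2(p) for p in freqs.values() if p > 0)


def site_information(site: str, matrix: List[Dict[str, float]]) -> float:
    """Information of one site in bits: sum over positions of 2 + log2 f(base, position)."""
    return sum(2.0 + math.log2(matrix[i][base]) for i, base in enumerate(site))


aligned_sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGACTAA"]  # made-up example sites
matrix = frequency_matrix(aligned_sites)
print([round(column_information(col), 2) for col in matrix])
print(round(site_information("TGACTCA", matrix), 2))
```

    Sites that score higher in bits match the matrix more closely, which is how such scores can be used to rank candidate binding sites.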

  4. Large-scale identification of odorant-binding proteins and chemosensory proteins from expressed sequence tags in insects

    PubMed Central

    2009-01-01

    Background: Insect odorant binding proteins (OBPs) and chemosensory proteins (CSPs) play an important role in chemical communication of insects. Gene discovery of these proteins is a time-consuming task. In recent years, expressed sequence tags (ESTs) of many insect species have accumulated, thus providing a useful resource for gene discovery. Results: We have developed a computational pipeline to identify OBP and CSP genes from insect ESTs. In total, 752,841 insect ESTs were examined from 54 species covering eight Orders of Insecta. From these ESTs, 142 OBPs and 177 CSPs were identified, of which 117 OBPs and 129 CSPs are new. The complete open reading frames (ORFs) of 88 OBPs and 123 CSPs were obtained by electronic elongation. We randomly chose 26 OBPs from eight species of insects, and 21 CSPs from four species for RT-PCR validation. Twenty-two OBPs and 16 CSPs were confirmed by RT-PCR, proving the efficiency and reliability of the algorithm. Together with all family members obtained from the NCBI (OBPs) or the UniProtKB (CSPs), 850 OBPs and 237 CSPs were analyzed for their structural characteristics and evolutionary relationship. Conclusions: A large number of new OBPs and CSPs were found, providing the basis for deeper understanding of these proteins. In addition, the conserved motif and evolutionary analysis provide some new insights into the evolution of insect OBPs and CSPs. Motif patterns fine-tune the functions of OBPs and CSPs, leading to minor differences in binding sex pheromones or plant volatiles in different insect Orders. PMID:20034407

  5. RNA Bricks—a database of RNA 3D motifs and their interactions

    PubMed Central

    Chojnowski, Grzegorz; Waleń, Tomasz; Bujnicki, Janusz M.

    2014-01-01

    The RNA Bricks database (http://iimcb.genesilico.pl/rnabricks) stores information about recurrent RNA 3D motifs and their interactions, found in experimentally determined RNA structures and in RNA–protein complexes. In contrast to other similar tools (RNA 3D Motif Atlas, RNA Frabase, Rloom), RNA motifs, i.e. ‘RNA bricks’, are presented in the molecular environment in which they were determined, including RNA, protein, metal ions, water molecules and ligands. All nucleotide residues in RNA bricks are annotated with structural quality scores that describe real-space correlation coefficients with the electron density data (if available), backbone geometry and possible steric conflicts, which can be used to identify poorly modeled residues. The database is also equipped with an algorithm for 3D motif search and comparison. The algorithm compares spatial positions of backbone atoms of the user-provided query structure and of stored RNA motifs, without relying on sequence or secondary structure information. This enables the identification of local structural similarities among evolutionarily related and unrelated RNA molecules. In addition, the search utility enables searching ‘RNA bricks’ according to sequence similarity, and makes it possible to identify motifs with modified ribonucleotide residues at specific positions. PMID:24220091

  6. A survey of motif finding Web tools for detecting binding site motifs in ChIP-Seq data

    PubMed Central

    2014-01-01

    Abstract: ChIP-Seq (chromatin immunoprecipitation sequencing) offers an advantage for motif finding, as ChIP-Seq experiments narrow the search down to binding site locations. Recent motif finding tools facilitate motif detection by providing user-friendly Web interfaces. In this work, we reviewed nine motif finding Web tools that are capable of detecting binding site motifs in ChIP-Seq data. We showed that each motif finding Web tool has its own advantages for detecting motifs that other tools may not discover. We recommend that users apply multiple motif finding Web tools that implement different algorithms in order to obtain significant motifs, overlapping similar motifs, and non-overlapping motifs. Finally, we provide suggestions for the future development of motif finding Web tools that better assist researchers in finding motifs in ChIP-Seq data. Reviewers: This article was reviewed by Prof. Sandor Pongor, Dr. Yuriy Gusev, and Dr. Shyam Prabhakar (nominated by Prof. Limsoon Wong). PMID:24555784

  7. SARNAclust: Semi-automatic detection of RNA protein binding motifs from immunoprecipitation data

    PubMed Central

    Dotu, Ivan; Adamson, Scott I.; Coleman, Benjamin; Fournier, Cyril; Ricart-Altimiras, Emma; Eyras, Eduardo

    2018-01-01

    RNA-protein binding is critical to gene regulation, controlling fundamental processes including splicing, translation, localization and stability, and aberrant RNA-protein interactions are known to play a role in a wide variety of diseases. However, molecular understanding of RNA-protein interactions remains limited; in particular, identification of RNA motifs that bind proteins has long been challenging, especially when such motifs depend on both sequence and structure. Moreover, although RNA binding proteins (RBPs) often contain more than one binding domain, algorithms capable of identifying more than one binding motif simultaneously have not been developed. In this paper we present a novel pipeline to determine binding peaks in crosslinking immunoprecipitation (CLIP) data, to discover multiple possible RNA sequence/structure motifs among them, and to experimentally validate such motifs. At the core is a new semi-automatic algorithm SARNAclust, the first unsupervised method to identify and deconvolve multiple sequence/structure motifs simultaneously. SARNAclust computes similarity between sequence/structure objects using a graph kernel, providing the ability to isolate the impact of specific features through the bulge graph formalism. Application of SARNAclust to synthetic data shows its capability of clustering 5 motifs at once with a V-measure value of over 0.95, while GraphClust achieves only a V-measure of 0.083 and RNAcontext cannot detect any of the motifs. When applied to existing eCLIP sets, SARNAclust finds known motifs for SLBP and HNRNPC and novel motifs for several other RBPs such as AGGF1, AKAP8L and ILF3. We demonstrate an experimental validation protocol, a targeted Bind-n-Seq-like high-throughput sequencing approach that relies on RNA inverse folding for oligo pool design, that can validate the components within the SLBP motif. Finally, we use this protocol to experimentally interrogate the SARNAclust motif predictions for protein ILF3. Our results support a newly identified partially double-stranded UUUUUGAGA motif similar to that known for the splicing factor HNRNPC. PMID:29596423

  8. In-Silico Identification Of Micro-Loops In Myelodysplastic Syndromes

    NASA Astrophysics Data System (ADS)

    Beck, Dominik; Brandl, Miriam; Pham, Tuan D.; Chang, Chung-Che; Zhou, Xiaobo

    2011-06-01

    Micro-loops are regulatory network motifs that leverage transcriptional and posttranscriptional control to effectively regulate the transcriptome. In this paper a regulatory network for Myelodysplastic Syndromes (MDSs) was constructed from the literature and publicly available data sources. The network was filtered using data from deep-sequencing of small RNAs, exon and microarrays. Motif discovery showed that micro-loops might exist in MDS. We further used the identified micro-loops and performed basic network analysis to identify the known disease gene RUNX1/AML, as well as miRNA family hsa-mir-181. This suggested that the concept of micro-loops can be applied to enhance disease gene identification and biomarker discovery.

  9. Improved accuracy of supervised CRM discovery with interpolated Markov models and cross-species comparison

    PubMed Central

    Kazemian, Majid; Zhu, Qiyun; Halfon, Marc S.; Sinha, Saurabh

    2011-01-01

    Despite recent advances in experimental approaches for identifying transcriptional cis-regulatory modules (CRMs, ‘enhancers’), direct empirical discovery of CRMs for all genes in all cell types and environmental conditions is likely to remain an elusive goal. Effective methods for computational CRM discovery are thus a critically needed complement to empirical approaches. However, existing computational methods that search for clusters of putative binding sites are ineffective if the relevant TFs and/or their binding specificities are unknown. Here, we provide a significantly improved method for ‘motif-blind’ CRM discovery that does not depend on knowledge or accurate prediction of TF-binding motifs and is effective when limited knowledge of functional CRMs is available to ‘supervise’ the search. We propose a new statistical method, based on ‘Interpolated Markov Models’, for motif-blind, genome-wide CRM discovery. It captures the statistical profile of variable length words in known CRMs of a regulatory network and finds candidate CRMs that match this profile. The method also uses orthologs of the known CRMs from closely related genomes. We perform in silico evaluation of predicted CRMs by assessing whether their neighboring genes are enriched for the expected expression patterns. This assessment uses a novel statistical test that extends the widely used Hypergeometric test of gene set enrichment to account for variability in intergenic lengths. We find that the new CRM prediction method is superior to existing methods. Finally, we experimentally validate 12 new CRM predictions by examining their regulatory activity in vivo in Drosophila; 10 of the tested CRMs were found to be functional, while 6 of the top 7 predictions showed the expected activity patterns. We make our program available as downloadable source code, and as a plugin for a genome browser installed on our servers. PMID:21821659
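
    The enrichment assessment in this abstract builds on the hypergeometric test. The sketch below shows only the standard test, as an upper-tail probability; the authors' extension for variable intergenic lengths is not reproduced. The function name and the example numbers are hypothetical.

```python
# Illustrative sketch only: the standard hypergeometric upper-tail test.
from math import comb


def hypergeom_upper_tail(k: int, population: int, successes: int, draws: int) -> float:
    """P(X >= k) when `draws` items are taken without replacement from `population`
    items, of which `successes` carry the property of interest."""
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(population, draws)


# Made-up example: 1000 genes, 50 with the expected expression pattern,
# 20 genes neighbouring predicted CRMs, 6 of which show the pattern.
p_value = hypergeom_upper_tail(6, population=1000, successes=50, draws=20)
print(f"{p_value:.3g}")
```

    Exact integer binomials from `math.comb` avoid the rounding problems of a factorial-based implementation for populations of this size.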

  10. Motif Discovery in Speech: Application to Monitoring Alzheimer's Disease.

    PubMed

    Garrard, Peter; Nemes, Vanda; Nikolic, Dragana; Barney, Anna

    2017-01-01

    Perseveration - repetition of words, phrases or questions in speech - is commonly described in Alzheimer's disease (AD). Measuring perseveration is difficult, but may index cognitive performance, aiding diagnosis and disease monitoring. Continuous recording of speech would produce a large quantity of data requiring painstaking manual analysis, and risk violating patients' and others' privacy. A secure record and an automated approach to analysis are required. Our objectives were to record bone-conducted acoustic energy fluctuations from a subject's vocal apparatus using an accelerometer, to describe the recording and analysis stages in detail, and to demonstrate that the approach is feasible in AD. Speech-related vibration was captured by an accelerometer affixed above the temporomandibular joint. Healthy subjects read a script with embedded repetitions. Features were extracted from recorded signals and combined using Principal Component Analysis to obtain a one-dimensional representation of the feature vector. Motif discovery techniques were used to detect repeated segments. The equipment was tested in AD patients to determine device acceptability and recording quality. Comparison with the known location of embedded motifs suggests that, with appropriate parameter tuning, the motif discovery method can detect repetitions. The device was acceptable to patients and produced adequate signal quality in their home environments. We established that continuously recording bone-conducted speech and detecting perseverative patterns were both possible. In future studies we plan to associate the frequency of verbal repetitions with stage, progression and type of dementia. It is possible that the method could contribute to the assessment of disease-modifying treatments. Copyright © Bentham Science Publishers.

  11. Web server to identify similarity of amino acid motifs to compounds (SAAMCO).

    PubMed

    Casey, Fergal P; Davey, Norman E; Baran, Ivan; Varekova, Radka Svobodova; Shields, Denis C

    2008-07-01

    Protein-protein interactions are fundamental in mediating biological processes including metabolism, cell growth, and signaling. To be able to selectively inhibit or induce protein activity or complex formation is a key feature in controlling disease. For those situations in which protein-protein interactions derive substantial affinity from short linear peptide sequences, or motifs, we can develop search algorithms for peptidomimetic compounds that resemble the short peptide's structure but are not compromised by poor pharmacological properties. SAAMCO is a Web service (http://bioware.ucd.ie/~saamco) that facilitates the screening of motifs with known structures against bioactive compound databases. It is built on an algorithm that defines compound similarity based on the presence of appropriate amino acid side chain fragments and a favorable Root Mean Squared Deviation (RMSD) between compound and motif structure. The methodology is efficient as the available compound databases are preprocessed and fast regular expression searches filter potential matches before time-intensive 3D superposition is performed. The required input information is minimal, and the compound databases have been selected to maximize the availability of information on biological activity. "Hits" are accompanied by a visualization window and links to source database entries. Motif matching can be defined on partial or full similarity, which will increase or reduce, respectively, the number of potential mimetic compounds. The Web server provides the functionality for rapid screening of known or putative interaction motifs against prepared compound libraries using a novel search algorithm. The tabulated results can be analyzed by linking to appropriate databases and by visualization.

  12. Syntactic structures in languages and biology.

    PubMed

    Horn, David

    2008-08-01

    Both natural languages and cell biology make use of one-dimensional encryption. Their investigation calls for syntactic deciphering of the text and semantic understanding of the resulting structures. Here we discuss recently published algorithms that allow for such searches: automatic distillation of structure (ADIOS) that is successful in discovering syntactic structures in linguistic texts and its motif extraction (MEX) component that can be used for uncovering motifs in DNA and protein sequences. The underlying principles of these syntactic algorithms and some of their results will be described.

  13. SANDPUMA: ensemble predictions of nonribosomal peptide chemistry reveal biosynthetic diversity across Actinobacteria.

    PubMed

    Chevrette, Marc G; Aicheler, Fabian; Kohlbacher, Oliver; Currie, Cameron R; Medema, Marnix H

    2017-10-15

    Nonribosomally synthesized peptides (NRPs) are natural products with widespread applications in medicine and biotechnology. Many algorithms have been developed to predict the substrate specificities of nonribosomal peptide synthetase adenylation (A) domains from DNA sequences, which enables prioritization and dereplication, and integration with other data types in discovery efforts. However, insufficient training data and a lack of clarity regarding prediction quality have impeded optimal use. Here, we introduce prediCAT, a new phylogenetics-inspired algorithm, which quantitatively estimates the degree of predictability of each A-domain. We then systematically benchmarked all algorithms on a newly gathered, independent test set of 434 A-domain sequences, showing that active-site-motif-based algorithms outperform whole-domain-based methods. Subsequently, we developed SANDPUMA, a powerful ensemble algorithm, based on newly trained versions of all high-performing algorithms, which significantly outperforms individual methods. Finally, we deployed SANDPUMA in a systematic investigation of 7635 Actinobacteria genomes, suggesting that NRP chemical diversity is much higher than previously estimated. SANDPUMA has been integrated into the widely used antiSMASH biosynthetic gene cluster analysis pipeline and is also available as an open-source, standalone tool. SANDPUMA is freely available at https://bitbucket.org/chevrm/sandpuma and as a docker image at https://hub.docker.com/r/chevrm/sandpuma/ under the GNU Public License 3 (GPL3). Contact: chevrette@wisc.edu or marnix.medema@wur.nl. Supplementary data are available at Bioinformatics online. © The Author (2017). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com

  14. Discovery of functional elements in 12 Drosophila genomes using evolutionary signatures

    PubMed Central

    Stark, Alexander; Lin, Michael F.; Kheradpour, Pouya; Pedersen, Jakob S.; Parts, Leopold; Carlson, Joseph W.; Crosby, Madeline A.; Rasmussen, Matthew D.; Roy, Sushmita; Deoras, Ameya N.; Ruby, J. Graham; Brennecke, Julius; Hodges, Emily; Hinrichs, Angie S.; Caspi, Anat; Paten, Benedict; Park, Seung-Won; Han, Mira V.; Maeder, Morgan L.; Polansky, Benjamin J.; Robson, Bryanne E.; Aerts, Stein; van Helden, Jacques; Hassan, Bassem; Gilbert, Donald G.; Eastman, Deborah A.; Rice, Michael; Weir, Michael; Hahn, Matthew W.; Park, Yongkyu; Dewey, Colin N.; Pachter, Lior; Kent, W. James; Haussler, David; Lai, Eric C.; Bartel, David P.; Hannon, Gregory J.; Kaufman, Thomas C.; Eisen, Michael B.; Clark, Andrew G.; Smith, Douglas; Celniker, Susan E.; Gelbart, William M.; Kellis, Manolis

    2008-01-01

    Sequencing of multiple related species followed by comparative genomics analysis constitutes a powerful approach for the systematic understanding of any genome. Here, we use the genomes of 12 Drosophila species for the de novo discovery of functional elements in the fly. Each type of functional element shows characteristic patterns of change, or ‘evolutionary signatures’, dictated by its precise selective constraints. Such signatures enable recognition of new protein-coding genes and exons, spurious and incorrect gene annotations, and numerous unusual gene structures, including abundant stop-codon readthrough. Similarly, we predict non-protein-coding RNA genes and structures, and new microRNA (miRNA) genes. We provide evidence of miRNA processing and functionality from both hairpin arms and both DNA strands. We identify several classes of pre- and post-transcriptional regulatory motifs, and predict individual motif instances with high confidence. We also study how discovery power scales with the divergence and number of species compared, and we provide general guidelines for comparative studies. PMID:17994088

  15. Multiplexed Thiol Reactivity Profiling for Target Discovery of Electrophilic Natural Products.

    PubMed

    Tian, Caiping; Sun, Rui; Liu, Keke; Fu, Ling; Liu, Xiaoyu; Zhou, Wanqi; Yang, Yong; Yang, Jing

    2017-11-16

    Electrophilic groups, such as Michael acceptors and epoxides, are common motifs in natural products (NPs). Electrophilic NPs can act through covalent modification of cysteinyl thiols on functional proteins, and exhibit potent cytotoxicity and anti-inflammatory/cancer activities. Here we describe a new chemoproteomic strategy, termed multiplexed thiol reactivity profiling (MTRP), and its use in target discovery of electrophilic NPs. We demonstrate the utility of MTRP by identifying cellular targets of gambogic acid, an electrophilic NP that is currently under evaluation in clinical trials as an anticancer agent. Moreover, MTRP enables simultaneous comparison of seven structurally diversified α,β-unsaturated γ-lactones, which provides insights into the relative proteomic reactivity and target preference of diverse structural scaffolds coupled to a common electrophilic motif and reveals various potential druggable targets with liganded cysteines. We anticipate that this new method for thiol reactivity profiling in a multiplexed manner will find broad application in redox biology and drug discovery. Copyright © 2017 Elsevier Ltd. All rights reserved.

  16. WordSeeker: concurrent bioinformatics software for discovering genome-wide patterns and word-based genomic signatures

    PubMed Central

    2010-01-01

    Background: An important focus of genomic science is the discovery and characterization of all functional elements within genomes. In silico methods are used in genome studies to discover putative regulatory genomic elements (called words or motifs). Although a number of methods have been developed for motif discovery, most of them lack the scalability needed to analyze large genomic data sets. Methods: This manuscript presents WordSeeker, an enumerative motif discovery toolkit that utilizes multi-core and distributed computational platforms to enable scalable analysis of genomic data. A controller task coordinates activities of worker nodes, each of which (1) enumerates a subset of the DNA word space and (2) scores words with a distributed Markov chain model. Results: A comprehensive suite of performance tests was conducted to demonstrate the performance, speedup and efficiency of WordSeeker. The scalability of the toolkit enabled the analysis of the entire genome of Arabidopsis thaliana; the results of the analysis were integrated into The Arabidopsis Gene Regulatory Information Server (AGRIS). A public version of WordSeeker was deployed on the Glenn cluster at the Ohio Supercomputer Center. Conclusion: WordSeeker effectively utilizes concurrent computing platforms to enable the identification of putative functional elements in genomic data sets. This capability facilitates the analysis of the large quantity of sequenced genomic data. PMID:21210985
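
    The two worker steps named in this abstract, enumerating words and scoring them against a Markov chain model, can be sketched on a single machine as follows. This is an assumption-laden toy: it fits a first-order Markov chain to the same sequence and scores each observed word by the log2 ratio of observed to expected counts. It is not WordSeeker's scoring function or its distributed design, and all names are hypothetical.

```python
# Illustrative single-process sketch only: not WordSeeker's implementation.
import math
from collections import Counter
from typing import Dict


def count_words(seq: str, k: int) -> Counter:
    """Count every overlapping word of length k in `seq`."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def markov1_expected(word: str, mono: Counter, di: Counter, n_words: int) -> float:
    """Expected count of `word` under a first-order Markov chain fitted to the sequence."""
    prob = mono[word[0]] / sum(mono.values())
    for a, b in zip(word, word[1:]):
        # P(b | a) is estimated as count(ab) / count(a followed by any base)
        outgoing = sum(c for pair, c in di.items() if pair[0] == a)
        if outgoing == 0:
            return 0.0
        prob *= di[a + b] / outgoing
    return prob * n_words


def score_words(seq: str, k: int) -> Dict[str, float]:
    """log2(observed / expected) for every word of length k that occurs in `seq`."""
    mono, di, words = count_words(seq, 1), count_words(seq, 2), count_words(seq, k)
    n_words = len(seq) - k + 1
    scores = {}
    for word, observed in words.items():
        expected = markov1_expected(word, mono, di, n_words)
        scores[word] = math.log2(observed / expected) if expected > 0 else float("inf")
    return scores


scores = score_words("ACGTTGACGTACGTTTGACGAACGT", k=4)
for word, s in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(word, round(s, 2))
```

    In a distributed setting, one plausible partitioning (an assumption, not a documented detail) is to assign each worker the words that share a given prefix, so that workers never score the same word twice.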

  17. Detecting Statistically Significant Communities of Triangle Motifs in Undirected Networks

    DTIC Science & Technology

    2016-04-26

    Report type: Final. Dates covered: 15 Oct 2014 to 14 Jan 2015. Title and subtitle: Detecting statistically significant clusters of... extend the work of Perry et al. [6] by developing a statistical framework that supports the detection of triangle motif-based clusters in complex... priori, the need for triangle motif-based clustering. 2. Developed an algorithm for clustering undirected networks, where the triangle configuration was

  18. Identification of GATC- and CCGG-recognizing Type II REases and their putative specificity-determining positions using Scan2S—a novel motif scan algorithm with optional secondary structure constraints

    PubMed Central

    Niv, Masha Y.; Skrabanek, Lucy; Roberts, Richard J.; Scheraga, Harold A.; Weinstein, Harel

    2008-01-01

    Restriction endonucleases (REases) are DNA-cleaving enzymes that have become indispensable tools in molecular biology. Type II REases are highly divergent in sequence despite their common structural core, function and, in some cases, common specificities towards DNA sequences. This makes it difficult to identify and classify them functionally based on sequence, and has hampered the efforts of specificity-engineering. Here, we define novel REase sequence motifs, which extend beyond the PD-(D/E)XK hallmark, and incorporate secondary structure information. The automated search using these motifs is carried out with a newly developed fast regular expression matching algorithm that accommodates long patterns with optional secondary structure constraints. Using this new tool, named Scan2S, motifs derived from REases with specificity towards GATC- and CCGG-containing DNA sequences successfully identify REases of the same specificity. Notably, some of these sequences are not identified by standard sequence detection tools. The new motifs highlight potential specificity-determining positions that do not fully overlap for the GATC- and the CCGG-recognizing REases and are candidates for specificity re-engineering. PMID:17972284
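
    A regular expression scan with an optional secondary structure constraint can be sketched as below. The pattern, the gap length, the three-state structure alphabet (H, E, C) and the rule that the `(D/E)XK` part must lie in a strand are all illustrative assumptions. This is not the Scan2S algorithm or its motif syntax.

```python
# Illustrative sketch only: not the Scan2S algorithm or its pattern language.
import re
from typing import Iterator, Tuple


def scan_with_structure(sequence: str, structure: str, pattern: str,
                        required_state: str) -> Iterator[Tuple[int, str]]:
    """Yield (start, matched_text) for regex hits whose named group 'ss' lies
    entirely within residues annotated with `required_state`."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure strings must have equal length")
    # re.finditer reports non-overlapping matches, scanning left to right.
    for match in re.finditer(pattern, sequence):
        lo, hi = match.span("ss")
        if set(structure[lo:hi]) <= {required_state}:
            yield match.start(), match.group(0)


# Hypothetical pattern loosely modelled on the PD-(D/E)XK hallmark with a short gap.
PATTERN = r"PD.{2,4}(?P<ss>[DE].K)"
seq = "MAPDGSTEAKLLPDAAADVKQ"
sse = "CCCCCCCCCCCCCCEEEEEEC"  # made-up per-residue secondary structure states
print(list(scan_with_structure(seq, sse, PATTERN, "E")))
# prints [(12, 'PDAAADVK')]
```

    The first sequence hit (starting at offset 2) is rejected because its `(D/E)XK` residues are annotated as coil, which shows how the structure string filters matches that the sequence pattern alone would accept.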

  19. Identification of GATC- and CCGG-recognizing Type II REases and their putative specificity-determining positions using Scan2S--a novel motif scan algorithm with optional secondary structure constraints.

    PubMed

    Niv, Masha Y; Skrabanek, Lucy; Roberts, Richard J; Scheraga, Harold A; Weinstein, Harel

    2008-05-01

    Restriction endonucleases (REases) are DNA-cleaving enzymes that have become indispensable tools in molecular biology. Type II REases are highly divergent in sequence despite their common structural core, function and, in some cases, common specificities towards DNA sequences. This makes it difficult to identify and classify them functionally based on sequence, and has hampered the efforts of specificity-engineering. Here, we define novel REase sequence motifs, which extend beyond the PD-(D/E)XK hallmark, and incorporate secondary structure information. The automated search using these motifs is carried out with a newly developed fast regular expression matching algorithm that accommodates long patterns with optional secondary structure constraints. Using this new tool, named Scan2S, motifs derived from REases with specificity towards GATC- and CCGG-containing DNA sequences successfully identify REases of the same specificity. Notably, some of these sequences are not identified by standard sequence detection tools. The new motifs highlight potential specificity-determining positions that do not fully overlap for the GATC- and the CCGG-recognizing REases and are candidates for specificity re-engineering.

  20. The discovery of 9/8-ribbons, β/γ-peptides with curved shapes governed by a combined configuration-conformation code.

    PubMed

    Grison, Claire M; Robin, Sylvie; Aitken, David J

    2015-11-21

    The de novo design of a β/γ-peptidic foldamer motif has led to the discovery of an unprecedented 9/8-ribbon featuring an uninterrupted alternating C9/C8 hydrogen-bonding network. The ribbons adopt partially curved topologies determined synchronistically by the β-residue configuration and the γ-residue conformation sets.

  1. Inherent limitations of probabilistic models for protein-DNA binding specificity

    PubMed Central

    Ruan, Shuxiang

    2017-01-01

    The specificities of transcription factors are most commonly represented with probabilistic models. These models provide a probability for each base occurring at each position within the binding site and the positions are assumed to contribute independently. The model is simple and intuitive and is the basis for many motif discovery algorithms. However, the model also has inherent limitations that prevent it from accurately representing true binding probabilities, especially for the highest affinity sites under conditions of high protein concentration. The limitations are not due to the assumption of independence between positions but rather are caused by the non-linear relationship between binding affinity and binding probability and the fact that independent normalization at each position skews the site probabilities. Generally probabilistic models are reasonably good approximations, but new high-throughput methods allow for biophysical models with increased accuracy that should be used whenever possible. PMID:28686588
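
    The two ideas in this abstract can be illustrated with a minimal Python sketch. It assumes a made-up three-position probability matrix and uses the standard single-site equilibrium occupancy formula, [TF] / ([TF] + Kd), as one common way to express the non-linear link between affinity and binding probability. The matrix values, function names and concentrations are illustrative assumptions, not taken from the paper.

```python
# Hypothetical 3-position probability matrix (one dict per position).
# The values are invented for illustration only.
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]


def site_probability(site, pwm=PWM):
    """Probability of a site when positions contribute independently."""
    if len(site) != len(pwm):
        raise ValueError("site length must match matrix width")
    p = 1.0
    for base, column in zip(site, pwm):
        p *= column[base]  # independent positions: multiply per-position terms
    return p


def occupancy(kd, concentration):
    """Equilibrium occupancy of a single site: [TF] / ([TF] + Kd)."""
    return concentration / (concentration + kd)


if __name__ == "__main__":
    print(site_probability("AGC"))
    # A tenfold difference in Kd almost disappears at high concentration.
    for conc in (1.0, 1000.0):
        print(conc, occupancy(1.0, conc), occupancy(10.0, conc))
```

    In this toy setting, sites whose affinities differ tenfold are both nearly saturated once the protein concentration is high, which is the kind of behaviour a per-position probability product cannot capture.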

  2. A Gibbs sampler for motif detection in phylogenetically close sequences

    NASA Astrophysics Data System (ADS)

    Siddharthan, Rahul; van Nimwegen, Erik; Siggia, Eric

    2004-03-01

    Genes are regulated by transcription factors that bind to DNA upstream of genes and recognize short conserved "motifs" in a random intergenic "background". Motif-finders such as the Gibbs sampler compare the probability of these short sequences being represented by "weight matrices" to the probability of their arising from the background "null model", and explore this space (analogous to a free-energy landscape). But closely related species may show conservation not because of functional sites but simply because they have not had sufficient time to diverge, so conventional methods will fail. We introduce a new Gibbs sampler algorithm that accounts for common ancestry when searching for motifs, while requiring minimal "prior" assumptions on the number and types of motifs, assessing the significance of detected motifs by "tracking" clusters that stay together. We apply this scheme to motif detection in sporulation-cycle genes in the yeast S. cerevisiae, using recent sequences of other closely-related Saccharomyces species.

  3. Local Higher-Order Graph Clustering

    PubMed Central

    Yin, Hao; Benson, Austin R.; Leskovec, Jure; Gleich, David F.

    2018-01-01

    Local graph clustering methods aim to find a cluster of nodes by exploring a small region of the graph. These methods are attractive because they enable targeted clustering around a given seed node and are faster than traditional global graph clustering methods because their runtime does not depend on the size of the input graph. However, current local graph partitioning methods are not designed to account for the higher-order structures crucial to the network, nor can they effectively handle directed networks. Here we introduce a new class of local graph clustering methods that address these issues by incorporating higher-order network information captured by small subgraphs, also called network motifs. We develop the Motif-based Approximate Personalized PageRank (MAPPR) algorithm that finds clusters containing a seed node with minimal motif conductance, a generalization of the conductance metric for network motifs. We generalize existing theory to prove the fast running time (independent of the size of the graph) and obtain theoretical guarantees on the cluster quality (in terms of motif conductance). We also develop a theory of node neighborhoods for finding sets that have small motif conductance, and apply these results to the case of finding good seed nodes to use as input to the MAPPR algorithm. Experimental validation on community detection tasks in both synthetic and real-world networks, shows that our new framework MAPPR outperforms the current edge-based personalized PageRank methodology. PMID:29770258
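
    As a rough illustration of motif conductance, the following Python sketch computes it for the triangle motif on a small undirected graph. It assumes the usual definition: the number of motif instances cut by the cluster boundary, divided by the smaller of the two motif volumes (motif end points inside and outside the cluster). The function name and the toy graph are hypothetical, and this is not the MAPPR algorithm itself, only the quantity it aims to minimise.

```python
def triangle_conductance(edges, cluster):
    """Triangle-motif conductance of `cluster` in a simple undirected graph."""
    # Build adjacency sets.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Enumerate triangles: a common neighbour of both ends of an edge.
    triangles = set()
    for u, v in edges:
        for w in adj[u] & adj[v]:
            triangles.add(frozenset((u, v, w)))

    cluster = set(cluster)
    cut = vol_in = vol_out = 0
    for tri in triangles:
        inside = len(tri & cluster)
        vol_in += inside               # motif end points inside the cluster
        vol_out += len(tri) - inside   # motif end points outside the cluster
        if 0 < inside < len(tri):
            cut += 1                   # triangle spans the boundary
    denom = min(vol_in, vol_out)
    return cut / denom if denom else float("inf")


if __name__ == "__main__":
    # Triangles {1,2,3}, {3,4,5} and {4,5,6}.
    toy_edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (3, 5), (5, 6), (4, 6)]
    print(triangle_conductance(toy_edges, {1, 2, 3}))
```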

  4. Improved accuracy of supervised CRM discovery with interpolated Markov models and cross-species comparison.

    PubMed

    Kazemian, Majid; Zhu, Qiyun; Halfon, Marc S; Sinha, Saurabh

    2011-12-01

    Despite recent advances in experimental approaches for identifying transcriptional cis-regulatory modules (CRMs, 'enhancers'), direct empirical discovery of CRMs for all genes in all cell types and environmental conditions is likely to remain an elusive goal. Effective methods for computational CRM discovery are thus a critically needed complement to empirical approaches. However, existing computational methods that search for clusters of putative binding sites are ineffective if the relevant TFs and/or their binding specificities are unknown. Here, we provide a significantly improved method for 'motif-blind' CRM discovery that does not depend on knowledge or accurate prediction of TF-binding motifs and is effective when limited knowledge of functional CRMs is available to 'supervise' the search. We propose a new statistical method, based on 'Interpolated Markov Models', for motif-blind, genome-wide CRM discovery. It captures the statistical profile of variable length words in known CRMs of a regulatory network and finds candidate CRMs that match this profile. The method also uses orthologs of the known CRMs from closely related genomes. We perform in silico evaluation of predicted CRMs by assessing whether their neighboring genes are enriched for the expected expression patterns. This assessment uses a novel statistical test that extends the widely used Hypergeometric test of gene set enrichment to account for variability in intergenic lengths. We find that the new CRM prediction method is superior to existing methods. Finally, we experimentally validate 12 new CRM predictions by examining their regulatory activity in vivo in Drosophila; 10 of the tested CRMs were found to be functional, while 6 of the top 7 predictions showed the expected activity patterns. We make our program available as downloadable source code, and as a plugin for a genome browser installed on our servers. © The Author(s) 2011. Published by Oxford University Press.

  5. Discovery of candidate KEN-box motifs using cell cycle keyword enrichment combined with native disorder prediction and motif conservation.

    PubMed

    Michael, Sushama; Travé, Gilles; Ramu, Chenna; Chica, Claudia; Gibson, Toby J

    2008-02-15

    KEN-box-mediated target selection is one of the mechanisms used in the proteasomal destruction of mitotic cell cycle proteins via the APC/C complex. While annotating the Eukaryotic Linear Motif resource (ELM, http://elm.eu.org/), we found that KEN motifs were significantly enriched in human protein entries with cell cycle keywords in the UniProt/Swiss-Prot database, implying that KEN-boxes might be more common than reported. Matches to short linear motifs in protein database searches are not, per se, significant. KEN-box enrichment with cell cycle Gene Ontology terms suggests that collectively these motifs are functional but does not prove that any given instance is so. Candidates were surveyed for native disorder prediction using GlobPlot and IUPred and for motif conservation in homologues. Among >25 strong new candidates, the most notable are human HIPK2, CHFR, CDC27, Dab2, Upf2, kinesin Eg5, DNA Topoisomerase 1 and yeast Cdc5 and Swi5. A similar number of weaker candidates were present. These proteins have yet to be tested for APC/C targeted destruction, providing potential new avenues of research.

  6. SPLASH: structural pattern localization analysis by sequential histograms.

    PubMed

    Califano, A

    2000-04-01

    The discovery of sparse amino acid patterns that match repeatedly in a set of protein sequences is an important problem in computational biology. Statistically significant patterns, that is patterns that occur more frequently than expected, may identify regions that have been preserved by evolution and which may therefore play a key functional or structural role. Sparseness can be important because a handful of non-contiguous residues may play a key role, while others, in between, may be changed without significant loss of function or structure. Similar arguments may be applied to conserved DNA patterns. Available sparse pattern discovery algorithms are either inefficient or impose limitations on the type of patterns that can be discovered. This paper introduces a deterministic pattern discovery algorithm, called Splash, which can find sparse amino or nucleic acid patterns matching identically or similarly in a set of protein or DNA sequences. Sparse patterns of any length, up to the size of the input sequence, can be discovered without significant loss in performances. Splash is extremely efficient and embarrassingly parallel by nature. Large databases, such as a complete genome or the non-redundant SWISS-PROT database can be processed in a few hours on a typical workstation. Alternatively, a protein family or superfamily, with low overall homology, can be analyzed to discover common functional or structural signatures. Some examples of biologically interesting motifs discovered by Splash are reported for the histone I and for the G-Protein Coupled Receptor families. Due to its efficiency, Splash can be used to systematically and exhaustively identify conserved regions in protein family sets. These can then be used to build accurate and sensitive PSSM or HMM models for sequence analysis. Splash is available to non-commercial research centers upon request, conditional on the signing of a test field agreement. 
    Contact: acal@us.ibm.com. Splash main page: http://www.research.ibm.com/splash

  7. KIRMES: kernel-based identification of regulatory modules in euchromatic sequences.

    PubMed

    Schultheiss, Sebastian J; Busch, Wolfgang; Lohmann, Jan U; Kohlbacher, Oliver; Rätsch, Gunnar

    2009-08-15

    Understanding transcriptional regulation is one of the main challenges in computational biology. An important problem is the identification of transcription factor (TF) binding sites in promoter regions of potential TF target genes. It is typically approached by position weight matrix-based motif identification algorithms using Gibbs sampling, or heuristics to extend seed oligos. Such algorithms succeed in identifying single, relatively well-conserved binding sites, but tend to fail when it comes to the identification of combinations of several degenerate binding sites, as those often found in cis-regulatory modules. We propose a new algorithm that combines the benefits of existing motif finding with the ones of support vector machines (SVMs) to find degenerate motifs in order to improve the modeling of regulatory modules. In experiments on microarray data from Arabidopsis thaliana, we were able to show that the newly developed strategy significantly improves the recognition of TF targets. The python source code (open source-licensed under GPL), the data for the experiments and a Galaxy-based web service are available at http://www.fml.mpg.de/raetsch/suppl/kirmes/.

  8. Cloud-based MOTIFSIM: Detecting Similarity in Large DNA Motif Data Sets.

    PubMed

    Tran, Ngoc Tam L; Huang, Chun-Hsi

    2017-05-01

    We developed the cloud-based MOTIFSIM on Amazon Web Services (AWS) cloud. The tool is an extended version from our web-based tool version 2.0, which was developed based on a novel algorithm for detecting similarity in multiple DNA motif data sets. This cloud-based version further allows researchers to exploit the computing resources available from AWS to detect similarity in multiple large-scale DNA motif data sets resulting from the next-generation sequencing technology. The tool is highly scalable with expandable AWS.

  9. Searching RNA motifs and their intermolecular contacts with constraint networks.

    PubMed

    Thébault, P; de Givry, S; Schiex, T; Gaspin, C

    2006-09-01

    Searching for RNA gene occurrences in genomic sequences is a task whose importance has been renewed by the recent discovery of numerous functional RNAs, often interacting with other ligands. Even if several programs exist for RNA motif search, none exists that can represent and solve the problem of searching for occurrences of RNA motifs in interaction with other molecules. We present a constraint network formulation of this problem. RNAs are represented as structured motifs that can occur on more than one sequence and that are related by possible hybridization. The implemented tool MilPat is used to search for several sRNA families in genomic sequences. Results show that MilPat can efficiently search for interacting motifs in large genomic sequences and offers a simple and extensible framework to solve such problems. New and known sRNAs are identified as H/ACA candidates in Methanocaldococcus jannaschii. http://carlit.toulouse.inra.fr/MilPaT/MilPat.pl.

  10. CPI motif interaction is necessary for capping protein function in cells

    PubMed Central

    Edwards, Marc; McConnell, Patrick; Schafer, Dorothy A.; Cooper, John A.

    2015-01-01

    Capping protein (CP) has critical roles in actin assembly in vivo and in vitro. CP binds with high affinity to the barbed end of actin filaments, blocking the addition and loss of actin subunits. Heretofore, models for actin assembly in cells generally assumed that CP is constitutively active, diffusing freely to find and cap barbed ends. However, CP can be regulated by binding of the ‘capping protein interaction’ (CPI) motif, found in a diverse and otherwise unrelated set of proteins that decreases, but does not abolish, the actin-capping activity of CP and promotes uncapping in biochemical experiments. Here, we report that CP localization and the ability of CP to function in cells requires interaction with a CPI-motif-containing protein. Our discovery shows that cells target and/or modulate the capping activity of CP via CPI motif interactions in order for CP to localize and function in cells. PMID:26412145

  11. A flexible motif search technique based on generalized profiles.

    PubMed

    Bucher, P; Karplus, K; Moeri, N; Hofmann, K

    1996-03-01

    A flexible motif search technique is presented which has two major components: (1) a generalized profile syntax serving as a motif definition language; and (2) a motif search method specifically adapted to the problem of finding multiple instances of a motif in the same sequence. The new profile structure, which is the core of the generalized profile syntax, combines the functions of a variety of motif descriptors implemented in other methods, including regular expression-like patterns, weight matrices, previously used profiles, and certain types of hidden Markov models (HMMs). The relationship between generalized profiles and other biomolecular motif descriptors is analyzed in detail, with special attention to HMMs. Generalized profiles are shown to be equivalent to a particular class of HMMs, and conversion procedures in both directions are given. The conversion procedures provide an interpretation for local alignment in the framework of stochastic models, allowing for clear, simple significance tests. A mathematical statement of the motif search problem defines the new method exactly without linking it to a specific algorithmic solution. Part of the definition includes a new definition of disjointness of alignments.

  12. Integration of Bioinformatics and Synthetic Promoters Leads to the Discovery of Novel Elicitor-Responsive cis-Regulatory Sequences in Arabidopsis

    PubMed Central

    Koschmann, Jeannette; Machens, Fabian; Becker, Marlies; Niemeyer, Julia; Schulze, Jutta; Bülow, Lorenz; Stahl, Dietmar J.; Hehl, Reinhard

    2012-01-01

    A combination of bioinformatic tools, high-throughput gene expression profiles, and the use of synthetic promoters is a powerful approach to discover and evaluate novel cis-sequences in response to specific stimuli. With Arabidopsis (Arabidopsis thaliana) microarray data annotated to the PathoPlant database, 732 different queries with a focus on fungal and oomycete pathogens were performed, leading to 510 up-regulated gene groups. Using the binding site estimation suite of tools, BEST, 407 conserved sequence motifs were identified in promoter regions of these coregulated gene sets. Motif similarities were determined with STAMP, classifying the 407 sequence motifs into 37 families. A comparative analysis of these 37 families with the AthaMap, PLACE, and AGRIS databases revealed similarities to known cis-elements but also led to the discovery of cis-sequences not yet implicated in pathogen response. Using a parsley (Petroselinum crispum) protoplast system and a modified reporter gene vector with an internal transformation control, 25 elicitor-responsive cis-sequences from 10 different motif families were identified. Many of the elicitor-responsive cis-sequences also drive reporter gene expression in an Agrobacterium tumefaciens infection assay in Nicotiana benthamiana. This work significantly increases the number of known elicitor-responsive cis-sequences and demonstrates the successful integration of a diverse set of bioinformatic resources combined with synthetic promoter analysis for data mining and functional screening in plant-pathogen interaction. PMID:22744985

  13. A naturally occurring, noncanonical GTP aptamer made of simple tandem repeats

    PubMed Central

    Curtis, Edward A; Liu, David R

    2014-01-01

    Recently, we used in vitro selection to identify a new class of naturally occurring GTP aptamer called the G motif. Here we report the discovery and characterization of a second class of naturally occurring GTP aptamer, the “CA motif.” The primary sequence of this aptamer is unusual in that it consists entirely of tandem repeats of CA-rich motifs as short as three nucleotides. Several active variants of the CA motif aptamer lack the ability to form consecutive Watson-Crick base pairs in any register, while others consist of repeats containing only cytidine and adenosine residues, indicating that noncanonical interactions play important roles in its structure. The circular dichroism spectrum of the CA motif aptamer is distinct from that of A-form RNA and other major classes of nucleic acid structures. Bioinformatic searches indicate that the CA motif is absent from most archaeal and bacterial genomes, but occurs in at least 70 percent of approximately 400 eukaryotic genomes examined. These searches also uncovered several phylogenetically conserved examples of the CA motif in rodent (mouse and rat) genomes. Together, these results reveal the existence of a second class of naturally occurring GTP aptamer whose sequence requirements, like that of the G motif, are not consistent with those of a canonical secondary structure. They also indicate a new and unexpected potential biochemical activity of certain naturally occurring tandem repeats. PMID:24824832

  14. Automated Recognition of RNA Structure Motifs by Their SHAPE Data Signatures.

    PubMed

    Radecki, Pierce; Ledda, Mirko; Aviran, Sharon

    2018-06-14

    High-throughput structure profiling (SP) experiments that provide information at nucleotide resolution are revolutionizing our ability to study RNA structures. Of particular interest are RNA elements whose underlying structures are necessary for their biological functions. We previously introduced patteRNA, an algorithm for rapidly mining SP data for patterns characteristic of such motifs. This work provided a proof-of-concept for the detection of motifs and the capability of distinguishing structures displaying pronounced conformational changes. Here, we describe several improvements and automation routines to patteRNA. We then consider more elaborate biological situations starting with the comparison or integration of results from searches for distinct motifs and across datasets. To facilitate such analyses, we characterize patteRNA’s outputs and describe a normalization framework that regularizes results. We then demonstrate that our algorithm successfully discerns between highly similar structural variants of the human immunodeficiency virus type 1 (HIV-1) Rev response element (RRE) and readily identifies its exact location in whole-genome structure profiles of HIV-1. This work highlights the breadth of information that can be gleaned from SP data and broadens the utility of data-driven methods as tools for the detection of novel RNA elements.

  15. D-MATRIX: A web tool for constructing weight matrix of conserved DNA motifs

    PubMed Central

    Sen, Naresh; Mishra, Manoj; Khan, Feroz; Meena, Abha; Sharma, Ashok

    2009-01-01

    Despite considerable efforts to date, DNA motif prediction in whole genomes remains a challenge for researchers. Currently, genome-wide motif prediction tools require either a direct pattern sequence (for a single motif) or a weight matrix (for multiple motifs). Although there are known motif pattern databases and tools for genome-level prediction, there is no tool for weight matrix construction. Considering this, we developed the D-MATRIX tool, which predicts the different types of weight matrix based on a user-defined aligned motif sequence set and motif width. For retrieval of known motif sequences, the user can access commonly used databases such as TFD, RegulonDB, DBTBS and Transfac. The D-MATRIX program uses a simple statistical approach for weight matrix construction, and the matrix can be converted into different file formats according to user requirements. It provides the possibility to identify conserved motifs in co-regulated genes or the whole genome. As an example, we successfully constructed the weight matrix of the LexA transcription factor binding site with the help of known sos-box cis-regulatory elements in the Deinococcus radiodurans genome. The algorithm is implemented in C-Sharp and wrapped in ASP.Net to maintain a user-friendly web interface. The D-MATRIX tool is accessible through the CIMAP domain network. Availability: http://203.190.147.116/dmatrix/ PMID:19759861
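
    The abstract does not spell out the statistical approach D-MATRIX uses, so the following Python sketch shows only one generic way to build a weight matrix from an aligned set of motif sequences of fixed width: per-column base counts with a pseudocount, converted to log-odds against a uniform background. The function name, the pseudocount and the background frequency are assumptions made for illustration.

```python
import math
from collections import Counter


def weight_matrix(aligned_sites, pseudocount=1.0, background=0.25):
    """Log2-odds weight matrix from equal-length aligned DNA sites."""
    alphabet = "ACGT"
    width = len(aligned_sites[0])
    if any(len(site) != width for site in aligned_sites):
        raise ValueError("all sites must have the same width")

    matrix = []
    total = len(aligned_sites) + pseudocount * len(alphabet)
    for pos in range(width):
        counts = Counter(site[pos] for site in aligned_sites)
        column = {}
        for base in alphabet:
            freq = (counts[base] + pseudocount) / total  # smoothed frequency
            column[base] = math.log2(freq / background)  # log-odds score
        matrix.append(column)
    return matrix


if __name__ == "__main__":
    sites = ["ACGT", "ACGA", "ACGT", "TCGT"]  # toy alignment, not real data
    for column in weight_matrix(sites):
        print({base: round(score, 2) for base, score in column.items()})
```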

  16. D-MATRIX: a web tool for constructing weight matrix of conserved DNA motifs.

    PubMed

    Sen, Naresh; Mishra, Manoj; Khan, Feroz; Meena, Abha; Sharma, Ashok

    2009-07-27

    Despite considerable efforts to date, DNA motif prediction in whole genomes remains a challenge for researchers. Currently, genome-wide motif prediction tools require either a direct pattern sequence (for a single motif) or a weight matrix (for multiple motifs). Although there are known motif pattern databases and tools for genome-level prediction, there is no tool for weight matrix construction. Considering this, we developed the D-MATRIX tool, which predicts the different types of weight matrix based on a user-defined aligned motif sequence set and motif width. For retrieval of known motif sequences, the user can access commonly used databases such as TFD, RegulonDB, DBTBS and Transfac. The D-MATRIX program uses a simple statistical approach for weight matrix construction, and the matrix can be converted into different file formats according to user requirements. It provides the possibility to identify conserved motifs in co-regulated genes or the whole genome. As an example, we successfully constructed the weight matrix of the LexA transcription factor binding site with the help of known sos-box cis-regulatory elements in the Deinococcus radiodurans genome. The algorithm is implemented in C-Sharp and wrapped in ASP.Net to maintain a user-friendly web interface. The D-MATRIX tool is accessible through the CIMAP domain network. http://203.190.147.116/dmatrix/

  17. A motif detection and classification method for peptide sequences using genetic programming.

    PubMed

    Tomita, Yasuyuki; Kato, Ryuji; Okochi, Mina; Honda, Hiroyuki

    2008-08-01

    An exploration of common rules (property motifs) in amino acid sequences has been required for the design of novel sequences and elucidation of the interactions between molecules controlled by the structural or physical environment. In the present study, we developed a new method to search for property motifs that are common in peptide sequence data. Our method has the following two characteristics: (i) the automatic determination of the position and length of common property motifs by calculating the physicochemical similarity of amino acids, and (ii) the quick and effective exploration of motif candidates that discriminate the positives from the negatives through the introduction of genetic programming (GP). Our method was evaluated on two types of model data sets. First, property motifs that had been intentionally buried in artificially derived peptide data were searched for. As a result, the expected property motifs were correctly extracted by our algorithm. Second, peptide data that interact with MHC class II molecules were analyzed as one of the models of biologically active peptides with buried motifs of various lengths. Using the rule found by our method, twice as many MHC class II binding peptides were identified as with the existing scoring matrix method. In conclusion, our GP-based motif searching approach made it possible to obtain knowledge of functional aspects of the peptides without any prior knowledge.

  18. An algorithm of discovering signatures from DNA databases on a computer cluster.

    PubMed

    Lee, Hsiao Ping; Sheu, Tzu-Fang

    2014-10-05

    Signatures are short sequences that are unique and not similar to any other sequence in a database that can be used as the basis to identify different species. Even though several signature discovery algorithms have been proposed in the past, these algorithms require the entirety of databases to be loaded in the memory, thus restricting the amount of data that they can process. It makes those algorithms unable to process databases with large amounts of data. Also, those algorithms use sequential models and have slower discovery speeds, meaning that the efficiency can be improved. In this research, we are debuting the utilization of a divide-and-conquer strategy in signature discovery and have proposed a parallel signature discovery algorithm on a computer cluster. The algorithm applies the divide-and-conquer strategy to solve the problem posed to the existing algorithms where they are unable to process large databases and uses a parallel computing mechanism to effectively improve the efficiency of signature discovery. Even when run with just the memory of regular personal computers, the algorithm can still process large databases such as the human whole-genome EST database which were previously unable to be processed by the existing algorithms. The algorithm proposed in this research is not limited by the amount of usable memory and can rapidly find signatures in large databases, making it useful in applications such as Next Generation Sequencing and other large database analysis and processing. The implementation of the proposed algorithm is available at http://www.cs.pu.edu.tw/~fang/DDCSDPrograms/DDCSD.htm.

  19. Protein Structure Determination by Assembling Super-Secondary Structure Motifs Using Pseudocontact Shifts.

    PubMed

    Pilla, Kala Bharath; Otting, Gottfried; Huber, Thomas

    2017-03-07

    Computational and nuclear magnetic resonance hybrid approaches provide efficient tools for 3D structure determination of small proteins, but currently available algorithms struggle to perform with larger proteins. Here we demonstrate a new computational algorithm that assembles the 3D structure of a protein from its constituent super-secondary structural motifs (Smotifs) with the help of pseudocontact shift (PCS) restraints for backbone amide protons, where the PCSs are produced from different metal centers. The algorithm, DINGO-PCS (3D assembly of Individual Smotifs to Near-native Geometry as Orchestrated by PCSs), employs the PCSs to recognize, orient, and assemble the constituent Smotifs of the target protein without any other experimental data or computational force fields. Using a universal Smotif database, the DINGO-PCS algorithm exhaustively enumerates any given Smotif. We benchmarked the program against ten different protein targets ranging from 100 to 220 residues with different topologies. For nine of these targets, the method was able to identify near-native Smotifs. Copyright © 2017 Elsevier Ltd. All rights reserved.

  20. Discovering Motifs in Biological Sequences Using the Micron Automata Processor.

    PubMed

    Roy, Indranil; Aluru, Srinivas

    2016-01-01

    Finding approximately conserved sequences, called motifs, across multiple DNA or protein sequences is an important problem in computational biology. In this paper, we consider the (l, d) motif search problem of identifying one or more motifs of length l present in at least q of the n given sequences, with each occurrence differing from the motif in at most d substitutions. The problem is known to be NP-complete, and the largest solved instance reported to date is (26,11). We propose a novel algorithm for the (l,d) motif search problem using streaming execution over a large set of non-deterministic finite automata (NFA). This solution is designed to take advantage of the micron automata processor, a new technology close to deployment that can simultaneously execute multiple NFA in parallel. We demonstrate the capability for solving much larger instances of the (l, d) motif search problem using the resources available within a single automata processor board, by estimating run-times for problem instances (39,18) and (40,17). The paper serves as a useful guide to solving problems using this new accelerator technology.
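
    The (l, d) motif search problem as stated here can be made concrete with a brute-force Python sketch. It enumerates every candidate of length l over the alphabet and keeps those that occur, within at most d substitutions, in at least q of the sequences. This is exponential in l and is only meant to pin down the problem definition; it is not the automata-based method of the paper, and the function names and toy sequences are made up.

```python
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(motif, sequence, d):
    """True if some window of `sequence` is within d substitutions of `motif`."""
    l = len(motif)
    return any(
        hamming(motif, sequence[i:i + l]) <= d
        for i in range(len(sequence) - l + 1)
    )


def ld_motifs(sequences, l, d, q=None, alphabet="ACGT"):
    """All length-l motifs present (up to d substitutions) in >= q sequences."""
    if q is None:
        q = len(sequences)  # default: the motif must occur in every sequence
    found = []
    for candidate in product(alphabet, repeat=l):
        motif = "".join(candidate)
        support = sum(occurs_within(motif, s, d) for s in sequences)
        if support >= q:
            found.append(motif)
    return found


if __name__ == "__main__":
    toy = ["TTACGTTT", "GGACCTGG", "CCAGGTCC"]  # ACGT planted with <= 1 change
    print(ld_motifs(toy, l=4, d=1))
```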

  1. Automated Discovery of Long Intergenic RNAs Associated with Breast Cancer Progression

    DTIC Science & Technology

    2012-02-01

    manuscript in preparation), (2) development and publication of an algorithm for detecting gene fusions in RNA-Seq data [1], and (3) discovery of outlier long...subjected to de novo assembly algorithms to discover novel transcripts representing either unannotated genes or novel somatic mutations such as gene...fusions. To this end the P.I. developed and published a novel algorithm called ChimeraScan to facilitate the discovery and validation of gene

  2. DataWarrior: an open-source program for chemistry aware data visualization and analysis.

    PubMed

    Sander, Thomas; Freyss, Joel; von Korff, Modest; Rufener, Christian

    2015-02-23

    Drug discovery projects in the pharmaceutical industry accumulate thousands of chemical structures and tens of thousands of data points from a dozen or more biological and pharmacological assays. A sufficient interpretation of the data requires understanding which molecular families are present, which structural motifs correlate with measured properties, and which tiny structural changes cause large property changes. Data visualization and analysis software with sufficient chemical intelligence to support chemists in this task is rare. In an attempt to contribute to filling the gap, we released our in-house developed chemistry aware data analysis program DataWarrior for free public use. This paper gives an overview of DataWarrior's functionality and architecture. As an example, a new unsupervised, 2-dimensional scaling algorithm is presented, which employs vector-based or nonvector-based descriptors to visualize the chemical or pharmacophore space of even large data sets. DataWarrior uses this method to interactively explore chemical space, activity landscapes, and activity cliffs.

  3. MotifMark: Finding regulatory motifs in DNA sequences.

    PubMed

    Hassanzadeh, Hamid Reza; Kolhe, Pushkar; Isbell, Charles L; Wang, May D

    2017-07-01

    The interaction between proteins and DNA is a key driving force in a significant number of biological processes such as transcriptional regulation, repair, recombination, splicing, and DNA modification. The identification of DNA-binding sites and the specificity of target proteins in binding to these regions are two important steps in understanding the mechanisms of these biological activities. A number of high-throughput technologies have recently emerged that try to quantify the affinity between proteins and DNA motifs. Despite their success, these technologies have their own limitations and fall short in precise characterization of motifs, and as a result, require further downstream analysis to extract useful and interpretable information from a haystack of noisy and inaccurate data. Here we propose MotifMark, a new algorithm based on graph theory and machine learning, that can find binding sites on candidate probes and rank their specificity in regard to the underlying transcription factor. We developed a pipeline to analyze experimental data derived from compact universal protein binding microarrays and benchmarked it against two of the most accurate motif search methods. Our results indicate that MotifMark can be a viable alternative technique for prediction of motif from protein binding microarrays and possibly other related high-throughput techniques.

  4. New structures of Fe3S for rare-earth-free permanent magnets

    NASA Astrophysics Data System (ADS)

    Yu, Shu; Zhao, Xin; Wu, Shunqing; Nguyen, Manh Cuong; Zhu, Zi-zhong; Wang, Cai-Zhuang; Ho, Kai-Ming

    2018-02-01

    We applied an adaptive genetic algorithm (AGA) to search for low-energy crystal structures of Fe3S. A number of structures with energies lower than that of the experimentally reported Pnma and I-4 structures have been obtained from our AGA searches. These low-energy structures can be classified as layer-motif and column-motif structures. In the column-motif structures, Fe atoms self-assemble into rods with a bcc type of underlying lattice, which are separated by the holes terminated by S atoms. In the layer-motif structures, the bulk Fe is broken into slabs of several layers passivated by S atoms. Magnetic property calculations showed that the column-motif structures exhibit reasonably high uniaxial magnetic anisotropy. In addition, we examined the effect of Co doping to Fe3S and found that magnetic anisotropy can be enhanced through Co doping.

  5. Methods and statistics for combining motif match scores.

    PubMed

    Bailey, T L; Gribskov, M

    1998-01-01

    Position-specific scoring matrices are useful for representing and searching for protein sequence motifs. A sequence family can often be described by a group of one or more motifs, and an effective search must combine the scores for matching a sequence to each of the motifs in the group. We describe three methods for combining match scores and estimating the statistical significance of the combined scores and evaluate the search quality (classification accuracy) and the accuracy of the estimate of statistical significance of each. The three methods are: 1) sum of scores, 2) sum of reduced variates, 3) product of score p-values. We show that method 3) is superior to the other two methods in both regards, and that combining motif scores indeed gives better search accuracy. The MAST sequence homology search algorithm utilizing the product of p-values scoring method is available for interactive use and downloading at URL http://www.sdsc.edu/MEME.
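
    As a rough illustration of method 3), the sketch below computes the significance of a product of p-values using the standard closed form for n independent, uniformly distributed p-values. The function name `product_pvalue` is hypothetical, and whether MAST uses exactly this formula is an assumption, not something the abstract states.

```python
import math

def product_pvalue(pvalues):
    """P-value of the product of n independent uniform(0, 1] p-values.

    Uses P(product <= p) = p * sum_{i=0}^{n-1} (-ln p)^i / i!.
    Illustrative only; inputs must lie in (0, 1].
    """
    n = len(pvalues)
    if n == 0:
        raise ValueError("need at least one p-value")
    log_prod = sum(math.log(p) for p in pvalues)  # ln of the product, <= 0
    x = -log_prod
    term, total = 1.0, 1.0
    for i in range(1, n):
        term *= x / i          # running value of x^i / i!
        total += term
    return math.exp(log_prod) * total
```

    With a single motif the function returns the p-value unchanged; with several motifs the combined value is larger than the raw product, which corrects for the fact that a product of many p-values is small by construction.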

  6. Mutually Exclusive Formation of G-Quadruplex and i-Motif Is a General Phenomenon Governed by Steric Hindrance in Duplex DNA.

    PubMed

    Cui, Yunxi; Kong, Deming; Ghimire, Chiran; Xu, Cuixia; Mao, Hanbin

    2016-04-19

    G-Quadruplex and i-motif are tetraplex structures that may form in opposite strands at the same location of a duplex DNA. Recent discoveries have indicated that the two tetraplex structures can have conflicting biological activities, which poses a challenge for cells to coordinate. Here, by performing innovative population analysis on mechanical unfolding profiles of tetraplex structures in double-stranded DNA, we found that formations of G-quadruplex and i-motif in the two complementary strands are mutually exclusive in a variety of DNA templates, which include human telomere and promoter fragments of hINS and hTERT genes. To explain this behavior, we placed G-quadruplex- and i-motif-hosting sequences in an offset fashion in the two complementary telomeric DNA strands. We found simultaneous formation of the G-quadruplex and i-motif in opposite strands, suggesting that mutual exclusivity between the two tetraplexes is controlled by steric hindrance. This conclusion was corroborated in the BCL-2 promoter sequence, in which simultaneous formation of two tetraplexes was observed due to possible offset arrangements between G-quadruplex and i-motif in opposite strands. The mutual exclusivity revealed here sets a molecular basis for cells to efficiently coordinate opposite biological activities of G-quadruplex and i-motif at the same dsDNA location.

  7. Discovery of a Regulatory Motif for Human Satellite DNA Transcription in Response to BATF2 Overexpression.

    PubMed

    Bai, Xuejia; Huang, Wenqiu; Zhang, Chenguang; Niu, Jing; Ding, Wei

    2016-03-01

    One of the basic leucine zipper transcription factors, BATF2, has been found to suppress cancer growth and migration. However, little is known about the genes downstream of BATF2. HeLa cells were stably transfected with BATF2, then chromatin immunoprecipitation-sequencing was employed to identify the DNA motifs responsive to BATF2. Comprehensive bioinformatics analyses indicated that the most significant motif discovered as TTCCATT[CT]GATTCCATTC[AG]AT was primarily distributed among the chromosome centromere regions and mostly within human type II satellite DNA. Such motifs were able to prime the transcription of type II satellite DNA in a directional and asymmetrical manner. Consistently, satellite II transcription was up-regulated in BATF2-overexpressing cells. The present study provides insight into understanding the role of BATF2 in tumours and the importance of satellite DNA in the maintenance of genomic stability.
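
    The reported motif is written in a form that most regular-expression engines accept directly. The following is a minimal sketch of how one might locate it on both strands of a sequence; the helper names `reverse_complement` and `find_motif` are hypothetical and not part of the study's pipeline.

```python
import re

# Motif string as quoted in the abstract; bracketed positions are alternatives.
MOTIF = "TTCCATT[CT]GATTCCATTC[AG]AT"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def find_motif(seq, pattern=MOTIF):
    """Return sorted (start, strand, match) tuples, start in forward coordinates."""
    seq = seq.upper()
    hits = [(m.start(), "+", m.group()) for m in re.finditer(pattern, seq)]
    rc = reverse_complement(seq)
    for m in re.finditer(pattern, rc):
        # Map the position on the reverse strand back to the forward strand.
        hits.append((len(seq) - m.end(), "-", m.group()))
    return sorted(hits)
```

    Note that `re.finditer` reports non-overlapping matches only, which is adequate for a 21-nt motif but would undercount overlapping hits of shorter patterns.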

  8. iFORM: Incorporating Find Occurrence of Regulatory Motifs.

    PubMed

    Ren, Chao; Chen, Hebing; Yang, Bite; Liu, Feng; Ouyang, Zhangyi; Bo, Xiaochen; Shu, Wenjie

    2016-01-01

    Accurately identifying the binding sites of transcription factors (TFs) is crucial to understanding the mechanisms of transcriptional regulation and human disease. We present incorporating Find Occurrence of Regulatory Motifs (iFORM), an easy-to-use and efficient tool for scanning DNA sequences with TF motifs described as position weight matrices (PWMs). Both performance assessment with a receiver operating characteristic (ROC) curve and a correlation-based approach demonstrated that iFORM achieves higher accuracy and sensitivity by integrating five classical motif discovery programs using Fisher's combined probability test. We have used iFORM to provide accurate results on a variety of data in the ENCODE Project and the NIH Roadmap Epigenomics Project, and the tool has demonstrated its utility in further elucidating individual roles of functional elements. Both the source and binary codes for iFORM can be freely accessed at https://github.com/wenjiegroup/iFORM. The identified TF binding sites across human cell and tissue types using iFORM have been deposited in the Gene Expression Omnibus under the accession ID GSE53962.
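
    To make the idea of scanning DNA with a position weight matrix concrete, here is a minimal sketch. It is not iFORM's implementation and does not reproduce its integration of five programs; the function names, the pseudocount, and the uniform background of 0.25 are all assumptions chosen for illustration.

```python
import math

BASES = "ACGT"

def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Turn per-position base counts into a log2-odds PWM (list of dicts)."""
    pwm = []
    for column in counts:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        pwm.append({
            b: math.log2((column[b] + pseudocount) / total / background)
            for b in BASES
        })
    return pwm

def scan(seq, pwm, threshold):
    """Return (start, score) for every window scoring at or above threshold."""
    seq = seq.upper()
    width = len(pwm)
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        if any(c not in BASES for c in window):
            continue  # skip windows containing N or other ambiguity codes
        score = sum(pwm[j][c] for j, c in enumerate(window))
        if score >= threshold:
            hits.append((i, score))
    return hits
```

    A real scanner would also search the reverse strand and convert scores to p-values before any combination step such as Fisher's test.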

  9. Neighbor Discovery Algorithm in Wireless Local Area Networks Using Multi-beam Directional Antennas

    NASA Astrophysics Data System (ADS)

    Wang, Jin; Peng, Wei; Liu, Song

    2017-10-01

    Neighbor discovery is an important step for Wireless Local Area Networks (WLAN), and the use of multi-beam directional antennas can greatly improve network performance. However, most neighbor discovery algorithms for WLANs based on multi-beam directional antennas work effectively only in synchronous systems, not in asynchronous ones, and collisions at the AP remain a bottleneck for neighbor discovery. In this paper, we propose two asynchronous neighbor discovery algorithms: asynchronous hierarchical scanning (AHS) and asynchronous directional scanning (ADS). Both are based on a three-way handshaking mechanism. AHS and ADS reduce collisions at the AP in a hierarchical way and a directional way, respectively. The performance of AHS and ADS is tested on OMNeT++, and different application scenarios and the factors that affect the performance of these algorithms are analyzed. The simulation results show that AHS is suitable for scenarios that are densely populated around the AP, while ADS is suitable when most of the neighboring nodes are far from the AP.

  10. Genome-Wide Identification of Mitogen-Activated Protein Kinase Gene Family across Fungal Lineage Shows Presence of Novel and Diverse Activation Loop Motifs

    PubMed Central

    Mohanta, Tapan Kumar; Mohanta, Nibedita; Parida, Pratap; Panda, Sujogya Kumar; Ponpandian, Lakshmi Narayanan; Bae, Hanhong

    2016-01-01

    The mitogen-activated protein kinase (MAPK) is characterized by the presence of the T-E-Y, T-D-Y, and T-G-Y motifs in its activation loop region and plays a significant role in regulating diverse cellular responses in eukaryotic organisms. Availability of large-scale genome data in the fungal kingdom encouraged us to identify and analyse the fungal MAPK gene family across 173 fungal species. The analysis of the MAPK gene family resulted in the discovery of several novel activation loop motifs (T-T-Y, T-I-Y, T-N-Y, T-H-Y, T-S-Y, K-G-Y, T-Q-Y, S-E-Y and S-D-Y) in fungal MAPKs. The phylogenetic analysis suggests that fungal MAPKs are non-polymorphic, evolved from their common ancestors around 1500 million years ago, and are distantly related to plant MAPKs. We are the first to report the presence of nine novel activation loop motifs in fungal MAPKs. The specificity of the activation loop motif plays a significant role in controlling different growth and stress related pathways in fungi. Hence, the presence of these nine novel activation loop motifs in fungi is of special interest. PMID:26918378

  11. QuadBase2: web server for multiplexed guanine quadruplex mining and visualization

    PubMed Central

    Dhapola, Parashar; Chowdhury, Shantanu

    2016-01-01

    DNA guanine quadruplexes or G4s are non-canonical DNA secondary structures which affect genomic processes like replication, transcription and recombination. G4s are computationally identified by specific nucleotide motifs which are also called putative G4 (PG4) motifs. Despite the general relevance of these structures, there is currently no tool available that can allow batch queries and genome-wide analysis of these motifs in a user-friendly interface. QuadBase2 (quadbase.igib.res.in) presents a completely reinvented web server version of previously published QuadBase database. QuadBase2 enables users to mine PG4 motifs in up to 178 eukaryotes through the EuQuad module. This module interfaces with Ensembl Compara database, to allow users mine PG4 motifs in the orthologues of genes of interest across eukaryotes. PG4 motifs can be mined across genes and their promoter sequences in 1719 prokaryotes through ProQuad module. This module includes a feature that allows genome-wide mining of PG4 motifs and their visualization as circular histograms. TetraplexFinder, the module for mining PG4 motifs in user-provided sequences is now capable of handling up to 20 MB of data. QuadBase2 is a comprehensive PG4 motif mining tool that further expands the configurations and algorithms for mining PG4 motifs in a user-friendly way. PMID:27185890
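
    The abstract does not spell out the PG4 pattern, so the sketch below assumes a widely used definition: four runs of at least three guanines separated by loops of 1 to 7 nucleotides. The function names and default parameters are hypothetical and are not QuadBase2's actual configuration.

```python
import re

def pg4_pattern(min_g=3, loop_min=1, loop_max=7):
    """Compile a regex for four G-runs separated by three loops (assumed definition)."""
    stem = "G{%d,}" % min_g
    loop = "[ACGT]{%d,%d}" % (loop_min, loop_max)
    return re.compile(stem + (loop + stem) * 3)

def find_pg4(seq, pattern=None):
    """Return (start, end, match) for non-overlapping PG4 hits on the given strand."""
    pattern = pattern or pg4_pattern()
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(seq.upper())]
```

    Only the supplied strand is searched here; motifs on the complementary strand appear as C-runs and would require scanning the reverse complement as well.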

  12. Gibbs motif sampling: detection of bacterial outer membrane protein repeats.

    PubMed Central

    Neuwald, A. F.; Liu, J. S.; Lawrence, C. E.

    1995-01-01

    The detection and alignment of locally conserved regions (motifs) in multiple sequences can provide insight into protein structure, function, and evolution. A new Gibbs sampling algorithm is described that detects motif-encoding regions in sequences and optimally partitions them into distinct motif models; this is illustrated using a set of immunoglobulin fold proteins. When applied to sequences sharing a single motif, the sampler can be used to classify motif regions into related submodels, as is illustrated using helix-turn-helix DNA-binding proteins. Other statistically based procedures are described for searching a database for sequences matching motifs found by the sampler. When applied to a set of 32 very distantly related bacterial integral outer membrane proteins, the sampler revealed that they share a subtle, repetitive motif. Although BLAST (Altschul SF et al., 1990, J Mol Biol 215:403-410) fails to detect significant pairwise similarity between any of the sequences, the repeats present in these outer membrane proteins, taken as a whole, are highly significant (based on a generally applicable statistical test for motifs described here). Analysis of bacterial porins with known trimeric beta-barrel structure and related proteins reveals a similar repetitive motif corresponding to alternating membrane-spanning beta-strands. These beta-strands occur on the membrane interface (as opposed to the trimeric interface) of the beta-barrel. The broad conservation and structural location of these repeats suggests that they play important functional roles. PMID:8520488

  13. Polyamine conjugation of curcumin analogues toward the discovery of mitochondria-directed neuroprotective agents.

    PubMed

    Simoni, Elena; Bergamini, Christian; Fato, Romana; Tarozzi, Andrea; Bains, Sandip; Motterlini, Roberto; Cavalli, Andrea; Bolognesi, Maria Laura; Minarini, Anna; Hrelia, Patrizia; Lenaz, Giorgio; Rosini, Michela; Melchiorre, Carlo

    2010-10-14

    Mitochondria-directed antioxidants 2-5 were designed by conjugating curcumin congeners with different polyamine motifs as vehicle tools. The conjugates emerged as efficient antioxidants in mitochondria and fibroblasts and also exerted a protecting role through heme oxygenase-1 activation. Notably, the insertion of a polyamine function into the curcumin-like moiety allowed an efficient intracellular uptake and mitochondria targeting. It also resulted in a significant decrease in the cytotoxicity effects. 2-5 are therefore promising molecules for neuroprotectant lead discovery.

  14. Improving the Accuracy and Scalability of Discriminative Learning Methods for Markov Logic Networks

    DTIC Science & Technology

    2011-05-01

    9 2.2 Inductive Logic Programming and Aleph . . . . . . . . . . . . 10 2.3 MLNs and Alchemy ...positive examples. Aleph allows users to customize each of 10 these steps, and thereby supports a variety of specific algorithms. 2.3 MLNs and Alchemy An...tural motifs. By limiting the search to each unique motif, LSM is able to find good clauses in an efficient manner. Alchemy (Kok, Singla, Richardson

  15. Interaction Analysis through Proteomic Phage Display

    PubMed Central

    2014-01-01

    Phage display is a powerful technique for profiling specificities of peptide binding domains. The method is suited for the identification of high-affinity ligands with inhibitor potential when using highly diverse combinatorial peptide phage libraries. Such experiments further provide consensus motifs for genome-wide scanning of ligands of potential biological relevance. A complementary but considerably less explored approach is to display expression products of genomic DNA, cDNA, open reading frames (ORFs), or oligonucleotide libraries designed to encode defined regions of a target proteome on phage particles. One of the main applications of such proteomic libraries has been the elucidation of antibody epitopes. This review is focused on the use of proteomic phage display to uncover protein-protein interactions of potential relevance for cellular function. The method is particularly suited for the discovery of interactions between peptide binding domains and their targets. We discuss the largely unexplored potential of this method in the discovery of domain-motif interactions of potential biological relevance. PMID:25295249

  16. RSAT 2018: regulatory sequence analysis tools 20th anniversary.

    PubMed

    Nguyen, Nga Thi Thuy; Contreras-Moreira, Bruno; Castro-Mondragon, Jaime A; Santana-Garcia, Walter; Ossio, Raul; Robles-Espinoza, Carla Daniela; Bahin, Mathieu; Collombet, Samuel; Vincens, Pierre; Thieffry, Denis; van Helden, Jacques; Medina-Rivera, Alejandra; Thomas-Chollier, Morgane

    2018-05-02

    RSAT (Regulatory Sequence Analysis Tools) is a suite of modular tools for the detection and the analysis of cis-regulatory elements in genome sequences. Its main applications are (i) motif discovery, including from genome-wide datasets like ChIP-seq/ATAC-seq, (ii) motif scanning, (iii) motif analysis (quality assessment, comparisons and clustering), (iv) analysis of regulatory variations, (v) comparative genomics. Six public servers jointly support 10 000 genomes from all kingdoms. Six novel or refactored programs have been added since the 2015 NAR Web Software Issue, including updated programs to analyse regulatory variants (retrieve-variation-seq, variation-scan, convert-variations), along with tools to extract sequences from a list of coordinates (retrieve-seq-bed), to select motifs from motif collections (retrieve-matrix), and to extract orthologs based on Ensembl Compara (get-orthologs-compara). Three use cases illustrate the integration of new and refactored tools to the suite. This Anniversary update gives a 20-year perspective on the software suite. RSAT is well-documented and available through Web sites, SOAP/WSDL (Simple Object Access Protocol/Web Services Description Language) web services, virtual machines and stand-alone programs at http://www.rsat.eu/.

  17. GPUmotif: An Ultra-Fast and Energy-Efficient Motif Analysis Program Using Graphics Processing Units

    PubMed Central

    Zandevakili, Pooya; Hu, Ming; Qin, Zhaohui

    2012-01-01

    Computational detection of TF binding patterns has become an indispensable tool in functional genomics research. With the rapid advance of new sequencing technologies, large amounts of protein-DNA interaction data have been produced. Analyzing this data can provide substantial insight into the mechanisms of transcriptional regulation. However, the massive amount of sequence data presents daunting challenges. In our previous work, we have developed a novel algorithm called Hybrid Motif Sampler (HMS) that enables more scalable and accurate motif analysis. Despite much improvement, HMS is still time-consuming due to the requirement to calculate matching probabilities position-by-position. Using the NVIDIA CUDA toolkit, we developed a graphics processing unit (GPU)-accelerated motif analysis program named GPUmotif. We proposed a "fragmentation" technique to hide data transfer time between memories. Performance comparison studies showed that commonly-used model-based motif scan and de novo motif finding procedures such as HMS can be dramatically accelerated when running GPUmotif on NVIDIA graphics cards. As a result, energy consumption can also be greatly reduced when running motif analysis using GPUmotif. The GPUmotif program is freely available at http://sourceforge.net/projects/gpumotif/ PMID:22662128

  18. GPUmotif: an ultra-fast and energy-efficient motif analysis program using graphics processing units.

    PubMed

    Zandevakili, Pooya; Hu, Ming; Qin, Zhaohui

    2012-01-01

    Computational detection of TF binding patterns has become an indispensable tool in functional genomics research. With the rapid advance of new sequencing technologies, large amounts of protein-DNA interaction data have been produced. Analyzing this data can provide substantial insight into the mechanisms of transcriptional regulation. However, the massive amount of sequence data presents daunting challenges. In our previous work, we have developed a novel algorithm called Hybrid Motif Sampler (HMS) that enables more scalable and accurate motif analysis. Despite much improvement, HMS is still time-consuming due to the requirement to calculate matching probabilities position-by-position. Using the NVIDIA CUDA toolkit, we developed a graphics processing unit (GPU)-accelerated motif analysis program named GPUmotif. We proposed a "fragmentation" technique to hide data transfer time between memories. Performance comparison studies showed that commonly-used model-based motif scan and de novo motif finding procedures such as HMS can be dramatically accelerated when running GPUmotif on NVIDIA graphics cards. As a result, energy consumption can also be greatly reduced when running motif analysis using GPUmotif. The GPUmotif program is freely available at http://sourceforge.net/projects/gpumotif/

  19. Unitary circular code motifs in genomes of eukaryotes.

    PubMed

    El Soufi, Karim; Michel, Christian J

    A set X of 20 trinucleotides was identified in genes of bacteria, eukaryotes, plasmids and viruses, which has in average the highest occurrence in reading frame compared to its two shifted frames (Michel, 2015; Arquès and Michel, 1996). This set X has an interesting mathematical property as X is a circular code (Arquès and Michel, 1996). Thus, the motifs from this circular code X, called X motifs, have the property to always retrieve, synchronize and maintain the reading frame in genes. The origin of this circular code X in genes is an open problem since its discovery in 1996. Here, we first show that the unitary circular codes (UCC), i.e. sets of one word, allow to generate unitary circular code motifs (UCC motifs), i.e. a concatenation of the same motif (simple repeats) leading to low complexity DNA. Three classes of UCC motifs are studied here: repeated dinucleotides (D+ motifs), repeated trinucleotides (T+ motifs) and repeated tetranucleotides (T+ motifs). Thus, the D+, T+ and T+ motifs allow to retrieve, synchronize and maintain a frame modulo 2, modulo 3 and modulo 4, respectively, and their shifted frames (1 modulo 2; 1 and 2 modulo 3; 1, 2 and 3 modulo 4 according to the C2, C3 and C4 properties, respectively) in the DNA sequences. The statistical distribution of the D+, T+ and T+ motifs is analyzed in the genomes of eukaryotes. A UCC motif and its complementary UCC motif have the same distribution in the eukaryotic genomes. Furthermore, a UCC motif and its complementary UCC motif have increasing occurrences contrary to their number of hydrogen bonds, very significant with the T+ motifs. The longest D+, T+ and T+ motifs in the studied eukaryotic genomes are also given. Surprisingly, a scarcity of repeated trinucleotides (T+ motifs) in the large eukaryotic genomes is observed compared to the D+ and T+ motifs. This result has been investigated and may be explained by two outcomes. Repeated trinucleotides (T+ motifs) are identified in the X motifs of low composition (cardinality less than 10) in the genomes of eukaryotes. Furthermore, identical trinucleotide pairs of the circular code X are preferentially used in the gene sequences of eukaryotes. These two results suggest that the unitary circular codes of trinucleotides may have been involved in the formation of the trinucleotide circular code X. Indeed, repeated trinucleotides in the X motifs in the genomes of eukaryotes may represent an intermediary evolution from repeated trinucleotides of cardinality 1 (T+ motifs) in the genomes of eukaryotes up to the X motifs of cardinality 20 in the gene sequences of eukaryotes.

  20. BlockLogo: visualization of peptide and sequence motif conservation

    PubMed Central

    Olsen, Lars Rønn; Kudahl, Ulrich Johan; Simon, Christian; Sun, Jing; Schönbach, Christian; Reinherz, Ellis L.; Zhang, Guang Lan; Brusic, Vladimir

    2013-01-01

    BlockLogo is a web-server application for visualization of protein and nucleotide fragments, continuous protein sequence motifs, and discontinuous sequence motifs using calculation of block entropy from multiple sequence alignments. The user input consists of a multiple sequence alignment, selection of motif positions, type of sequence, and output format definition. The output has BlockLogo along with the sequence logo, and a table of motif frequencies. We deployed BlockLogo as an online application and have demonstrated its utility through examples that show visualization of T-cell epitopes and B-cell epitopes (both continuous and discontinuous). Our additional example shows a visualization and analysis of structural motifs that determine specificity of peptide binding to HLA-DR molecules. The BlockLogo server also employs selected experimentally validated prediction algorithms to enable on-the-fly prediction of MHC binding affinity to 15 common HLA class I and class II alleles as well as visual analysis of discontinuous epitopes from multiple sequence alignments. It enables the visualization and analysis of structural and functional motifs that are usually described as regular expressions. It provides a compact view of discontinuous motifs composed of distant positions within biological sequences. BlockLogo is available at: http://research4.dfci.harvard.edu/cvc/blocklogo/ and http://methilab.bu.edu/blocklogo/ PMID:24001880
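
    One plausible reading of "block entropy" is the Shannon entropy of whole blocks, where a block is the string of residues found at the selected motif positions in each aligned sequence. The sketch below follows that reading; it is an assumption, the function names are hypothetical, and BlockLogo's exact definition may differ.

```python
import math
from collections import Counter

def block_frequencies(alignment, positions):
    """Frequency of each block (residues at the chosen columns) across the alignment."""
    blocks = ["".join(seq[p] for p in positions) for seq in alignment]
    counts = Counter(blocks)
    n = len(blocks)
    return {block: count / n for block, count in counts.items()}

def block_entropy(alignment, positions):
    """Shannon entropy in bits of the block distribution."""
    freqs = block_frequencies(alignment, positions)
    return -sum(f * math.log2(f) for f in freqs.values())
```

    Because positions are passed as an arbitrary list of column indices, the same functions cover both continuous and discontinuous motifs.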

  1. Detecting microsatellites within genomes: significant variation among algorithms.

    PubMed

    Leclercq, Sébastien; Rivals, Eric; Jarne, Philippe

    2007-04-18

    Microsatellites are short, tandemly-repeated DNA sequences which are widely distributed among genomes. Their structure, role and evolution can be analyzed based on exhaustive extraction from sequenced genomes. Several dedicated algorithms have been developed for this purpose. Here, we compared the detection efficiency of five of them (TRF, Mreps, Sputnik, STAR, and RepeatMasker). Our analysis was first conducted on the human X chromosome, and microsatellite distributions were characterized by microsatellite number, length, and divergence from a pure motif. The algorithms work with user-defined parameters, and we demonstrate that the parameter values chosen can strongly influence microsatellite distributions. The five algorithms were then compared by fixing parameters settings, and the analysis was extended to three other genomes (Saccharomyces cerevisiae, Neurospora crassa and Drosophila melanogaster) spanning a wide range of size and structure. Significant differences for all characteristics of microsatellites were observed among algorithms, but not among genomes, for both perfect and imperfect microsatellites. Striking differences were detected for short microsatellites (below 20 bp), regardless of motif. Since the algorithm used strongly influences empirical distributions, studies analyzing microsatellite evolution based on a comparison between empirical and theoretical size distributions should therefore be considered with caution. We also discuss why a typological definition of microsatellites limits our capacity to capture their genomic distributions.

  2. Detecting microsatellites within genomes: significant variation among algorithms

    PubMed Central

    Leclercq, Sébastien; Rivals, Eric; Jarne, Philippe

    2007-01-01

    Background: Microsatellites are short, tandemly-repeated DNA sequences which are widely distributed among genomes. Their structure, role and evolution can be analyzed based on exhaustive extraction from sequenced genomes. Several dedicated algorithms have been developed for this purpose. Here, we compared the detection efficiency of five of them (TRF, Mreps, Sputnik, STAR, and RepeatMasker). Results: Our analysis was first conducted on the human X chromosome, and microsatellite distributions were characterized by microsatellite number, length, and divergence from a pure motif. The algorithms work with user-defined parameters, and we demonstrate that the parameter values chosen can strongly influence microsatellite distributions. The five algorithms were then compared by fixing parameters settings, and the analysis was extended to three other genomes (Saccharomyces cerevisiae, Neurospora crassa and Drosophila melanogaster) spanning a wide range of size and structure. Significant differences for all characteristics of microsatellites were observed among algorithms, but not among genomes, for both perfect and imperfect microsatellites. Striking differences were detected for short microsatellites (below 20 bp), regardless of motif. Conclusion: Since the algorithm used strongly influences empirical distributions, studies analyzing microsatellite evolution based on a comparison between empirical and theoretical size distributions should therefore be considered with caution. We also discuss why a typological definition of microsatellites limits our capacity to capture their genomic distributions. PMID:17442102
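
    The point about user-defined parameters can be seen even with a naive detector of perfect repeats. The sketch below is not any of the five compared programs; the function name and the parameters `max_unit`, `min_copies` and `min_length` are hypothetical choices made for illustration.

```python
import re

def perfect_microsatellites(seq, max_unit=6, min_copies=3, min_length=8):
    """Return (start, end, unit) for perfect tandem repeats on one strand.

    The lazy quantifier makes the regex prefer the shortest repeating unit.
    Matches are non-overlapping, so a short repeat that is filtered out
    can mask a longer one that overlaps it.
    """
    seq = seq.upper()
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    hits = []
    for m in pattern.finditer(seq):
        if m.end() - m.start() >= min_length:
            hits.append((m.start(), m.end(), m.group(1)))
    return hits
```

    Raising `min_length` from 8 to 13 is enough to drop a 12-bp (CA)6 repeat from the output, which mirrors the sensitivity to thresholds that the study reports for short microsatellites.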

  3. Discovery of T Cell Receptor β Motifs Specific to HLA-B27-Positive Ankylosing Spondylitis by Deep Repertoire Sequence Analysis.

    PubMed

    Faham, Malek; Carlton, Victoria; Moorhead, Martin; Zheng, Jianbiao; Klinger, Mark; Pepin, Francois; Asbury, Thomas; Vignali, Marissa; Emerson, Ryan O; Robins, Harlan S; Ireland, James; Baechler-Gillespie, Emily; Inman, Robert D

    2017-04-01

    Ankylosing spondylitis (AS), a chronic inflammatory disorder, has a notable association with HLA-B27. One hypothesis suggests that a common antigen that binds to HLA-B27 is important for AS disease pathogenesis. This study was undertaken to determine sequences and motifs that are shared among HLA-B27-positive AS patients, using T cell repertoire next-generation sequencing. To identify motifs enriched among B27-positive AS patients, we performed T cell receptor β (TCRβ) repertoire sequencing on samples from 191 B27-positive AS patients, 43 B27-negative AS patients, and 227 controls, and we obtained >77 million TCRβ clonotype sequences. First, we assessed whether any of 50 previously published sequences were enriched in B27-positive AS patients. We then used training and test cohorts to identify discovered motifs that were enriched in B27-positive AS patients versus controls. Six previously published and 11 discovered motifs were enriched in the B27-positive AS samples as compared to controls. After combining motifs related by sequence, we identified a total of 15 independent motifs. Both the full set of 15 motifs and a set of 6 published motifs were enriched in the B27-positive AS patients as compared to B27-positive healthy individuals (P = 0.049 and P = 0.001, respectively). Using an independent cohort, we validated that at least some of these motifs were associated with AS, and not simply with B27-positive status. We identified TCRβ motifs that are enriched in B27-positive AS patients as compared to B27-positive healthy controls. This suggests that a common antigen, presented by HLA-B27 and detected by CD8+ T cells, may be associated with AS disease pathogenesis.

  4. Onco-Regulon: an integrated database and software suite for site specific targeting of transcription factors of cancer genes

    PubMed Central

    Tomar, Navneet; Mishra, Akhilesh; Mrinal, Nirotpal; Jayaram, B.

    2016-01-01

    Transcription factors (TFs) bind at multiple sites in the genome and regulate expression of many genes. Regulating TF binding in a gene specific manner remains a formidable challenge in drug discovery because the same binding motif may be present at multiple locations in the genome. Here, we present Onco-Regulon (http://www.scfbio-iitd.res.in/software/onco/NavSite/index.htm), an integrated database of regulatory motifs of cancer genes combined with Unique Sequence-Predictor (USP), a software suite that identifies unique sequences for each of these regulatory DNA motifs at the specified position in the genome. USP works by extending a given DNA motif in the 5′→3′ direction, the 3′→5′ direction, or both, adding one nucleotide at each step, and calculates the frequency of each extended motif in the genome with the Frequency Counter programme. This step is iterated until the frequency of the extended motif becomes unity in the genome. Thus, for each given motif, we get three possible unique sequences. The Closest Sequence Finder program predicts off-target drug binding in the genome. Inclusion of DNA-Protein structural information further makes Onco-Regulon a highly informative repository for gene specific drug development. We believe that Onco-Regulon will help researchers to design drugs which will bind to an exclusive site in the genome with, theoretically, no off-target effects. Database URL: http://www.scfbio-iitd.res.in/software/onco/NavSite/index.htm PMID:27515825
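
    The extension loop described for USP can be sketched as follows. This is a toy version under stated assumptions: it works on a single in-memory string, extends only toward the 3′ end of the given strand, and uses hypothetical function names; it is not the Onco-Regulon code.

```python
def count_occurrences(genome, motif):
    """Count occurrences of motif in genome, including overlapping ones."""
    count = 0
    start = genome.find(motif)
    while start != -1:
        count += 1
        start = genome.find(motif, start + 1)
    return count

def extend_until_unique(genome, start, end):
    """Grow genome[start:end] one base at a time until it occurs exactly once.

    Returns the unique extended sequence, or None if the end of the
    genome is reached while the sequence is still repeated.
    """
    while count_occurrences(genome, genome[start:end]) > 1:
        if end >= len(genome):
            return None
        end += 1
    return genome[start:end]
```

    Extending toward the 5′ end, or in both directions at once, would follow the same loop with `start` decremented instead of, or together with, `end`, which is how three candidate unique sequences per motif site could arise.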

  5. A roadmap for natural product discovery based on large-scale genomics and metabolomics

    USDA-ARS's Scientific Manuscript database

    Actinobacteria encode a wealth of natural product biosynthetic gene clusters, whose systematic study is complicated by numerous repetitive motifs. By combining several metrics we developed a method for global classification of these gene clusters into families (GCFs) and analyzed the biosynthetic ca...

  6. How Formal Methods Impels Discovery: A Short History of an Air Traffic Management Project

    NASA Technical Reports Server (NTRS)

    Butler, Ricky W.; Hagen, George; Maddalon, Jeffrey M.; Munoz, Cesar A.; Narkawicz, Anthony; Dowek, Gilles

    2010-01-01

    In this paper we describe a process of algorithmic discovery that was driven by our goal of achieving complete, mechanically verified algorithms that compute conflict prevention bands for use in en route air traffic management. The algorithms were originally defined in the PVS specification language and subsequently have been implemented in Java and C++. We do not present the proofs in this paper: instead, we describe the process of discovery and the key ideas that enabled the final formal proof of correctness.

  7. Evolution of Drosophila ribosomal protein gene core promoters.

    PubMed

    Ma, Xiaotu; Zhang, Kangyu; Li, Xiaoman

    2009-03-01

    The coordinated expression of ribosomal protein genes (RPGs) has been well documented in many species. Previous analyses of RPG promoters focus only on Fungi and mammals. Recognizing this gap and using a comparative genomics approach, we utilize a motif-finding algorithm that incorporates cross-species conservation to identify several significant motifs in Drosophila RPG promoters. As a result, significant differences of the enriched motifs in RPG promoter are found among Drosophila, Fungi, and mammals, demonstrating the evolutionary dynamics of the ribosomal gene regulatory network. We also report a motif present in similar numbers of RPGs among Drosophila species which does not appear to be conserved at the individual RPG gene level. A module-wise stabilizing selection theory is proposed to explain this observation. Overall, our results provide significant insight into the fast-evolving nature of transcriptional regulation in the RPG module.

  8. Evolution of Drosophila ribosomal protein gene core promoters

    PubMed Central

    Ma, Xiaotu; Zhang, Kangyu; Li, Xiaoman

    2011-01-01

    The coordinated expression of ribosomal protein genes (RPGs) has been well documented in many species. Previous analyses of RPG promoters focus only on Fungi and mammals. Recognizing this gap and using a comparative genomics approach, we utilize a motif-finding algorithm that incorporates cross-species conservation to identify several significant motifs in Drosophila RPG promoters. As a result, significant differences of the enriched motifs in RPG promoter are found among Drosophila, Fungi, and mammals, demonstrating the evolutionary dynamics of the ribosomal gene regulatory network. We also report a motif present in similar numbers of RPGs among Drosophila species which does not appear to be conserved at the individual RPG gene level. A module-wise stabilizing selection theory is proposed to explain this observation. Overall, our results provide significant insight into the fast-evolving nature of transcriptional regulation in the RPG module. PMID:19059316

  9. Automatic annotation of protein motif function with Gene Ontology terms.

    PubMed

    Lu, Xinghua; Zhai, Chengxiang; Gopalakrishnan, Vanathi; Buchanan, Bruce G

    2004-09-02

    Conserved protein sequence motifs are short stretches of amino acid sequence patterns that potentially encode the function of proteins. Several sequence pattern searching algorithms and programs exist for identifying candidate protein motifs at the whole genome level. However, a much needed and important task is to determine the functions of the newly identified protein motifs. The Gene Ontology (GO) project is an endeavor to annotate the function of genes or protein sequences with terms from a dynamic, controlled vocabulary and these annotations serve well as a knowledge base. This paper presents methods to mine the GO knowledge base and use the association between the GO terms assigned to a sequence and the motifs matched by the same sequence as evidence for predicting the functions of novel protein motifs automatically. The task of assigning GO terms to protein motifs is viewed as both a binary classification and information retrieval problem, where PROSITE motifs are used as samples for model training and functional prediction. The mutual information of a motif and a GO term association is found to be a very useful feature. We take advantage of the known motifs to train a logistic regression classifier, which allows us to combine mutual information with other frequency-based features and obtain a probability of correct association. The trained logistic regression model has intuitively meaningful and logically plausible parameter values, and performs very well empirically according to our evaluation criteria. In this research, different methods for automatic annotation of protein motifs have been investigated. Empirical results demonstrated that the methods have a great potential for detecting and augmenting information about the functions of newly discovered candidate protein motifs.
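
    As an illustration of the mutual-information feature mentioned above, the sketch below computes the standard mutual information (in bits) between two binary indicators over a set of sequences: whether a motif matches and whether a GO term is assigned. The abstract does not give the exact estimator the authors used, so the function name and this particular definition are assumptions.

```python
from math import log2


def mutual_information(motif_hits, go_hits):
    """Mutual information (bits) between two equal-length boolean lists.

    motif_hits[i]: does the motif match sequence i?
    go_hits[i]:    is the GO term assigned to sequence i?
    Hypothetical helper using the textbook definition
    I(M;G) = sum over (m, g) of p(m,g) * log2(p(m,g) / (p(m) * p(g))).
    """
    n = len(motif_hits)
    pairs = list(zip(motif_hits, go_hits))
    mi = 0.0
    for m in (False, True):
        for g in (False, True):
            joint = sum(1 for a, b in pairs if a == m and b == g) / n
            p_m = sum(1 for a in motif_hits if a == m) / n
            p_g = sum(1 for b in go_hits if b == g) / n
            if joint > 0:  # terms with zero joint probability contribute 0
                mi += joint * log2(joint / (p_m * p_g))
    return mi


# Motif and GO term always co-occur: one full bit of shared information.
coupled = mutual_information([True, True, False, False],
                             [True, True, False, False])
# Motif and GO term are statistically independent: zero bits.
independent = mutual_information([True, True, False, False],
                                 [True, False, True, False])
```

    A score like this could then be fed, alongside frequency-based features, into a classifier such as the logistic regression the abstract describes.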

  10. Exploring Convergent Evolution to Provide a Foundation for Protein Engineering

    DTIC Science & Technology

    2009-02-26

    ...the DivergentSet and MotifCluster Algorithms Using support from this grant, we developed two software packages that provide key infrastructure for...software package we developed, MotifCluster, provides a novel way of detecting distantly related homologs, one of the key aims of the proposal. Unlike

  11. Identification of 15 candidate structured noncoding RNA motifs in fungi by comparative genomics.

    PubMed

    Li, Sanshu; Breaker, Ronald R

    2017-10-13

    With the development of rapid and inexpensive DNA sequencing, the genome sequences of more than 100 fungal species have been made available. This dataset provides an excellent resource for comparative genomics analyses, which can be used to discover genetic elements, including noncoding RNAs (ncRNAs). Bioinformatics tools similar to those used to uncover novel ncRNAs in bacteria, likewise, should be useful for searching fungal genomic sequences, and the relative ease of genetic experiments with some model fungal species could facilitate experimental validation studies. We have adapted a bioinformatics pipeline for discovering bacterial ncRNAs to systematically analyze many fungal genomes. This comparative genomics pipeline integrates information on conserved RNA sequence and structural features with alternative splicing information to reveal fungal RNA motifs that are candidate regulatory domains, or that might have other possible functions. A total of 15 prominent classes of structured ncRNA candidates were identified, including variant HDV self-cleaving ribozyme representatives, atypical snoRNA candidates, and possible structured antisense RNA motifs. Candidate regulatory motifs were also found associated with genes for ribosomal proteins, S-adenosylmethionine decarboxylase (SDC), amidase, and HexA protein involved in Woronin body formation. We experimentally confirm that the variant HDV ribozymes undergo rapid self-cleavage, and we demonstrate that the SDC RNA motif reduces the expression of SAM decarboxylase by translational repression. Furthermore, we provide evidence that several other motifs discovered in this study are likely to be functional ncRNA elements. Systematic screening of fungal genomes using a computational discovery pipeline has revealed the existence of a variety of novel structured ncRNAs. 
    Genome contexts and similarities to known ncRNA motifs provide strong evidence for the biological and biochemical functions of some newly found ncRNA motifs. Although initial examinations of several motifs provide evidence for their likely functions, other motifs will require more in-depth analysis to reveal their functions.

  12. GibbsCluster: unsupervised clustering and alignment of peptide sequences.

    PubMed

    Andreatta, Massimo; Alvarez, Bruno; Nielsen, Morten

    2017-07-03

    Receptor interactions with short linear peptide fragments (ligands) are at the base of many biological signaling processes. Conserved and information-rich amino acid patterns, commonly called sequence motifs, shape and regulate these interactions. Because of the properties of a receptor-ligand system or of the assay used to interrogate it, experimental data often contain multiple sequence motifs. GibbsCluster is a powerful tool for unsupervised motif discovery because it can simultaneously cluster and align peptide data. The GibbsCluster 2.0 presented here is an improved version incorporating insertion and deletions accounting for variations in motif length in the peptide input. In basic terms, the program takes as input a set of peptide sequences and clusters them into meaningful groups. It returns the optimal number of clusters it identified, together with the sequence alignment and sequence motif characterizing each cluster. Several parameters are available to customize cluster analysis, including adjustable penalties for small clusters and overlapping groups and a trash cluster to remove outliers. As an example application, we used the server to deconvolute multiple specificities in large-scale peptidome data generated by mass spectrometry. The server is available at http://www.cbs.dtu.dk/services/GibbsCluster-2.0. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  13. MCAW-DB: A glycan profile database capturing the ambiguity of glycan recognition patterns.

    PubMed

    Hosoda, Masae; Takahashi, Yushi; Shiota, Masaaki; Shinmachi, Daisuke; Inomoto, Renji; Higashimoto, Shinichi; Aoki-Kinoshita, Kiyoko F

    2018-05-11

    Glycan-binding protein (GBP) interaction experiments, such as glycan microarrays, are often used to understand glycan recognition patterns. However, oftentimes the interpretation of glycan array experimental data makes it difficult to identify discrete GBP binding patterns due to their ambiguity. It is known that lectins, for example, are non-specific in their binding affinities; the same lectin can bind to different monosaccharides or even different glycan structures. In bioinformatics, several tools to mine the data generated from these sorts of experiments have been developed. These tools take a library of predefined motifs, which are commonly-found glycan patterns such as sialyl-Lewis X, and attempt to identify the motif(s) that are specific to the GBP being analyzed. In our previous work, as opposed to using predefined motifs, we developed the Multiple Carbohydrate Alignment with Weights (MCAW) tool to visualize the state of the glycans being recognized by the GBP under analysis. We previously reported on the effectiveness of our tool and algorithm by analyzing several glycan array datasets from the Consortium of Functional Glycomics (CFG). In this work, we report on our analysis of 1081 data sets which we collected from the CFG, the results of which we have made publicly and freely available as a database called MCAW-DB. We introduce this database, its usage and describe several analysis results. We show how MCAW-DB can be used to analyze glycan-binding patterns of GBPs amidst their ambiguity. For example, the visualization of glycan-binding patterns in MCAW-DB show how they correlate with the concentrations of the samples used in the array experiments. Using MCAW-DB, the patterns of glycans found to bind to various GBP-glycan binding proteins are visualized, indicating the binding "environment" of the glycans. Thus, the ambiguity of glycan recognition is numerically represented, along with the patterns of monosaccharides surrounding the binding region. 
    The profiles in MCAW-DB could potentially be used as predictors of affinity of unknown or novel glycans to particular GBPs by comparing how well they match the existing profiles for those GBPs. Moreover, as the glycan profiles of diseased tissues become available, glycan alignments could also be used to identify glycan biomarkers unique to that tissue. Databases of these alignments may be of great use for drug discovery. Copyright © 2018 The Authors. Published by Elsevier Ltd. All rights reserved.

  14. Novel Pieces for the Emerging Picture of Sulfoximines in Drug Discovery: Synthesis and Evaluation of Sulfoximine Analogues of Marketed Drugs and Advanced Clinical Candidates.

    PubMed

    Sirvent, Juan Alberto; Lücking, Ulrich

    2017-04-06

    Sulfoximines have gained considerable recognition as an important structural motif in drug discovery of late. In particular, the clinical kinase inhibitors for the treatment of cancer, roniciclib (pan-CDK inhibitor), BAY 1143572 (P-TEFb inhibitor), and AZD 6738 (ATR inhibitor), have recently drawn considerable attention. Whilst the interest in this underrepresented functional group in drug discovery is clearly on the rise, there remains an incomplete understanding of the medicinal-chemistry-relevant properties of sulfoximines. Herein we report the synthesis and in vitro characterization of a variety of sulfoximine analogues of marketed drugs and advanced clinical candidates to gain a better understanding of this neglected functional group and its potential in drug discovery. © 2017 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim.

  15. Sequential visibility-graph motifs

    NASA Astrophysics Data System (ADS)

    Iacovacci, Jacopo; Lacasa, Lucas

    2016-04-01

    Visibility algorithms transform time series into graphs and encode dynamical information in their topology, paving the way for graph-theoretical time series analysis as well as building a bridge between nonlinear dynamics and network science. In this work we introduce and study the concept of sequential visibility-graph motifs, smaller substructures of n consecutive nodes that appear with characteristic frequencies. We develop a theory to compute in an exact way the motif profiles associated with general classes of deterministic and stochastic dynamics. We find that this simple property is indeed a highly informative and computationally efficient feature capable of distinguishing among different dynamics and robust against noise contamination. We finally confirm that it can be used in practice to perform unsupervised learning, by extracting motif profiles from experimental heart-rate series and being able, accordingly, to disentangle meditative from other relaxation states. Applications of this general theory include the automatic classification and description of physical, biological, and financial time series.
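
    To make the idea of a motif profile concrete, here is a minimal sketch that slides a window of n consecutive points along a series, builds the visibility graph inside each window, and records how often each edge pattern appears. It assumes the natural visibility criterion (two points are linked if every point between them lies strictly below the straight line joining them); the function names are hypothetical and the paper's exact theory is not reproduced here.

```python
from collections import Counter
from itertools import combinations


def visible(series, a, b):
    """Natural visibility between indices a < b of series."""
    ya, yb = series[a], series[b]
    return all(
        series[c] < yb + (ya - yb) * (b - c) / (b - a)
        for c in range(a + 1, b)
    )


def motif_profile(series, n=4):
    """Relative frequency of each visibility pattern among windows of size n.

    Each pattern is the frozenset of edges (i, j), with i < j indexing
    positions inside the window. Returns {} if the series is shorter than n.
    """
    counts = Counter()
    for start in range(len(series) - n + 1):
        window = series[start:start + n]
        edges = frozenset(
            (i, j)
            for i, j in combinations(range(n), 2)
            if visible(window, i, j)
        )
        counts[edges] += 1
    total = sum(counts.values())
    return {motif: c / total for motif, c in counts.items()}


# A straight ramp: only neighbouring points see each other.
ramp = motif_profile([1, 2, 3, 4, 5], n=3)
# A valley: the two outer points see each other over the dip.
valley = motif_profile([3, 1, 3], n=3)
```

    Because visibility between two points depends only on the values lying between them, each window can be processed independently, which keeps the profile cheap to compute.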

  16. New structures of Fe3S for rare-earth-free permanent magnets

    DOE PAGES

    Yu, Shu; Zhao, Xin; Wu, Shunqing; ...

    2018-02-25

    We applied adaptive genetic algorithm (AGA) to search for low-energy crystal structures of Fe3S. A number of structures with energies lower than that of the experimentally reported Pnma and I-4 structures have been obtained from our AGA searches. These low-energy structures can be classified as layer-motif and column-motif structures. In the column-motif structures, Fe atoms self-assemble into rods with bcc type of underlying lattice, which are separated by the holes terminated by S atoms. In the layer-motif structures, the bulk Fe is broken into slabs of several layers passivated by S atoms. Magnetic properties calculations showed that the column-motif structures exhibit reasonably high uniaxial magnetic anisotropy. In addition, we examined the effect of Co doping to Fe3S and found magnetic anisotropy can be enhanced through Co doping.

  17. Cross-reactions vs co-sensitization evaluated by in silico motifs and in vitro IgE microarray testing.

    PubMed

    Pfiffner, P; Stadler, B M; Rasi, C; Scala, E; Mari, A

    2012-02-01

    Using an in silico allergen clustering method, we have recently shown that allergen extracts are highly cross-reactive. Here we used serological data from a multi-array IgE test based on recombinant or highly purified natural allergens to evaluate whether co-reactions are true cross-reactions or co-sensitizations by allergens with the same motifs. The serum database consisted of 3142 samples, each tested against 103 highly purified natural or recombinant allergens. Cross-reactivity was predicted by an iterative motif-finding algorithm through sequence motifs identified in 2708 known allergens. Allergen proteins containing the same motifs cross-reacted as predicted. However, proteins with identical motifs revealed a hierarchy in the degree of cross-reaction: The more frequent an allergen was positive in the allergic population, the less frequently it was cross-reacting and vice versa. Co-sensitization was analyzed by splitting the dataset into patient groups that were most likely sensitized through geographical occurrence of allergens. Interestingly, most co-reactions are cross-reactions but not co-sensitizations. The observed hierarchy of cross-reactivity may play an important role for the future management of allergic diseases. © 2011 John Wiley & Sons A/S.

  18. (φ,ψ)2-motifs: a purely conformation-based, fine-grained enumeration of protein parts at the two-residue level

    PubMed Central

    Hollingsworth, Scott A.; Lewis, Matthew C.; Berkholz, Donald S.; Wong, Weng-Keen; Karplus, P. Andrew

    2011-01-01

    A deep understanding of protein structure benefits from the use of a variety of classification strategies that enhance our ability to effectively describe local patterns of conformation. Here, we use a clustering algorithm to analyze 76,533 all-trans segments from protein structures solved at 1.2 Å resolution or better to create a purely φ,ψ-based comprehensive empirical categorization of common conformations adopted by two adjacent φ,ψ-pairs (i.e. (φ,ψ)2-motifs). The clustering algorithm works in an origin-shifted 4-dimensional space based on the two φ,ψ-pairs to yield a parameter-dependent list of (φ,ψ)2-motifs – in order of their prominence. The results are remarkably distinct from and complementary to the standard hydrogen-bond centered view of secondary structure. New insights include an unprecedented level of precision in describing the φ,ψ-angles of both previously known and novel motifs, an ordering of these motifs by their population density, a data-driven recommendation that the standard Cαi…Cαi+3 < 7 Å criteria for defining turns be changed to 6.5 Å, an identification of β-strand and turn capping motifs, and of conformational capping by residues in the polypeptide-II (PII) conformation. We further document that the conformational preferences of a residue are substantially influenced by the conformation of its neighbors, and suggest that accounting for these dependencies will improve protein modeling accuracy. Although the CUEVAS-4D(r10є14) “parts list” presented here is only an initial exploration of the complex (φ,ψ)2-landscape of proteins, it shows there is value to be had from this approach and opens the door to more in-depth characterizations at the (φ,ψ)2-level and at higher dimensions. PMID:22198294

  19. (φ,ψ)₂ motifs: a purely conformation-based fine-grained enumeration of protein parts at the two-residue level.

    PubMed

    Hollingsworth, Scott A; Lewis, Matthew C; Berkholz, Donald S; Wong, Weng-Keen; Karplus, P Andrew

    2012-02-10

    A deep understanding of protein structure benefits from the use of a variety of classification strategies that enhance our ability to effectively describe local patterns of conformation. Here, we use a clustering algorithm to analyze 76,533 all-trans segments from protein structures solved at 1.2 Å resolution or better to create a purely φ,ψ-based comprehensive empirical categorization of common conformations adopted by two adjacent φ,ψ pairs (i.e., (φ,ψ)(2) motifs). The clustering algorithm works in an origin-shifted four-dimensional space based on the two φ,ψ pairs to yield a parameter-dependent list of (φ,ψ)(2) motifs, in order of their prominence. The results are remarkably distinct from and complementary to the standard hydrogen-bond-centered view of secondary structure. New insights include an unprecedented level of precision in describing the φ,ψ angles of both previously known and novel motifs, ordering of these motifs by their population density, a data-driven recommendation that the standard C(α(i))…C(α(i+3))<7 Å criteria for defining turns be changed to 6.5 Å, identification of β-strand and turn capping motifs, and identification of conformational capping by residues in polypeptide II conformation. We further document that the conformational preferences of a residue are substantially influenced by the conformation of its neighbors, and we suggest that accounting for these dependencies will improve protein modeling accuracy. Although the CUEVAS-4D(r(10)є(14)) 'parts list' presented here is only an initial exploration of the complex (φ,ψ)(2) landscape of proteins, it shows that there is value to be had from this approach, and it opens the door to more in-depth characterizations at the (φ,ψ)(2) level and at higher dimensions. Copyright © 2011 Elsevier Ltd. All rights reserved.

  20. An integrative and applicable phylogenetic footprinting framework for cis-regulatory motifs identification in prokaryotic genomes.

    PubMed

    Liu, Bingqiang; Zhang, Hanyuan; Zhou, Chuan; Li, Guojun; Fennell, Anne; Wang, Guanghui; Kang, Yu; Liu, Qi; Ma, Qin

    2016-08-09

    Phylogenetic footprinting is an important computational technique for identifying cis-regulatory motifs in orthologous regulatory regions from multiple genomes, as motifs tend to evolve slower than their surrounding non-functional sequences. Its application, however, has several difficulties for optimizing the selection of orthologous data and reducing the false positives in motif prediction. Here we present an integrative phylogenetic footprinting framework for accurate motif predictions in prokaryotic genomes (MP(3)). The framework includes a new orthologous data preparation procedure, an additional promoter scoring and pruning method and an integration of six existing motif finding algorithms as basic motif search engines. Specifically, we collected orthologous genes from available prokaryotic genomes and built the orthologous regulatory regions based on sequence similarity of promoter regions. This procedure made full use of the large-scale genomic data and taxonomy information and filtered out the promoters with limited contribution to produce a high quality orthologous promoter set. The promoter scoring and pruning is implemented through motif voting by a set of complementary predicting tools that mine as many motif candidates as possible and simultaneously eliminate the effect of random noise. We have applied the framework to Escherichia coli k12 genome and evaluated the prediction performance through comparison with seven existing programs. This evaluation was systematically carried out at the nucleotide and binding site level, and the results showed that MP(3) consistently outperformed other popular motif finding tools. We have integrated MP(3) into our motif identification and analysis server DMINDA, allowing users to efficiently identify and analyze motifs in 2,072 completely sequenced prokaryotic genomes. The performance evaluation indicated that MP(3) is effective for predicting regulatory motifs in prokaryotic genomes. 
    Its application may enhance progress in elucidating transcription regulation mechanism, thus provide benefit to the genomic research community and prokaryotic genome researchers in particular.

  1. A quantum causal discovery algorithm

    NASA Astrophysics Data System (ADS)

    Giarmatzi, Christina; Costa, Fabio

    2018-03-01

    Finding a causal model for a set of classical variables is now a well-established task—but what about the quantum equivalent? Even the notion of a quantum causal model is controversial. Here, we present a causal discovery algorithm for quantum systems. The input to the algorithm is a process matrix describing correlations between quantum events. Its output consists of different levels of information about the underlying causal model. Our algorithm determines whether the process is causally ordered by grouping the events into causally ordered non-signaling sets. It detects if all relevant common causes are included in the process, which we label Markovian, or alternatively if some causal relations are mediated through some external memory. For a Markovian process, it outputs a causal model, namely the causal relations and the corresponding mechanisms, represented as quantum states and channels. Our algorithm opens the route to more general quantum causal discovery methods.

  2. Composition-dependent stability of the medium-range order responsible for metallic glass formation

    DOE PAGES

    Zhang, Feng; Ji, Min; Fang, Xiao-Wei; ...

    2014-09-18

    The competition between the characteristic medium-range order corresponding to amorphous alloys and that in ordered crystalline phases is central to phase selection and morphology evolution under various processing conditions. We examine the stability of a model glass system, Cu–Zr, by comparing the energetics of various medium-range structural motifs over a wide range of compositions using first-principles calculations. Furthermore, we focus specifically on motifs that represent possible building blocks for competing glassy and crystalline phases, and we employ a genetic algorithm to efficiently identify the energetically favored decorations of each motif for specific compositions. These results show that a Bergman-type motif with crystallization-resisting icosahedral symmetry is energetically most favorable in the composition range 0.63 < xCu < 0.68, and is the underlying motif for one of the three optimal glass-forming ranges observed experimentally for this binary system (Li et al., 2008). This work establishes an energy-based methodology to evaluate specific medium-range structural motifs which compete with stable crystalline nuclei in deeply undercooled liquids.

  3. Identification of family-specific residue packing motifs and their use for structure-based protein function prediction: I. Method development.

    PubMed

    Bandyopadhyay, Deepak; Huan, Jun; Prins, Jan; Snoeyink, Jack; Wang, Wei; Tropsha, Alexander

    2009-11-01

    Protein function prediction is one of the central problems in computational biology. We present a novel automated protein structure-based function prediction method using libraries of local residue packing patterns that are common to most proteins in a known functional family. Critical to this approach is the representation of a protein structure as a graph where residue vertices (residue name used as a vertex label) are connected by geometrical proximity edges. The approach employs two steps. First, it uses a fast subgraph mining algorithm to find all occurrences of family-specific labeled subgraphs for all well characterized protein structural and functional families. Second, it queries a new structure for occurrences of a set of motifs characteristic of a known family, using a graph index to speed up Ullmann's subgraph isomorphism algorithm. The confidence of function inference from structure depends on the number of family-specific motifs found in the query structure compared with their distribution in a large non-redundant database of proteins. This method can assign a new structure to a specific functional family in cases where sequence alignments, sequence patterns, structural superposition and active site templates fail to provide accurate annotation.
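
    The graph representation described above can be sketched as follows. The abstract does not say how "geometrical proximity" is defined, so this example assumes a plain distance cutoff between one representative coordinate per residue; the function name, the 8.0 cutoff, and the toy coordinates are all hypothetical.

```python
from itertools import combinations
from math import dist


def residue_graph(residues, cutoff=8.0):
    """Build a labeled proximity graph from residues.

    residues: list of (residue_name, (x, y, z)) tuples.
    Returns (labels, edges): labels[i] is the vertex label of residue i,
    edges is a set of index pairs (i, j), i < j, whose coordinates lie
    within `cutoff` of each other (assumed proximity rule).
    """
    labels = [name for name, _ in residues]
    edges = {
        (i, j)
        for (i, (_, p)), (j, (_, q)) in combinations(enumerate(residues), 2)
        if dist(p, q) <= cutoff
    }
    return labels, edges


# Three toy residues: the first two are 5 units apart, the third is far away.
toy = [("ALA", (0.0, 0.0, 0.0)),
       ("GLY", (3.0, 4.0, 0.0)),
       ("SER", (20.0, 0.0, 0.0))]
labels, edges = residue_graph(toy)
```

    A labeled graph in this form is the kind of input that subgraph mining and subgraph isomorphism routines operate on.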

  4. DNA motifs associated with aberrant CpG island methylation.

    PubMed

    Feltus, F Alex; Lee, Eva K; Costello, Joseph F; Plass, Christoph; Vertino, Paula M

    2006-05-01

    Epigenetic silencing involving the aberrant methylation of promoter region CpG islands is widely recognized as a tumor suppressor silencing mechanism in cancer. However, the molecular pathways underlying aberrant DNA methylation remain elusive. Recently we showed that, on a genome-wide level, CpG island loci differ in their intrinsic susceptibility to aberrant methylation and that this susceptibility can be predicted based on underlying sequence context. These data suggest that there are sequence/structural features that contribute to the protection from or susceptibility to aberrant methylation. Here we use motif elicitation coupled with classification techniques to identify DNA sequence motifs that selectively define methylation-prone or methylation-resistant CpG islands. Motifs common to 28 methylation-prone or 47 methylation-resistant CpG island-containing genomic fragments were determined using the MEME and MAST algorithms. The five most discriminatory motifs derived from methylation-prone sequences were found to be associated with CpG islands in general and were nonrandomly distributed throughout the genome. In contrast, the eight most discriminatory motifs derived from the methylation-resistant CpG islands were randomly distributed throughout the genome. Interestingly, this latter group tended to associate with Alu and other repetitive sequences. Used together, the frequency of occurrence of these motifs successfully discriminated methylation-prone and methylation-resistant CpG island groups with an accuracy of 87% after 10-fold cross-validation. The motifs identified here are candidate methylation-targeting or methylation-protection DNA sequences.

  5. Trend Motif: A Graph Mining Approach for Analysis of Dynamic Complex Networks

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Jin, R; McCallen, S; Almaas, E

    2007-05-28

    Complex networks have been used successfully in scientific disciplines ranging from sociology to microbiology to describe systems of interacting units. Until recently, studies of complex networks have mainly focused on their network topology. However, in many real world applications, the edges and vertices have associated attributes that are frequently represented as vertex or edge weights. Furthermore, these weights are often not static, instead changing with time and forming a time series. Hence, to fully understand the dynamics of the complex network, we have to consider both network topology and related time series data. In this work, we propose a motif mining approach to identify trend motifs for such purposes. Simply stated, a trend motif describes a recurring subgraph where each of its vertices or edges displays similar dynamics over a user-defined period. Given this, each trend motif occurrence can help reveal significant events in a complex system; frequent trend motifs may aid in uncovering dynamic rules of change for the system, and the distribution of trend motifs may characterize the global dynamics of the system. Here, we have developed efficient mining algorithms to extract trend motifs. Our experimental validation using three disparate empirical datasets, ranging from the stock market, world trade, to a protein interaction network, has demonstrated the efficiency and effectiveness of our approach.

  6. A systematic analysis of a mi-RNA inter-pathway regulatory motif

    PubMed Central

    2013-01-01

    Background: The continuing discovery of new types and functions of small non-coding RNAs suggests the presence of regulatory mechanisms far more complex than the ones currently used to study and design Gene Regulatory Networks. Focusing just on micro RNAs (miRNAs), they have been found to be part of several intra-pathway regulatory motifs. However, inter-pathway regulatory mechanisms have often been neglected and require further investigation. Results: In this paper we present the result of a systems biology study aimed at analyzing a high-level inter-pathway regulatory motif called Pathway Protection Loop, not previously described, in which miRNAs seem to play a crucial role in the successful behavior and activation of a pathway. Through the automatic analysis of a large set of publicly available databases, we found statistical evidence that this inter-pathway regulatory motif is very common in several classes of KEGG Homo sapiens pathways and concurs in creating a complex regulatory network involving several pathways connected by this specific motif. The role of this motif also seems to be confirmed by a deeper review of other research activities on selected representative pathways. Conclusions: Although previous studies suggested transcriptional regulation mechanisms at the pathway level such as the Pathway Protection Loop, a high-level analysis like the one proposed in this paper is still missing. The understanding of higher-level regulatory motifs could, for instance, lead to new approaches in the identification of therapeutic targets because it could unveil new and “indirect” paths to activate or silence a target pathway. However, a lot of work still needs to be done to better uncover this high-level inter-pathway regulation, including enlarging the analysis to other small non-coding RNA molecules. PMID:24152805

  7. Identification of helix capping and β-turn motifs from NMR chemical shifts

    PubMed Central

    Shen, Yang; Bax, Ad

    2012-01-01

    We present an empirical method for identification of distinct structural motifs in proteins on the basis of experimentally determined backbone and 13Cβ chemical shifts. Elements identified include the N-terminal and C-terminal helix capping motifs and five types of β-turns: I, II, I′, II′ and VIII. Using a database of proteins of known structure, the NMR chemical shifts, together with the PDB-extracted amino acid preference of the helix capping and β-turn motifs are used as input data for training an artificial neural network algorithm, which outputs the statistical probability of finding each motif at any given position in the protein. The trained neural networks, contained in the MICS (motif identification from chemical shifts) program, also provide a confidence level for each of their predictions, and values ranging from ca 0.7–0.9 for the Matthews correlation coefficient of its predictions far exceed that attainable by sequence analysis. MICS is anticipated to be useful both in the conventional NMR structure determination process and for enhancing on-going efforts to determine protein structures solely on the basis of chemical shift information, where it can aid in identifying protein database fragments suitable for use in building such structures. PMID:22314702
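The abstract above reports prediction quality as a Matthews correlation coefficient (MCC). As a minimal sketch of how that figure is computed from binary confusion-matrix counts, the following uses the standard MCC formula; the function name and the example counts are illustrative assumptions and are not taken from MICS.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: report 0 when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts for one motif class (not data from the paper).
example_mcc = matthews_corrcoef(tp=80, tn=900, fp=10, fn=10)
print(round(example_mcc, 3))
```

Unlike plain accuracy, the MCC stays informative when one class (here, "motif present") is much rarer than the other, which is the usual situation for per-residue motif prediction.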

  8. Unique scorpion toxin with a putative ancestral fold provides insight into evolution of the inhibitor cystine knot motif.

    PubMed

    Smith, Jennifer J; Hill, Justine M; Little, Michelle J; Nicholson, Graham M; King, Glenn F; Alewood, Paul F

    2011-06-28

    The three-disulfide inhibitor cystine knot (ICK) motif is a fold common to venom peptides from spiders, scorpions, and aquatic cone snails. Over a decade ago it was proposed that the ICK motif is an elaboration of an ancestral two-disulfide fold coined the disulfide-directed β-hairpin (DDH). Here we report the isolation, characterization, and structure of a novel toxin [U(1)-liotoxin-Lw1a (U(1)-LITX-Lw1a)] from the venom of the scorpion Liocheles waigiensis that is the first example of a native peptide that adopts the DDH fold. U(1)-LITX-Lw1a not only represents the discovery of a missing link in venom protein evolution, it is the first member of a fourth structural fold to be adopted by scorpion-venom peptides. Additionally, we show that U(1)-LITX-Lw1a has potent insecticidal activity across a broad range of insect pest species, thereby providing a unique structural scaffold for bioinsecticide development.

  9. RSAT 2015: Regulatory Sequence Analysis Tools

    PubMed Central

    Medina-Rivera, Alejandra; Defrance, Matthieu; Sand, Olivier; Herrmann, Carl; Castro-Mondragon, Jaime A.; Delerce, Jeremy; Jaeger, Sébastien; Blanchet, Christophe; Vincens, Pierre; Caron, Christophe; Staines, Daniel M.; Contreras-Moreira, Bruno; Artufel, Marie; Charbonnier-Khamvongsa, Lucie; Hernandez, Céline; Thieffry, Denis; Thomas-Chollier, Morgane; van Helden, Jacques

    2015-01-01

    RSAT (Regulatory Sequence Analysis Tools) is a modular software suite for the analysis of cis-regulatory elements in genome sequences. Its main applications are (i) motif discovery, appropriate to genome-wide data sets like ChIP-seq, (ii) transcription factor binding motif analysis (quality assessment, comparisons and clustering), (iii) comparative genomics and (iv) analysis of regulatory variations. Nine new programs have been added to the 43 described in the 2011 NAR Web Software Issue, including a tool to extract sequences from a list of coordinates (fetch-sequences from UCSC), novel programs dedicated to the analysis of regulatory variants from GWAS or population genomics (retrieve-variation-seq and variation-scan), a program to cluster motifs and visualize the similarities as trees (matrix-clustering). To deal with the drastic increase of sequenced genomes, RSAT public sites have been reorganized into taxon-specific servers. The suite is well-documented with tutorials and published protocols. The software suite is available through Web sites, SOAP/WSDL Web services, virtual machines and stand-alone programs at http://www.rsat.eu/. PMID:25904632

  10. Modular and configurable optimal sequence alignment software: Cola.

    PubMed

    Zamani, Neda; Sundström, Görel; Höppner, Marc P; Grabherr, Manfred G

    2014-01-01

    The fundamental challenge in optimally aligning homologous sequences is to define a scoring scheme that best reflects the underlying biological processes. Maximising the overall number of matches in the alignment does not always reflect the patterns by which nucleotides mutate. Efficiently implemented algorithms that can be parameterised to accommodate more complex non-linear scoring schemes are thus desirable. We present Cola, alignment software that implements different optimal alignment algorithms, also allowing for scoring contiguous matches of nucleotides in a nonlinear manner. The latter places more emphasis on short, highly conserved motifs, and less on the surrounding nucleotides, which can be more diverged. To illustrate the differences, we report results from aligning 14,100 sequences from 3' untranslated regions of human genes to 25 of their mammalian counterparts, where we found that a nonlinear scoring scheme is more consistent than a linear scheme in detecting short, conserved motifs. Cola is freely available under LPGL from https://github.com/nedaz/cola.

  11. Research on hotspot discovery in internet public opinions based on improved K-means.

    PubMed

    Wang, Gensheng

    2013-01-01

    Discovering hotspots in Internet public opinion effectively is an active research area, and it plays a key role in helping governments and corporations find useful information in the mass of data on the Internet. An improved K-means algorithm for hotspot discovery in Internet public opinion is presented, based on an analysis of the defects and calculation principle of the original K-means algorithm. First, new methods are designed to preprocess website texts, to select and express their characteristics, and to define the similarity between two website texts. Second, the clustering principle and the method for selecting initial cluster centers are analyzed and improved in order to overcome the limitations of the original K-means algorithm. Finally, experimental results verify that the improved algorithm improves the clustering stability and classification accuracy of hotspot discovery in Internet public opinion when used in practice.

  12. Research on Hotspot Discovery in Internet Public Opinions Based on Improved K-Means

    PubMed Central

    2013-01-01

    Discovering hotspots in Internet public opinion effectively is an active research area, and it plays a key role in helping governments and corporations find useful information in the mass of data on the Internet. An improved K-means algorithm for hotspot discovery in Internet public opinion is presented, based on an analysis of the defects and calculation principle of the original K-means algorithm. First, new methods are designed to preprocess website texts, to select and express their characteristics, and to define the similarity between two website texts. Second, the clustering principle and the method for selecting initial cluster centers are analyzed and improved in order to overcome the limitations of the original K-means algorithm. Finally, experimental results verify that the improved algorithm improves the clustering stability and classification accuracy of hotspot discovery in Internet public opinion when used in practice. PMID:24106496

  13. Sequence Bundles: a novel method for visualising, discovering and exploring sequence motifs

    PubMed Central

    2014-01-01

    Background We introduce Sequence Bundles--a novel data visualisation method for representing multiple sequence alignments (MSAs). We identify and address key limitations of the existing bioinformatics data visualisation methods (i.e. the Sequence Logo) by enabling Sequence Bundles to give salient visual expression to sequence motifs and other data features, which would otherwise remain hidden. Methods For the development of Sequence Bundles we employed research-led information design methodologies. Sequences are encoded as uninterrupted, semi-opaque lines plotted on a 2-dimensional reconfigurable grid. Each line represents a single sequence. The thickness and opacity of the stack at each residue in each position indicates the level of conservation and the lines' curved paths expose patterns in correlation and functionality. Several MSAs can be visualised in a composite image. The Sequence Bundles method is designed to favour a tangible, continuous and intuitive display of information. Results We have developed a software demonstration application for generating a Sequence Bundles visualisation of MSAs provided for the BioVis 2013 redesign contest. A subsequent exploration of the visualised line patterns allowed for the discovery of a number of interesting features in the dataset. Reported features include the extreme conservation of sequences displaying a specific residue and bifurcations of the consensus sequence. Conclusions Sequence Bundles is a novel method for visualisation of MSAs and the discovery of sequence motifs. It can aid in generating new insight and hypothesis making. Sequence Bundles is well disposed for future implementation as an interactive visual analytics software, which can complement existing visualisation tools. PMID:25237395

  14. Heterocyclic N-Oxides – An Emerging Class of Therapeutic Agents

    PubMed Central

    Mfuh, Adelphe M.; Larionov, Oleg V.

    2016-01-01

    Heterocyclic N-oxides have emerged as potent compounds with anticancer, antibacterial, antihypertensive, antiparasitic, anti-HIV, anti-inflammatory, herbicidal, neuroprotective, and procognitive activities. The N-oxide motif has been successfully employed in a number of recent drug development projects. This review surveys the emergence of this scaffold in the mainstream medicinal chemistry with a focus on the discovery of the heterocyclic N-oxide drugs, N-oxide-specific mechanisms of action, drug-receptor interactions and synthetic avenues to these compounds. As the first review on this subject that covers the developments since 1950s to date, it is expected that it will inspire wider implementation of the heterocyclic N-oxide motif in the rational design of new medicinal agents. PMID:26087764

  15. RNA Graph Partitioning for the Discovery of RNA Modularity: A Novel Application of Graph Partition Algorithm to Biology

    PubMed Central

    Elmetwaly, Shereef; Schlick, Tamar

    2014-01-01

    Graph representations have been widely used to analyze and design various economic, social, military, political, and biological networks. In systems biology, networks of cells and organs are useful for understanding disease and medical treatments and, in structural biology, structures of molecules can be described, including RNA structures. In our RNA-As-Graphs (RAG) framework, we represent RNA structures as tree graphs by translating unpaired regions into vertices and helices into edges. Here we explore the modularity of RNA structures by applying graph partitioning known in graph theory to divide an RNA graph into subgraphs. To our knowledge, this is the first application of graph partitioning to biology, and the results suggest a systematic approach for modular design in general. The graph partitioning algorithms utilize mathematical properties of the Laplacian eigenvector (µ2) corresponding to the second eigenvalue (λ2) associated with the topology matrix defining the graph: λ2 describes the overall topology, and the sum of µ2′s components is zero. The three types of algorithms, termed median, sign, and gap cuts, divide a graph by determining nodes of cut by median, zero, and largest gap of µ2′s components, respectively. We apply these algorithms to 45 graphs corresponding to all solved RNA structures up through 11 vertices (∼220 nucleotides). While we observe that the median cut divides a graph into two similar-sized subgraphs, the sign and gap cuts partition a graph into two topologically-distinct subgraphs. We find that the gap cut produces the best biologically-relevant partitioning for RNA because it divides RNAs at less stable connections while maintaining junctions intact. The iterative gap cuts suggest basic modules and assembly protocols to design large RNA structures. Our graph substructuring thus suggests a systematic approach to explore the modularity of biological networks. In our applications to RNA structures, subgraphs also suggest design strategies for novel RNA motifs. PMID:25188578
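The median, sign, and gap cuts described above can be sketched in a few lines. This is an illustrative reading of the abstract, not the RAG implementation: the function names, the toy graph, the tie-breaking rules, and the use of power iteration (rather than a full eigensolver) to approximate the eigenvector µ2 are all assumptions. The sketch also assumes a connected graph whose second Laplacian eigenvalue is not repeated.

```python
import math


def fiedler_vector(adj, iters=500):
    """Approximate the Laplacian eigenvector mu2 of a connected undirected graph.

    adj maps each vertex 0..n-1 to a list of its neighbours. Power iteration is
    run on (c*I - L) while projecting out the all-ones vector, so the iterate
    converges to the eigenvector of the second-smallest Laplacian eigenvalue.
    """
    n = len(adj)
    c = 2 * max(len(nb) for nb in adj.values()) + 1  # exceeds every eigenvalue of L
    x = [float(i) for i in range(n)]  # deterministic, non-constant start
    for _ in range(iters):
        mean = sum(x) / n
        x = [v - mean for v in x]  # stay orthogonal to the all-ones vector
        # y = (c*I - L) x, with L = D - A
        y = [(c - len(adj[i])) * x[i] + sum(x[j] for j in adj[i]) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    return x


def median_cut(mu2):
    """Split the vertices into two halves around the median component of mu2."""
    order = sorted(range(len(mu2)), key=lambda i: mu2[i])
    half = len(order) // 2
    return set(order[:half]), set(order[half:])


def sign_cut(mu2):
    """Split the vertices by the sign of their mu2 component."""
    positive = {i for i, v in enumerate(mu2) if v >= 0}
    return positive, set(range(len(mu2))) - positive


def gap_cut(mu2):
    """Split the vertices at the largest gap between sorted mu2 components."""
    order = sorted(range(len(mu2)), key=lambda i: mu2[i])
    gaps = [mu2[order[k + 1]] - mu2[order[k]] for k in range(len(order) - 1)]
    k = max(range(len(gaps)), key=gaps.__getitem__)
    return set(order[: k + 1]), set(order[k + 1:])


# Toy graph (not an RNA graph from the paper): two triangles joined by one edge.
toy_graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
mu2 = fiedler_vector(toy_graph)
print(median_cut(mu2), sign_cut(mu2), gap_cut(mu2))
```

On this symmetric toy graph all three cuts separate the two triangles at the bridging edge; on less symmetric graphs the cuts can disagree, which is the distinction the abstract draws between them.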

  16. An analysis of the positional distribution of DNA motifs in promoter regions and its biological relevance.

    PubMed

    Casimiro, Ana C; Vinga, Susana; Freitas, Ana T; Oliveira, Arlindo L

    2008-02-07

    Motif finding algorithms have developed in their ability to use computationally efficient methods to detect patterns in biological sequences. However the posterior classification of the output still suffers from some limitations, which makes it difficult to assess the biological significance of the motifs found. Previous work has highlighted the existence of positional bias of motifs in the DNA sequences, which might indicate not only that the pattern is important, but also provide hints of the positions where these patterns occur preferentially. We propose to integrate position uniformity tests and over-representation tests to improve the accuracy of the classification of motifs. Using artificial data, we have compared three different statistical tests (Chi-Square, Kolmogorov-Smirnov and a Chi-Square bootstrap) to assess whether a given motif occurs uniformly in the promoter region of a gene. Using the test that performed better in this dataset, we proceeded to study the positional distribution of several well known cis-regulatory elements, in the promoter sequences of different organisms (S. cerevisiae, H. sapiens, D. melanogaster, E. coli and several Dicotyledons plants). The results show that position conservation is relevant for the transcriptional machinery. We conclude that many biologically relevant motifs appear heterogeneously distributed in the promoter region of genes, and therefore, that non-uniformity is a good indicator of biological relevance and can be used to complement over-representation tests commonly used. In this article we present the results obtained for the S. cerevisiae data sets.
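To make the idea of a position uniformity test concrete, here is a minimal sketch of two of the statistics named above, computed for motif start positions within a promoter region. It only computes the test statistics (no p-values), and the function names, the binning scheme, and the example positions are illustrative assumptions rather than the authors' procedure.

```python
def chi_square_uniform(positions, region_length, n_bins):
    """Chi-square statistic for motif positions against a uniform distribution.

    positions are 0-based offsets with 0 <= p < region_length. The region is cut
    into n_bins bins; equal expected counts assume region_length is a multiple
    of n_bins.
    """
    counts = [0] * n_bins
    for p in positions:
        counts[min(p * n_bins // region_length, n_bins - 1)] += 1
    expected = len(positions) / n_bins
    return sum((c - expected) ** 2 / expected for c in counts)


def ks_uniform(positions, region_length):
    """Kolmogorov-Smirnov statistic against the uniform distribution on [0, 1)."""
    xs = sorted(p / region_length for p in positions)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


# Hypothetical motif occurrences clustered near the start of a 1000 bp region.
clustered = [12, 25, 31, 40, 44, 58, 63, 70, 77, 95]
print(chi_square_uniform(clustered, 1000, 10), ks_uniform(clustered, 1000))
```

In practice the chi-square statistic would be compared with a chi-square distribution on n_bins - 1 degrees of freedom (or with a bootstrap distribution, as in the third test mentioned above) to decide whether the positional bias is significant.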

  17. Genetically Encoded Fragment-Based Discovery of Glycopeptide Ligands for Carbohydrate-Binding Proteins

    DOE PAGES

    Ng, Simon; Lin, Edith; Kitov, Pavel I.; ...

    2015-04-10

    Here we describe an approach to accelerate the search for competitive inhibitors for carbohydrate-recognition domains (CRDs). Genetically encoded fragment-based-discovery (GE-FBD) uses selection of phage-displayed glycopeptides to dock a glycan fragment at the CRD and guide selection of synergistic peptide motifs adjacent to the CRD. Starting from concanavalin A (ConA), a mannose (Man)-binding protein, as a bait, we narrowed a library of 10^8 glycopeptides to 86 leads that share a consensus motif, Man-WYD. Validation of synthetic leads yielded Man-WYDLF that exhibited 40- to 50-fold enhancement in affinity over methyl α-D-mannopyranoside (MeMan). Lectin array suggested specificity: Man-WYD derivative bound only to 3 out of 17 proteins (ConA, LcH, and PSA) that bind to Man. An X-ray structure of ConA:Man-WYD proved that the trimannoside core and Man-WYD exhibit identical CRD docking, but their extra-CRD binding modes are significantly different. Still, they have comparable affinity and selectivity for various Man-binding proteins. The intriguing observation provides new insight into functional mimicry of carbohydrates by peptide ligands. GE-FBD may provide an alternative to rapidly search for competitive inhibitors for lectins.

  18. Genetically Encoded Fragment-Based Discovery of Glycopeptide Ligands for Carbohydrate-Binding Proteins

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ng, Simon; Lin, Edith; Kitov, Pavel I.

    Here we describe an approach to accelerate the search for competitive inhibitors for carbohydrate-recognition domains (CRDs). Genetically encoded fragment-based-discovery (GE-FBD) uses selection of phage-displayed glycopeptides to dock a glycan fragment at the CRD and guide selection of synergistic peptide motifs adjacent to the CRD. Starting from concanavalin A (ConA), a mannose (Man)-binding protein, as a bait, we narrowed a library of 10^8 glycopeptides to 86 leads that share a consensus motif, Man-WYD. Validation of synthetic leads yielded Man-WYDLF that exhibited 40- to 50-fold enhancement in affinity over methyl α-D-mannopyranoside (MeMan). Lectin array suggested specificity: Man-WYD derivative bound only to 3 out of 17 proteins (ConA, LcH, and PSA) that bind to Man. An X-ray structure of ConA:Man-WYD proved that the trimannoside core and Man-WYD exhibit identical CRD docking, but their extra-CRD binding modes are significantly different. Still, they have comparable affinity and selectivity for various Man-binding proteins. The intriguing observation provides new insight into functional mimicry of carbohydrates by peptide ligands. GE-FBD may provide an alternative to rapidly search for competitive inhibitors for lectins.

  19. A Feature-Based Approach to Modeling Protein–DNA Interactions

    PubMed Central

    Segal, Eran

    2008-01-01

    Transcription factor (TF) binding to its DNA target site is a fundamental regulatory interaction. The most common model used to represent TF binding specificities is a position specific scoring matrix (PSSM), which assumes independence between binding positions. However, in many cases, this simplifying assumption does not hold. Here, we present feature motif models (FMMs), a novel probabilistic method for modeling TF–DNA interactions, based on log-linear models. Our approach uses sequence features to represent TF binding specificities, where each feature may span multiple positions. We develop the mathematical formulation of our model and devise an algorithm for learning its structural features from binding site data. We also developed a discriminative motif finder, which discovers de novo FMMs that are enriched in target sets of sequences compared to background sets. We evaluate our approach on synthetic data and on the widely used TF chromatin immunoprecipitation (ChIP) dataset of Harbison et al. We then apply our algorithm to high-throughput TF ChIP data from mouse and human, reveal sequence features that are present in the binding specificities of mouse and human TFs, and show that FMMs explain TF binding significantly better than PSSMs. Our FMM learning and motif finder software are available at http://genie.weizmann.ac.il/. PMID:18725950
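For context on the baseline model the abstract contrasts with FMMs, the following is a minimal sketch of a position specific scoring matrix: each column is scored independently and the column scores are summed, which is exactly the independence assumption discussed above. The function names, the pseudocount, the uniform background, and the toy binding sites are illustrative assumptions; this is not the FMM method or the authors' software.

```python
import math


def build_pssm(sites, alphabet="ACGT", pseudocount=1.0):
    """Build a log-odds PSSM from equal-length binding sites (uniform background)."""
    background = 1.0 / len(alphabet)
    total = len(sites) + pseudocount * len(alphabet)
    pssm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pssm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in alphabet
        })
    return pssm


def score(pssm, window):
    """Sum per-position log-odds; positions contribute independently."""
    return sum(column[base] for column, base in zip(pssm, window))


def best_hit(pssm, sequence):
    """Return (score, offset) of the highest-scoring window in the sequence."""
    width = len(pssm)
    return max(
        (score(pssm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


# Hypothetical binding sites, for illustration only.
toy_pssm = build_pssm(["TATA", "TATA", "TATT"])
print(best_hit(toy_pssm, "GGTATAGG"))
```

Because the score is a plain sum over columns, a PSSM cannot express that a base at one position is preferred only when a particular base occurs at another; features spanning several positions, as in the FMMs described above, are one way to capture such dependencies.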

  20. Discovery of potent HIV-1 non-nucleoside reverse transcriptase inhibitors from arylthioacetanilide structural motif.

    PubMed

    Li, Wenxin; Li, Xiao; De Clercq, Erik; Zhan, Peng; Liu, Xinyong

    2015-09-18

    The poor pharmacokinetics, side effects and particularly the rapid emergence of drug resistance compromise the efficacy of the clinically used anti-HIV drugs. Therefore, the discovery of novel and effective NNRTIs remains a high priority. The arylthioacetanilide family is one of the highly active classes of HIV-1 NNRTIs against wild-type (WT) HIV-1 and a wide range of drug-resistant mutant strains. In particular, VRX-480773 and RDEA806 have been chosen as candidates for further clinical studies. In this article, we review the discovery and development of the arylthioacetanilides, paying particular attention to the structural modifications, SAR conclusions and molecular modeling. Moreover, several medicinal chemistry strategies to overcome drug resistance involved in the optimization process of arylthioacetanilides are highlighted, providing valuable clues for further investigations. Copyright © 2015 Elsevier Masson SAS. All rights reserved.

  1. Structural details (kinks and non-α conformations) in transmembrane helices are intrahelically determined and can be predicted by sequence pattern descriptors

    PubMed Central

    Rigoutsos, Isidore; Riek, Peter; Graham, Robert M.; Novotny, Jiri

    2003-01-01

    One of the promising methods of protein structure prediction involves the use of amino acid sequence-derived patterns. Here we report on the creation of non-degenerate motif descriptors derived through data mining of training sets of residues taken from the transmembrane-spanning segments of polytopic proteins. These residues correspond to short regions in which there is a deviation from the regular α-helical character (i.e. π-helices, 310-helices and kinks). A ‘search engine’ derived from these motif descriptors correctly identifies, and discriminates amongst instances of the above ‘non-canonical’ helical motifs contained in the SwissProt/TrEMBL database of protein primary structures. Our results suggest that deviations from α-helicity are encoded locally in sequence patterns only about 7–9 residues long and can be determined in silico directly from the amino acid sequence. Delineation of such variations in helical habit is critical to understanding the complex structure–function relationships of polytopic proteins and for drug discovery. The success of our current methodology foretells development of similar prediction tools capable of identifying other structural motifs from sequence alone. The method described here has been implemented and is available on the World Wide Web at http://cbcsrv.watson.ibm.com/Ttkw.html. PMID:12888523

  2. Structural details (kinks and non-alpha conformations) in transmembrane helices are intrahelically determined and can be predicted by sequence pattern descriptors.

    PubMed

    Rigoutsos, Isidore; Riek, Peter; Graham, Robert M; Novotny, Jiri

    2003-08-01

    One of the promising methods of protein structure prediction involves the use of amino acid sequence-derived patterns. Here we report on the creation of non-degenerate motif descriptors derived through data mining of training sets of residues taken from the transmembrane-spanning segments of polytopic proteins. These residues correspond to short regions in which there is a deviation from the regular alpha-helical character (i.e. pi-helices, 3(10)-helices and kinks). A 'search engine' derived from these motif descriptors correctly identifies, and discriminates amongst instances of the above 'non-canonical' helical motifs contained in the SwissProt/TrEMBL database of protein primary structures. Our results suggest that deviations from alpha-helicity are encoded locally in sequence patterns only about 7-9 residues long and can be determined in silico directly from the amino acid sequence. Delineation of such variations in helical habit is critical to understanding the complex structure-function relationships of polytopic proteins and for drug discovery. The success of our current methodology foretells development of similar prediction tools capable of identifying other structural motifs from sequence alone. The method described here has been implemented and is available on the World Wide Web at http://cbcsrv.watson.ibm.com/Ttkw.html.

  3. SMARTIV: combined sequence and structure de-novo motif discovery for in-vivo RNA binding data.

    PubMed

    Polishchuk, Maya; Paz, Inbal; Yakhini, Zohar; Mandel-Gutfreund, Yael

    2018-05-25

    Gene expression regulation is highly dependent on binding of RNA-binding proteins (RBPs) to their RNA targets. Growing evidence supports the notion that both RNA primary sequence and its local secondary structure play a role in specific Protein-RNA recognition and binding. Despite the great advance in high-throughput experimental methods for identifying sequence targets of RBPs, predicting the specific sequence and structure binding preferences of RBPs remains a major challenge. We present a novel webserver, SMARTIV, designed for discovering and visualizing combined RNA sequence and structure motifs from high-throughput RNA-binding data, generated from in-vivo experiments. The uniqueness of SMARTIV is that it predicts motifs from enriched k-mers that combine information from ranked RNA sequences and their predicted secondary structure, obtained using various folding methods. Consequently, SMARTIV generates Position Weight Matrices (PWMs) in a combined sequence and structure alphabet with assigned P-values. SMARTIV concisely represents the sequence and structure motif content as a single graphical logo, which is informative and easy for visual perception. SMARTIV was examined extensively on a variety of high-throughput binding experiments for RBPs from different families, generated from different technologies, showing consistent and accurate results. Finally, SMARTIV is a user-friendly webserver, highly efficient in run-time and freely accessible via http://smartiv.technion.ac.il/.

  4. Computational prediction of new auxetic materials.

    PubMed

    Dagdelen, John; Montoya, Joseph; de Jong, Maarten; Persson, Kristin

    2017-08-22

    Auxetics comprise a rare family of materials that manifest negative Poisson's ratio, which causes an expansion instead of contraction under tension. Most known homogeneously auxetic materials are porous foams or artificial macrostructures and there are few examples of inorganic materials that exhibit this behavior as polycrystalline solids. It is now possible to accelerate the discovery of materials with target properties, such as auxetics, using high-throughput computations, open databases, and efficient search algorithms. Candidates exhibiting features correlating with auxetic behavior were chosen from the set of more than 67 000 materials in the Materials Project database. Poisson's ratios were derived from the calculated elastic tensor of each material in this reduced set of compounds. We report that this strategy results in the prediction of three previously unidentified homogeneously auxetic materials as well as a number of compounds with a near-zero homogeneous Poisson's ratio, which are here denoted "anepirretic materials". There are very few inorganic materials with auxetic homogenous Poisson's ratio in polycrystalline form. Here authors develop an approach to screening materials databases for target properties such as negative Poisson's ratio by using stability and structural motifs to predict new instances of homogenous auxetic behavior as well as a number of materials with near-zero Poisson's ratio.
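As a rough illustration of deriving a homogeneous Poisson's ratio from an elastic tensor, the sketch below uses the standard isotropic relation between Poisson's ratio and the bulk and shear moduli, with Voigt averages of the 6x6 stiffness matrix. The abstract does not state which averaging scheme was used, so the Voigt choice, the function names, and the example tensor are assumptions.

```python
def voigt_moduli(c):
    """Voigt-average bulk and shear moduli from a 6x6 stiffness matrix (Voigt notation)."""
    diag = c[0][0] + c[1][1] + c[2][2]
    off = c[0][1] + c[1][2] + c[0][2]
    shear = c[3][3] + c[4][4] + c[5][5]
    bulk_modulus = (diag + 2 * off) / 9
    shear_modulus = (diag - off + 3 * shear) / 15
    return bulk_modulus, shear_modulus


def poisson_ratio(bulk_modulus, shear_modulus):
    """Isotropic Poisson's ratio from bulk modulus K and shear modulus G."""
    return (3 * bulk_modulus - 2 * shear_modulus) / (2 * (3 * bulk_modulus + shear_modulus))


def is_auxetic(bulk_modulus, shear_modulus):
    """A material is auxetic in this averaged sense when Poisson's ratio is negative."""
    return poisson_ratio(bulk_modulus, shear_modulus) < 0


# Hypothetical isotropic stiffness matrix with C11 = 3, C12 = 1, C44 = 1 (arbitrary units).
toy_stiffness = [
    [3, 1, 1, 0, 0, 0],
    [1, 3, 1, 0, 0, 0],
    [1, 1, 3, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
]
k_voigt, g_voigt = voigt_moduli(toy_stiffness)
print(poisson_ratio(k_voigt, g_voigt))
```

The relation shows why auxetic behavior is rare: the ratio turns negative only when the shear modulus exceeds 1.5 times the bulk modulus.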

  5. Order priors for Bayesian network discovery with an application to malware phylogeny

    DOE PAGES

    Oyen, Diane; Anderson, Blake; Sentz, Kari; ...

    2017-09-15

    Here, Bayesian networks have been used extensively to model and discover dependency relationships among sets of random variables. We learn Bayesian network structure with a combination of human knowledge about the partial ordering of variables and statistical inference of conditional dependencies from observed data. Our approach leverages complementary information from human knowledge and inference from observed data to produce networks that reflect human beliefs about the system as well as to fit the observed data. Applying prior beliefs about partial orderings of variables is an approach distinctly different from existing methods that incorporate prior beliefs about direct dependencies (or edges) in a Bayesian network. We provide an efficient implementation of the partial-order prior in a Bayesian structure discovery learning algorithm, as well as an edge prior, showing that both priors meet the local modularity requirement necessary for an efficient Bayesian discovery algorithm. In benchmark studies, the partial-order prior improves the accuracy of Bayesian network structure learning as well as the edge prior, even though order priors are more general. Our primary motivation is in characterizing the evolution of families of malware to aid cyber security analysts. For the problem of malware phylogeny discovery, we find that our algorithm, compared to existing malware phylogeny algorithms, more accurately discovers true dependencies that are missed by other algorithms.

  6. Order priors for Bayesian network discovery with an application to malware phylogeny

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Oyen, Diane; Anderson, Blake; Sentz, Kari

    Here, Bayesian networks have been used extensively to model and discover dependency relationships among sets of random variables. We learn Bayesian network structure with a combination of human knowledge about the partial ordering of variables and statistical inference of conditional dependencies from observed data. Our approach leverages complementary information from human knowledge and inference from observed data to produce networks that reflect human beliefs about the system as well as to fit the observed data. Applying prior beliefs about partial orderings of variables is an approach distinctly different from existing methods that incorporate prior beliefs about direct dependencies (or edges) in a Bayesian network. We provide an efficient implementation of the partial-order prior in a Bayesian structure discovery learning algorithm, as well as an edge prior, showing that both priors meet the local modularity requirement necessary for an efficient Bayesian discovery algorithm. In benchmark studies, the partial-order prior improves the accuracy of Bayesian network structure learning as well as the edge prior, even though order priors are more general. Our primary motivation is in characterizing the evolution of families of malware to aid cyber security analysts. For the problem of malware phylogeny discovery, we find that our algorithm, compared to existing malware phylogeny algorithms, more accurately discovers true dependencies that are missed by other algorithms.

  7. TOPDOM: database of conservatively located domains and motifs in proteins.

    PubMed

    Varga, Julia; Dobson, László; Tusnády, Gábor E

    2016-09-01

    The TOPDOM database (originally created as a collection of domains and motifs located consistently on the same side of the membranes in α-helical transmembrane proteins) has been updated and extended by taking into consideration consistently localized domains and motifs in globular proteins, too. By taking advantage of the recently developed CCTOP algorithm to determine the type of a protein and predict topology in case of transmembrane proteins, and by applying a thorough search for domains and motifs as well as utilizing the most up-to-date version of all source databases, we managed to reach a 6-fold increase in the size of the whole database and a 2-fold increase in the number of transmembrane proteins. TOPDOM database is available at http://topdom.enzim.hu. The webpage utilizes the common Apache, PHP5 and MySQL software to provide the user interface for accessing and searching the database. The database itself is generated on a high performance computer. Contact: tusnady.gabor@ttk.mta.hu. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.

  8. General method to find the attractors of discrete dynamic models of biological systems.

    PubMed

    Gan, Xiao; Albert, Réka

    2018-04-01

    Analyzing the long-term behaviors (attractors) of dynamic models of biological networks can provide valuable insight. We propose a general method that can find the attractors of multilevel discrete dynamical systems by extending a method that finds the attractors of a Boolean network model. The previous method is based on finding stable motifs, subgraphs whose nodes' states can stabilize on their own. We extend the framework from binary states to any finite discrete levels by creating a virtual node for each level of a multilevel node, and describing each virtual node with a quasi-Boolean function. We then create an expanded representation of the multilevel network, find multilevel stable motifs and oscillating motifs, and identify attractors by successive network reduction. In this way, we find both fixed point attractors and complex attractors. We implemented an algorithm, which we test and validate on representative synthetic networks and on published multilevel models of biological networks. Despite its primary motivation to analyze biological networks, our motif-based method is general and can be applied to any finite discrete dynamical system.
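To illustrate what "attractors" of a discrete dynamic model are, the sketch below finds them by brute-force enumeration of the state space under synchronous updates. It is explicitly not the stable-motif and network-reduction method described above, which is designed to avoid this exhaustive search; the function name, the synchronous update scheme, and the toy two-node network are assumptions made for illustration.

```python
from itertools import product


def find_attractors(update_fns, levels):
    """Enumerate attractors of a small discrete network by exhaustive simulation.

    update_fns[i] maps the full state tuple to the next value of node i, and
    levels[i] is the number of discrete levels of node i (2 for a Boolean node).
    All nodes are updated synchronously. Each attractor is returned as a tuple
    of states, rotated so that it starts at its smallest state.
    """
    attractors = set()
    for state in product(*(range(n_levels) for n_levels in levels)):
        first_seen = {}
        path = []
        while state not in first_seen:
            first_seen[state] = len(path)
            path.append(state)
            state = tuple(fn(state) for fn in update_fns)
        cycle = path[first_seen[state]:]  # the part of the trajectory that repeats
        start = cycle.index(min(cycle))
        attractors.add(tuple(cycle[start:] + cycle[:start]))
    return attractors


def next_a(state):
    """Node A copies node B (mutual activation, toy example)."""
    return state[1]


def next_b(state):
    """Node B copies node A."""
    return state[0]


toy_attractors = find_attractors([next_a, next_b], levels=[2, 2])
print(sorted(toy_attractors))
```

The toy network has two fixed points (both nodes off, both nodes on) and one two-state oscillation, so it shows both kinds of attractor mentioned in the abstract. The cost of enumeration grows with the product of all node level counts, which is why reduction-based methods matter for realistic networks.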

  9. General method to find the attractors of discrete dynamic models of biological systems

    NASA Astrophysics Data System (ADS)

    Gan, Xiao; Albert, Réka

    2018-04-01

    Analyzing the long-term behaviors (attractors) of dynamic models of biological networks can provide valuable insight. We propose a general method that can find the attractors of multilevel discrete dynamical systems by extending a method that finds the attractors of a Boolean network model. The previous method is based on finding stable motifs, subgraphs whose nodes' states can stabilize on their own. We extend the framework from binary states to any finite discrete levels by creating a virtual node for each level of a multilevel node, and describing each virtual node with a quasi-Boolean function. We then create an expanded representation of the multilevel network, find multilevel stable motifs and oscillating motifs, and identify attractors by successive network reduction. In this way, we find both fixed point attractors and complex attractors. We implemented an algorithm, which we test and validate on representative synthetic networks and on published multilevel models of biological networks. Despite its primary motivation to analyze biological networks, our motif-based method is general and can be applied to any finite discrete dynamical system.

  10. Structural insight into RNA recognition motifs: versatile molecular Lego building blocks for biological systems.

    PubMed

    Muto, Yutaka; Yokoyama, Shigeyuki

    2012-01-01

    'RNA recognition motifs (RRMs)' are common domain-folds composed of 80-90 amino-acid residues in eukaryotes, and have been identified in many cellular proteins. At first they were known as RNA binding domains. Through discoveries over the past 20 years, however, the RRMs have been shown to exhibit versatile molecular recognition activities and to behave as molecular Lego building blocks to construct biological systems. Novel RNA/protein recognition modes by RRMs are being identified, and more information about the molecular recognition by RRMs is becoming available. These RNA/protein recognition modes are strongly correlated with their biological significance. In this review, we would like to survey the recent progress on these versatile molecular recognition modules. Copyright © 2012 John Wiley & Sons, Ltd.

  11. Detecting Statistically Significant Communities of Triangle Motifs in Undirected Networks

    DTIC Science & Technology

    2015-03-16

    moderately-sized networks. As a consequence, throughout this effort, a simulated annealing (SA) algorithm will be employed to effectively search the...then increment k by 1 and repeat the search to find z∗3. One can continue to increment k until W < zδ, at which point the algorithm will stop and...

  12. RSAT 2015: Regulatory Sequence Analysis Tools.

    PubMed

    Medina-Rivera, Alejandra; Defrance, Matthieu; Sand, Olivier; Herrmann, Carl; Castro-Mondragon, Jaime A; Delerce, Jeremy; Jaeger, Sébastien; Blanchet, Christophe; Vincens, Pierre; Caron, Christophe; Staines, Daniel M; Contreras-Moreira, Bruno; Artufel, Marie; Charbonnier-Khamvongsa, Lucie; Hernandez, Céline; Thieffry, Denis; Thomas-Chollier, Morgane; van Helden, Jacques

    2015-07-01

    RSAT (Regulatory Sequence Analysis Tools) is a modular software suite for the analysis of cis-regulatory elements in genome sequences. Its main applications are (i) motif discovery, appropriate to genome-wide data sets like ChIP-seq, (ii) transcription factor binding motif analysis (quality assessment, comparisons and clustering), (iii) comparative genomics and (iv) analysis of regulatory variations. Nine new programs have been added to the 43 described in the 2011 NAR Web Software Issue, including a tool to extract sequences from a list of coordinates (fetch-sequences from UCSC), novel programs dedicated to the analysis of regulatory variants from GWAS or population genomics (retrieve-variation-seq and variation-scan), a program to cluster motifs and visualize the similarities as trees (matrix-clustering). To deal with the drastic increase of sequenced genomes, RSAT public sites have been reorganized into taxon-specific servers. The suite is well-documented with tutorials and published protocols. The software suite is available through Web sites, SOAP/WSDL Web services, virtual machines and stand-alone programs at http://www.rsat.eu/. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.

  13. Combinatorial phenotypic screen uncovers unrecognized family of extended thiourea inhibitors with copper-dependent anti-staphylococcal activity.

    PubMed

    Dalecki, Alex G; Malalasekera, Aruni P; Schaaf, Kaitlyn; Kutsch, Olaf; Bossmann, Stefan H; Wolschendorf, Frank

    2016-04-01

    The continuous rise of multi-drug resistant pathogenic bacteria has become a significant challenge for the health care system. In particular, novel drugs to treat infections of methicillin-resistant Staphylococcus aureus strains (MRSA) are needed, but traditional drug discovery campaigns have largely failed to deliver clinically suitable antibiotics. More than simply new drugs, new drug discovery approaches are needed to combat bacterial resistance. The recently described phenomenon of copper-dependent inhibitors has galvanized research exploring the use of metal-coordinating molecules to harness copper's natural antibacterial properties for therapeutic purposes. Here, we describe the results of the first concerted screening effort to identify copper-dependent inhibitors of Staphylococcus aureus. A standard library of 10 000 compounds was assayed for anti-staphylococcal activity, with hits defined as those compounds with a strict copper-dependent inhibitory activity. A total of 53 copper-dependent hit molecules were uncovered, similar to the copper independent hit rate of a traditionally executed campaign conducted in parallel on the same library. Most prominent was a hit family with an extended thiourea core structure, termed the NNSN motif. This motif resulted in copper-dependent and copper-specific S. aureus inhibition, while simultaneously being well tolerated by eukaryotic cells. Importantly, we could demonstrate that copper binding by the NNSN motif is highly unusual and likely responsible for the promising biological qualities of these compounds. A subsequent chemoinformatic meta-analysis of the ChEMBL chemical database confirmed the NNSNs as an unrecognized staphylococcal inhibitor, despite the family's presence in many chemical screening libraries. Thus, our copper-biased screen has proven able to discover inhibitors within previously screened libraries, offering a mechanism to reinvigorate exhausted molecular collections.

  14. Development of a Schistosoma mansoni shotgun O-glycan microarray and application to the discovery of new antigenic schistosome glycan motifs.

    PubMed

    van Diepen, Angela; van der Plas, Arend-Jan; Kozak, Radoslaw P; Royle, Louise; Dunne, David W; Hokke, Cornelis H

    2015-06-01

    Upon infection with Schistosoma, antibody responses are mounted that are largely directed against glycans. Over the last few years significant progress has been made in characterising the antigenic properties of N-glycans of Schistosoma mansoni. Despite also being abundantly expressed by schistosomes, much less is understood about O-glycans and antibody responses to these have not yet been systematically analysed. Antibody binding to schistosome glycans can be analysed efficiently and quantitatively using glycan microarrays, but O-glycan array construction and exploration is lagging behind because no universal O-glycanase is available, and release of O-glycans has been dependent on chemical methods. Recently, a modified hydrazinolysis method has been developed that allows the release of O-glycans with free reducing termini and limited degradation, and we applied this method to obtain O-glycans from different S. mansoni life stages. Two-dimensional HPLC separation of 2-aminobenzoic acid-labelled O-glycans generated 362 O-glycan-containing fractions that were printed on an epoxide-modified glass slide, thereby generating the first shotgun O-glycan microarray containing naturally occurring schistosome O-glycans. Monoclonal antibodies and mass spectrometry showed that the O-glycan microarray contains well-known antigenic glycan motifs as well as numerous other, potentially novel, antibody targets. Incubations of the microarrays with sera from Schistosoma-infected humans showed substantial antibody responses to O-glycans in addition to those observed to the previously investigated N- and glycosphingolipid glycans. This underlines the importance of the inclusion of these often schistosome-specific O-glycans in glycan antigen studies and indicates that O-glycans contain novel antigenic motifs that have potential for use in diagnostic methods and studies aiming at the discovery of vaccine targets. Copyright © 2015 The Authors. Published by Elsevier Ltd. All rights reserved.

  15. ATtRACT-a database of RNA-binding proteins and associated motifs.

    PubMed

    Giudice, Girolamo; Sánchez-Cabo, Fátima; Torroja, Carlos; Lara-Pezzi, Enrique

    2016-01-01

    RNA-binding proteins (RBPs) play a crucial role in key cellular processes, including RNA transport, splicing, polyadenylation and stability. Understanding the interaction between RBPs and RNA is key to improve our knowledge of RNA processing, localization and regulation in a global manner. Despite advances in recent years, a unified non-redundant resource that includes information on experimentally validated motifs, RBPs and integrated tools to exploit this information is lacking. Here, we developed a database named ATtRACT (available at http://attract.cnic.es) that compiles information on 370 RBPs and 1583 RBP consensus binding motifs, 192 of which are not present in any other database. To populate ATtRACT we (i) extracted and hand-curated experimentally validated data from CISBP-RNA, SpliceAid-F, RBPDB databases, (ii) integrated and updated the unavailable ASD database and (iii) extracted information from Protein-RNA complexes present in Protein Data Bank database through computational analyses. ATtRACT also provides efficient algorithms to search a specific motif and scan one or more RNA sequences at a time. It also allows discovering de novo motifs enriched in a set of related sequences and comparing them with the motifs included in the database. Database URL: http://attract.cnic.es. © The Author(s) 2016. Published by Oxford University Press.

  16. Constructing a Graph Database for Semantic Literature-Based Discovery.

    PubMed

    Hristovski, Dimitar; Kastrin, Andrej; Dinevski, Dejan; Rindflesch, Thomas C

    2015-01-01

    Literature-based discovery (LBD) generates discoveries, or hypotheses, by combining what is already known in the literature. Potential discoveries have the form of relations between biomedical concepts; for example, a drug may be determined to treat a disease other than the one for which it was intended. LBD views the knowledge in a domain as a network; a set of concepts along with the relations between them. As a starting point, we used SemMedDB, a database of semantic relations between biomedical concepts extracted with SemRep from Medline. SemMedDB is distributed as a MySQL relational database, which has some problems when dealing with network data. We transformed and uploaded SemMedDB into the Neo4j graph database, and implemented the basic LBD discovery algorithms with the Cypher query language. We conclude that storing the data needed for semantic LBD is more natural in a graph database. Also, implementing LBD discovery algorithms is conceptually simpler with a graph query language when compared with standard SQL.
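
    As an illustration of the network view of LBD, here is a minimal sketch of open discovery: starting from concept A, it collects concepts C that are reachable through an intermediate concept B but are not yet directly related to A. It uses plain Python rather than Cypher or Neo4j, and the function name `open_discovery` and the toy relations are assumptions made for illustration; they are not taken from SemMedDB.

```python
from collections import defaultdict


def open_discovery(relations, start):
    """Return {C: sorted linking concepts B} for every C with A -> B -> C and no direct A -> C."""
    neighbours = defaultdict(set)
    for subject, obj in relations:
        neighbours[subject].add(obj)

    direct = set(neighbours[start])  # concepts already related to the start concept
    candidates = defaultdict(set)
    for b in direct:
        for c in neighbours[b]:
            if c != start and c not in direct:
                candidates[c].add(b)
    return {c: sorted(bs) for c, bs in candidates.items()}


if __name__ == "__main__":
    # Toy (subject, object) pairs, for illustration only.
    toy_relations = [
        ("fish oil", "blood viscosity"),
        ("fish oil", "platelet aggregation"),
        ("blood viscosity", "Raynaud disease"),
        ("platelet aggregation", "Raynaud disease"),
    ]
    print(open_discovery(toy_relations, "fish oil"))
```

    In a graph database the same question becomes a two-hop path pattern, which is the kind of query the abstract describes as conceptually simpler than the equivalent SQL joins.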

  17. Solar fuels photoanode materials discovery by integrating high-throughput theory and experiment

    DOE PAGES

    Yan, Qimin; Yu, Jie; Suram, Santosh K.; ...

    2017-03-06

    The limited number of known low-band-gap photoelectrocatalytic materials poses a significant challenge for the generation of chemical fuels from sunlight. Here, using high-throughput ab initio theory with experiments in an integrated workflow, we find eight ternary vanadate oxide photoanodes in the target band-gap range (1.2-2.8 eV). Detailed analysis of these vanadate compounds reveals the key role of VO4 structural motifs and electronic band-edge character in efficient photoanodes, initiating a genome for such materials and paving the way for a broadly applicable high-throughput-discovery and materials-by-design feedback loop. Considerably expanding the number of known photoelectrocatalysts for water oxidation, our study establishes ternary metal vanadates as a prolific class of photoanode materials for generation of chemical fuels from sunlight and demonstrates our high-throughput theory-experiment pipeline as a prolific approach to materials discovery.

  18. Solar fuels photoanode materials discovery by integrating high-throughput theory and experiment

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Yan, Qimin; Yu, Jie; Suram, Santosh K.

    The limited number of known low-band-gap photoelectrocatalytic materials poses a significant challenge for the generation of chemical fuels from sunlight. Here, using high-throughput ab initio theory with experiments in an integrated workflow, we find eight ternary vanadate oxide photoanodes in the target band-gap range (1.2-2.8 eV). Detailed analysis of these vanadate compounds reveals the key role of VO4 structural motifs and electronic band-edge character in efficient photoanodes, initiating a genome for such materials and paving the way for a broadly applicable high-throughput-discovery and materials-by-design feedback loop. Considerably expanding the number of known photoelectrocatalysts for water oxidation, our study establishes ternary metal vanadates as a prolific class of photoanode materials for generation of chemical fuels from sunlight and demonstrates our high-throughput theory-experiment pipeline as a prolific approach to materials discovery.

  19. DNA motif elucidation using belief propagation.

    PubMed

    Wong, Ka-Chun; Chan, Tak-Ming; Peng, Chengbin; Li, Yue; Zhang, Zhaolei

    2013-09-01

    Protein-binding microarray (PBM) is a high-throughput platform that can measure the DNA-binding preference of a protein in a comprehensive and unbiased manner. A typical PBM experiment can measure binding signal intensities of a protein to all the possible DNA k-mers (k=8∼10); such comprehensive binding affinity data usually need to be reduced and represented as motif models before they can be further analyzed and applied. Since proteins can often bind to DNA in multiple modes, one of the major challenges is to decompose the comprehensive affinity data into multimodal motif representations. Here, we describe a new algorithm that uses Hidden Markov Models (HMMs) and can derive precise and multimodal motifs using belief propagations. We describe an HMM-based approach using belief propagations (kmerHMM), which accepts and preprocesses PBM probe raw data into median-binding intensities of individual k-mers. The k-mers are ranked and aligned for training an HMM as the underlying motif representation. Multiple motifs are then extracted from the HMM using belief propagations. Comparisons of kmerHMM with other leading methods on several data sets demonstrated its effectiveness and uniqueness. In particular, it achieved the best performance on more than half of the data sets. In addition, the multiple binding modes derived by kmerHMM are biologically meaningful and will be useful in interpreting other genome-wide data such as those generated from ChIP-seq. The executables and source codes are available at the authors' websites: e.g. http://www.cs.toronto.edu/~wkc/kmerHMM.
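
    The preprocessing step mentioned in the abstract, reducing probe-level data to a median intensity per k-mer, can be sketched as follows. This is a simplified illustration under stated assumptions, not the kmerHMM code: the function name `kmer_median_intensities` and the `(sequence, intensity)` input format are hypothetical, and strand handling (merging a k-mer with its reverse complement) is deliberately left out.

```python
from collections import defaultdict
from statistics import median


def kmer_median_intensities(probes, k):
    """Map each k-mer to the median intensity of the probes that contain it.

    `probes` is an iterable of (sequence, intensity) pairs (assumed format).
    """
    by_kmer = defaultdict(list)
    for sequence, intensity in probes:
        # A set is used so that a k-mer repeated within one probe counts once.
        kmers = {sequence[i:i + k] for i in range(len(sequence) - k + 1)}
        for kmer in kmers:
            by_kmer[kmer].append(intensity)
    return {kmer: median(values) for kmer, values in by_kmer.items()}


if __name__ == "__main__":
    toy_probes = [("ACGT", 10.0), ("CGTA", 20.0), ("ACGA", 40.0)]  # made-up values
    ranked = sorted(kmer_median_intensities(toy_probes, 3).items(),
                    key=lambda item: item[1], reverse=True)
    print(ranked)
```

    Sorting the resulting dictionary by value gives the ranked k-mer list that the abstract describes as the input to alignment and HMM training.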

  20. High throughput light absorber discovery, Part 1: An algorithm for automated Tauc analysis

    DOE PAGES

    Suram, Santosh K.; Newhouse, Paul F.; Gregoire, John M.

    2016-09-23

    High-throughput experimentation provides efficient mapping of composition-property relationships, and its implementation for the discovery of optical materials enables advancements in solar energy and other technologies. In a high throughput pipeline, automated data processing algorithms are often required to match experimental throughput, and we present an automated Tauc analysis algorithm for estimating band gap energies from optical spectroscopy data. The algorithm mimics the judgment of an expert scientist, which is demonstrated through its application to a variety of high throughput spectroscopy data, including the identification of indirect or direct band gaps in Fe2O3, Cu2V2O7, and BiVO4. Here, the applicability of the algorithm to estimate a range of band gap energies for various materials is demonstrated by a comparison of direct-allowed band gaps estimated by expert scientists and by automated algorithm for 60 optical spectra.
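
    For readers unfamiliar with Tauc analysis, the core calculation is a straight-line fit to (αhν)^n against photon energy hν, with n = 2 for a direct-allowed gap and n = 1/2 for an indirect-allowed gap; the band gap is read off where the fitted line crosses zero. The sketch below shows only that manual step with a hand-chosen fit window. It is not the automated algorithm of the paper, whose contribution is choosing the linear region without human input; the function name `tauc_band_gap` and the synthetic spectrum are assumptions.

```python
import numpy as np


def tauc_band_gap(energy_ev, alpha, fit_window, exponent=2.0):
    """Estimate a band gap as the x-intercept of a line fitted to (alpha*E)**exponent vs E.

    exponent=2.0 corresponds to a direct-allowed transition, 0.5 to indirect-allowed.
    fit_window is a (low, high) energy range in eV, chosen by hand in this sketch.
    """
    energy_ev = np.asarray(energy_ev, dtype=float)
    tauc_y = (np.asarray(alpha, dtype=float) * energy_ev) ** exponent
    low, high = fit_window
    mask = (energy_ev >= low) & (energy_ev <= high)
    slope, intercept = np.polyfit(energy_ev[mask], tauc_y[mask], 1)
    return -intercept / slope


if __name__ == "__main__":
    # Synthetic, noise-free direct-gap spectrum with a 2.0 eV gap (made-up data).
    energy = np.linspace(1.5, 3.0, 151)
    alpha = np.sqrt(np.clip(energy - 2.0, 0.0, None)) / energy
    print(round(tauc_band_gap(energy, alpha, fit_window=(2.2, 2.8)), 3))
```

    On real spectra the result depends on where the fit window is placed, which is the judgment call an automated procedure has to reproduce.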

  1. Privacy-Preserving Relationship Path Discovery in Social Networks

    NASA Astrophysics Data System (ADS)

    Mezzour, Ghita; Perrig, Adrian; Gligor, Virgil; Papadimitratos, Panos

    As social networking sites continue to proliferate and are being used for an increasing variety of purposes, the privacy risks raised by the full access of social networking sites over user data become uncomfortable. A decentralized social network would help alleviate this problem, but offering the functionalities of social networking sites in a distributed manner is a challenging problem. In this paper, we provide techniques to instantiate one of the core functionalities of social networks: discovery of paths between individuals. Our algorithm preserves the privacy of relationship information, and can operate offline during the path discovery phase. We simulate our algorithm on real social network topologies.

  2. Problem Solving Techniques for the Design of Algorithms.

    ERIC Educational Resources Information Center

    Kant, Elaine; Newell, Allen

    1984-01-01

    Presents model of algorithm design (activity in software development) based on analysis of protocols of two subjects designing three convex hull algorithms. Automation methods, methods for studying algorithm design, role of discovery in problem solving, and comparison of different designs of case study according to model are highlighted.…

  3. A New Scheme to Characterize and Identify Protein Ubiquitination Sites.

    PubMed

    Nguyen, Van-Nui; Huang, Kai-Yao; Huang, Chien-Hsun; Lai, K Robert; Lee, Tzong-Yi

    2017-01-01

    Protein ubiquitination, involving the conjugation of ubiquitin on lysine residue, serves as an important modulator of many cellular functions in eukaryotes. Recent advancements in proteomic technology have stimulated increasing interest in identifying ubiquitination sites. However, most computational tools for predicting ubiquitination sites are focused on small-scale data. With an increasing number of experimentally verified ubiquitination sites, we were motivated to design a predictive model for identifying lysine ubiquitination sites for large-scale proteome datasets. This work assessed not only single features, such as amino acid composition (AAC), amino acid pair composition (AAPC) and evolutionary information, but also the effectiveness of incorporating two or more features into a hybrid approach to model construction. The support vector machine (SVM) was applied to generate the prediction models for ubiquitination site identification. Evaluation by five-fold cross-validation showed that the SVM models learned from the combination of hybrid features delivered a better prediction performance. Additionally, a motif discovery tool, MDDLogo, was adopted to characterize the potential substrate motifs of ubiquitination sites. The SVM models integrating the MDDLogo-identified substrate motifs could yield an average accuracy of 68.70 percent. Furthermore, the independent testing result showed that the MDDLogo-clustered SVM models could provide a promising accuracy (78.50 percent) and perform better than other prediction tools. Two cases have demonstrated the effective prediction of ubiquitination sites with corresponding substrate motifs.
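
    Of the features named in the abstract, amino acid composition (AAC) is the simplest to show: it is the fraction of each of the 20 standard residues in a sequence window around a candidate lysine. The sketch below is an illustration only; the function name `amino_acid_composition`, the example window and the choice to ignore non-standard characters are assumptions, and the abstract does not state the window length the authors used.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def amino_acid_composition(window):
    """Return a 20-element feature vector of residue frequencies for a sequence window.

    Characters outside the 20 standard residues (e.g. padding symbols) are ignored.
    """
    counts = Counter(window.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


if __name__ == "__main__":
    # Hypothetical window centred on a lysine (K); not taken from the paper's data.
    features = amino_acid_composition("MSEAKLTGRAV")
    print(dict(zip(AMINO_ACIDS, features)))
```

    A vector like this, possibly concatenated with pair-composition or evolutionary features, is the kind of input an SVM classifier would be trained on.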

  4. PhyloGibbs-MP: Module Prediction and Discriminative Motif-Finding by Gibbs Sampling

    PubMed Central

    Siddharthan, Rahul

    2008-01-01

    PhyloGibbs, our recent Gibbs-sampling motif-finder, takes phylogeny into account in detecting binding sites for transcription factors in DNA and assigns posterior probabilities to its predictions obtained by sampling the entire configuration space. Here, in an extension called PhyloGibbs-MP, we widen the scope of the program, addressing two major problems in computational regulatory genomics. First, PhyloGibbs-MP can localise predictions to small, undetermined regions of a large input sequence, thus effectively predicting cis-regulatory modules (CRMs) ab initio while simultaneously predicting binding sites in those modules—tasks that are usually done by two separate programs. PhyloGibbs-MP's performance at such ab initio CRM prediction is comparable with or superior to dedicated module-prediction software that use prior knowledge of previously characterised transcription factors. Second, PhyloGibbs-MP can predict motifs that differentiate between two (or more) different groups of regulatory regions, that is, motifs that occur preferentially in one group over the others. While other “discriminative motif-finders” have been published in the literature, PhyloGibbs-MP's implementation has some unique features and flexibility. Benchmarks on synthetic and actual genomic data show that this algorithm is successful at enhancing predictions of differentiating sites and suppressing predictions of common sites and compares with or outperforms other discriminative motif-finders on actual genomic data. Additional enhancements include significant performance and speed improvements, the ability to use “informative priors” on known transcription factors, and the ability to output annotations in a format that can be visualised with the Generic Genome Browser. In stand-alone motif-finding, PhyloGibbs-MP remains competitive, outperforming PhyloGibbs-1.0 and other programs on benchmark data. PMID:18769735

  5. Data Mining Citizen Science Results

    NASA Astrophysics Data System (ADS)

    Borne, K. D.

    2012-12-01

    Scientific discovery from big data is enabled through multiple channels, including data mining (through the application of machine learning algorithms) and human computation (commonly implemented through citizen science tasks). We will describe the results of new data mining experiments on the results from citizen science activities. Discovering patterns, trends, and anomalies in data are among the powerful contributions of citizen science. Establishing scientific algorithms that can subsequently re-discover the same types of patterns, trends, and anomalies in automatic data processing pipelines will ultimately result from the transformation of those human algorithms into computer algorithms, which can then be applied to much larger data collections. Scientific discovery from big data is thus greatly amplified through the marriage of data mining with citizen science.

  6. The rearrangement of motif F in the flavivirus RNA-directed RNA polymerase.

    PubMed

    Potapova, Ulyana; Feranchuk, Sergey; Leonova, Galina; Belikov, Sergei

    2018-03-01

    In the flavivirus genus, the non-structural protein NS5 plays a central role in RNA viral replication and constitutes a major target for drug discovery. One of the prime challenges in the study of NS5 protein is to investigate the interplay between the two protein domains, namely, the RNA-dependent RNA polymerase (RdRp) domain and the methyltransferase (MTase) domain. These investigations could clarify the multiple roles of NS5 protein in the virus life cycle. Here we present the results of sequence analyses and structural bioinformatics studies of NS5 protein, which suggest that the conserved motif F in the NS5 protein could act as a lock which controls the rearrangement of the domains and as a switch in the protein enzymatic activity. Copyright © 2017 Elsevier B.V. All rights reserved.

  7. Ankyrin-repeat containing proteins of microbes: a conserved structure with functional diversity

    PubMed Central

    Al-Khodor, Souhaila; Price, Christopher T.; Kalia, Awdhesh; Kwaik, Yousef Abu

    2009-01-01

    The ankyrin repeat (ANK) is the most common protein-protein interaction motif in nature and predominantly found in eukaryotic proteins. The genome sequencing of various pathogenic or symbiotic bacteria and eukaryotic viruses identified numerous genes encoding ANK-containing proteins that were proposed to have been acquired from eukaryotes by horizontal gene transfer. However, the recent discovery of additional ANK-containing proteins encoded in the genomes of archaea and free-living bacteria suggests either a more ancient origin of the ANK motif or multiple convergent evolution events. Many bacterial pathogens employ various types of secretion systems to deliver ANK-containing proteins into eukaryotic cells where they mimic or manipulate various host functions. Understanding the molecular and biochemical functions of this family of proteins will enhance our understanding of important host-microbe interactions. PMID:19962898

  8. Computational Discovery of Materials Using the Firefly Algorithm

    NASA Astrophysics Data System (ADS)

    Avendaño-Franco, Guillermo; Romero, Aldo

    Our current ability to model physical phenomena accurately, the increase in computational power and better algorithms are the driving forces behind the computational discovery and design of novel materials, allowing for virtual characterization before their realization in the laboratory. We present the implementation of a novel firefly algorithm, a population-based algorithm for global optimization, for searching the structure/composition space. This novel computation-intensive approach naturally takes advantage of concurrency and targeted exploration while still keeping enough diversity. We apply the new method to both periodic and non-periodic structures and we present the implementation challenges and solutions to improve efficiency. The implementation makes use of computational materials databases and network analysis to optimize the search and get insights about the geometric structure of local minima on the energy landscape. The method has been implemented in our software PyChemia, an open-source package for materials discovery. We acknowledge the support of DMREF-NSF 1434897 and the Donors of the American Chemical Society Petroleum Research Fund for partial support of this research under Contract 54075-ND10.

  9. Improved prediction of MHC class I and class II epitopes using a novel Gibbs sampling approach.

    PubMed

    Nielsen, Morten; Lundegaard, Claus; Worning, Peder; Hvid, Christina Sylvester; Lamberth, Kasper; Buus, Søren; Brunak, Søren; Lund, Ole

    2004-06-12

    Prediction of which peptides will bind a specific major histocompatibility complex (MHC) constitutes an important step in identifying potential T-cell epitopes suitable as vaccine candidates. MHC class II binding peptides have a broad length distribution complicating such predictions. Thus, identifying the correct alignment is a crucial part of identifying the core of an MHC class II binding motif. In this context, we wish to describe a novel Gibbs motif sampler method ideally suited for recognizing such weak sequence motifs. The method is based on the Gibbs sampling method, and it incorporates novel features optimized for the task of recognizing the binding motif of MHC classes I and II. The method locates the binding motif in a set of sequences and characterizes the motif in terms of a weight-matrix. Subsequently, the weight-matrix can be applied to effectively identifying potential MHC binding peptides and to guiding the process of rational vaccine design. We apply the motif sampler method to the complex problem of MHC class II binding. The input to the method is amino acid peptide sequences extracted from the public databases of SYFPEITHI and MHCPEP and known to bind to the MHC class II complex HLA-DR4(B1*0401). Prior identification of information-rich (anchor) positions in the binding motif is shown to improve the predictive performance of the Gibbs sampler. Similarly, a consensus solution obtained from an ensemble average over suboptimal solutions is shown to outperform the use of a single optimal solution. In a large-scale benchmark calculation, the performance is quantified using relative operating characteristics curve (ROC) plots and we make a detailed comparison of the performance with that of both the TEPITOPE method and a weight-matrix derived using the conventional alignment algorithm of ClustalW. The calculation demonstrates that the predictive performance of the Gibbs sampler is higher than that of ClustalW and in most cases also higher than that of the TEPITOPE method.
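
    The abstract describes two stages: a Gibbs sampler that aligns the binding cores, and a weight-matrix that is then used to score new peptides. The sketch below illustrates only the second stage, assuming the aligned cores are already given. It uses a plain log-odds matrix with a uniform background and a flat pseudocount, which is a simplification and not the weighting scheme of the paper; the function names and the three-residue toy cores are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_weight_matrix(cores, pseudocount=1.0):
    """Build a log2-odds weight matrix from equal-length aligned cores.

    Assumes a uniform background frequency of 1/20 per residue (a simplification).
    """
    background = 1.0 / len(AMINO_ACIDS)
    denominator = len(cores) + pseudocount * len(AMINO_ACIDS)
    matrix = []
    for position in range(len(cores[0])):
        column = [core[position] for core in cores]
        matrix.append({
            aa: math.log2(((column.count(aa) + pseudocount) / denominator) / background)
            for aa in AMINO_ACIDS
        })
    return matrix


def best_core(peptide, matrix):
    """Slide the matrix along the peptide and return (best score, start of best window)."""
    width = len(matrix)
    scored = [
        (sum(matrix[i][peptide[start + i]] for i in range(width)), start)
        for start in range(len(peptide) - width + 1)
    ]
    return max(scored)


if __name__ == "__main__":
    toy_cores = ["FAK", "FAK", "YAK"]  # made-up aligned cores, far shorter than real ones
    weight_matrix = build_weight_matrix(toy_cores)
    print(best_core("GGFAKGG", weight_matrix))
```

    Scanning every window matters for class II peptides because, as the abstract notes, their lengths vary and the position of the core within each peptide is not known in advance.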

  10. Automated phase mapping with AgileFD and its application to light absorber discovery in the V–Mn–Nb oxide system

    DOE PAGES

    Suram, Santosh K.; Xue, Yexiang; Bai, Junwen; ...

    2016-11-21

    Rapid construction of phase diagrams is a central tenet of combinatorial materials science with accelerated materials discovery efforts often hampered by challenges in interpreting combinatorial X-ray diffraction data sets, which we address by developing AgileFD, an artificial intelligence algorithm that enables rapid phase mapping from a combinatorial library of X-ray diffraction patterns. AgileFD models alloying-based peak shifting through a novel expansion of convolutional nonnegative matrix factorization, which not only improves the identification of constituent phases but also maps their concentration and lattice parameter as a function of composition. By incorporating Gibbs’ phase rule into the algorithm, physically meaningful phase maps are obtained with unsupervised operation, and more refined solutions are attained by injecting expert knowledge of the system. The algorithm is demonstrated through investigation of the V–Mn–Nb oxide system where decomposition of eight oxide phases, including two with substantial alloying, provides the first phase map for this pseudoternary system. This phase map enables interpretation of high-throughput band gap data, leading to the discovery of new solar light absorbers and the alloying-based tuning of the direct-allowed band gap energy of MnV2O6. Lastly, the open-source family of AgileFD algorithms can be implemented into a broad range of high throughput workflows to accelerate materials discovery.

  11. Automated phase mapping with AgileFD and its application to light absorber discovery in the V–Mn–Nb oxide system

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Suram, Santosh K.; Xue, Yexiang; Bai, Junwen

    Rapid construction of phase diagrams is a central tenet of combinatorial materials science with accelerated materials discovery efforts often hampered by challenges in interpreting combinatorial X-ray diffraction data sets, which we address by developing AgileFD, an artificial intelligence algorithm that enables rapid phase mapping from a combinatorial library of X-ray diffraction patterns. AgileFD models alloying-based peak shifting through a novel expansion of convolutional nonnegative matrix factorization, which not only improves the identification of constituent phases but also maps their concentration and lattice parameter as a function of composition. By incorporating Gibbs’ phase rule into the algorithm, physically meaningful phase maps are obtained with unsupervised operation, and more refined solutions are attained by injecting expert knowledge of the system. The algorithm is demonstrated through investigation of the V–Mn–Nb oxide system where decomposition of eight oxide phases, including two with substantial alloying, provides the first phase map for this pseudoternary system. This phase map enables interpretation of high-throughput band gap data, leading to the discovery of new solar light absorbers and the alloying-based tuning of the direct-allowed band gap energy of MnV2O6. Lastly, the open-source family of AgileFD algorithms can be implemented into a broad range of high throughput workflows to accelerate materials discovery.

  12. Compact cancer biomarkers discovery using a swarm intelligence feature selection algorithm.

    PubMed

    Martinez, Emmanuel; Alvarez, Mario Moises; Trevino, Victor

    2010-08-01

    Biomarker discovery is a typical application from functional genomics. Due to the large number of genes studied simultaneously in microarray data, feature selection is a key step. Swarm intelligence has emerged as a solution for the feature selection problem. However, swarm intelligence settings for feature selection fail to select small feature subsets. We have proposed a swarm intelligence feature selection algorithm based on the initialization and update of only a subset of particles in the swarm. In this study, we tested our algorithm on 11 microarray datasets for brain, leukemia, lung, prostate, and others. We show that the proposed swarm intelligence algorithm successfully increases the classification accuracy and decreases the number of selected features compared to other swarm intelligence methods. Copyright © 2010 Elsevier Ltd. All rights reserved.

  13. Identification of early zygotic genes in the yellow fever mosquito Aedes aegypti and discovery of a motif involved in early zygotic genome activation.

    PubMed

    Biedler, James K; Hu, Wanqi; Tae, Hongseok; Tu, Zhijian

    2012-01-01

    During early embryogenesis the zygotic genome is transcriptionally silent and all mRNAs present are of maternal origin. The maternal-zygotic transition marks the time over which embryogenesis changes its dependence from maternal RNAs to zygotically transcribed RNAs. Here we present the first systematic investigation of early zygotic genes (EZGs) in a mosquito species and focus on genes involved in the onset of transcription during 2-4 hr. We used transcriptome sequencing to identify the "pure" (without maternal expression) EZGs by analyzing transcripts from four embryonic time ranges of 0-2, 2-4, 4-8, and 8-12 hr, which includes the time of cellular blastoderm formation and up to the start of gastrulation. Blast of 16,789 annotated transcripts vs. the transcriptome reads revealed evidence for 63 (P<0.001) and 143 (P<0.05) nonmaternally derived transcripts having a significant increase in expression at 2-4 hr. One third of the 63 EZG transcripts do not have predicted introns compared to 10% of all Ae. aegypti genes. We have confirmed by RT-PCR that zygotic transcription starts as early as 2-3 hours. A degenerate motif VBRGGTA was found to be overrepresented in the upstream sequences of the identified EZGs using a motif identification software called SCOPE. We find evidence for homology between this motif and the TAGteam motif found in Drosophila that has been implicated in EZG activation. A 38 bp sequence in the proximal upstream sequence of a kinesin light chain EZG (KLC2.1) contains two copies of the mosquito motif. This sequence was shown to support EZG transcription by luciferase reporter assays performed on injected early embryos, and confers early zygotic activity to a heterologous promoter from a divergent mosquito species. The results of these studies are consistent with the model of early zygotic genome activation via transcriptional activators, similar to what has been found recently in Drosophila.
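
    The degenerate motif VBRGGTA in the abstract above is written in IUPAC nucleotide ambiguity codes (V = A/C/G, B = C/G/T, R = A/G). A minimal sketch of how such a motif can be expanded into a regular expression and located in a sequence follows. It is not the SCOPE software mentioned in the abstract; the function names and the upstream fragment are invented for illustration.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}

def motif_to_regex(motif):
    """Translate an IUPAC motif into a regex built from character classes."""
    return "".join(IUPAC[base] for base in motif.upper())

def find_motif(motif, sequence):
    """Return 0-based start positions of all matches, including overlapping ones."""
    # The lookahead is zero-width, so the scanner advances one base at a time.
    pattern = re.compile("(?=" + motif_to_regex(motif) + ")")
    return [m.start() for m in pattern.finditer(sequence.upper())]

# Hypothetical upstream fragment, made up for this example only.
upstream = "TTACAGGTAGGCGGGTATT"
print(motif_to_regex("VBRGGTA"))        # [ACG][CGT][AG]GGTA
print(find_motif("VBRGGTA", upstream))  # [2, 10]
```

    Counting such hits in the upstream regions of candidate genes and in a background set is the raw input an overrepresentation test would need; the statistics themselves are outside this sketch.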

  14. A framework for interval-valued information system

    NASA Astrophysics Data System (ADS)

    Yin, Yunfei; Gong, Guanghong; Han, Liang

    2012-09-01

    An interval-valued information system is used to transform a conventional dataset into interval-valued form. To conduct interval-valued data mining, we carry out two investigations: (1) constructing the interval-valued information system, and (2) conducting interval-valued knowledge discovery. In constructing the interval-valued information system, we first discover the paired attributes in the database, then store them in neighbouring locations in a common database and regard them as 'one' new field. In conducting the interval-valued knowledge discovery, we utilise some related prior knowledge, regard it as the control objectives, and design an approximate closed-loop control mining system. On the implemented experimental platform (prototype), we conduct the corresponding experiments and compare the proposed algorithms with several typical algorithms, such as the Apriori algorithm, the FP-growth algorithm and the CLOSE+ algorithm. The experimental results show that the interval-valued information system method is more effective than the conventional algorithms in discovering interval-valued patterns.

  15. Motif-Synchronization: A new method for analysis of dynamic brain networks with EEG

    NASA Astrophysics Data System (ADS)

    Rosário, R. S.; Cardoso, P. T.; Muñoz, M. A.; Montoya, P.; Miranda, J. G. V.

    2015-12-01

    The major aim of this work was to propose a new association method known as Motif-Synchronization. This method was developed to provide information about the synchronization degree and direction between two nodes of a network by counting the number of occurrences of some patterns between any two time series. The second objective of this work was to present a new methodology for the analysis of dynamic brain networks, by combining the Time-Varying Graph (TVG) method with a directional association method. We further applied the new algorithms to a set of human electroencephalogram (EEG) signals to perform a dynamic analysis of the brain functional networks (BFN).

  16. Using an improved association rules mining optimization algorithm in web-based mobile-learning system

    NASA Astrophysics Data System (ADS)

    Huang, Yin; Chen, Jianhua; Xiong, Shaojun

    2009-07-01

    Mobile-Learning (M-learning) gives many learners the advantages of both traditional learning and E-learning. Currently, Web-based Mobile-Learning Systems have created many new ways and defined new relationships between educators and learners. Association rule mining is one of the most important fields in data mining and knowledge discovery in databases. Rule explosion is a serious problem which causes great concern, as conventional mining algorithms often produce too many rules for decision makers to digest. Since a Web-based Mobile-Learning System collects vast amounts of student profile data, data mining and knowledge discovery techniques can be applied to find interesting relationships between attributes of learners, assessments, the solution strategies adopted by learners and so on. Therefore, this paper focuses on a new data-mining algorithm, combining the advantages of the genetic algorithm and the simulated annealing algorithm, called ARGSA (Association rules based on an improved Genetic Simulated Annealing Algorithm), to mine the association rules. This paper first takes advantage of the Parallel Genetic Algorithm and Simulated Annealing Algorithm designed specifically for discovering association rules. Moreover, analysis and experiments show that the proposed method is superior to the Apriori algorithm in this Mobile-Learning system.
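
    Association rules are usually scored by support (how often the items occur together) and confidence (how often the consequent holds when the antecedent does), and thresholds on both are the basic guard against rule explosion. The sketch below enumerates single-item rules by brute force to show the two measures. It is neither ARGSA nor Apriori, and the learner records and item names are hypothetical.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def mine_rules(transactions, min_support, min_confidence):
    """Enumerate rules lhs -> rhs between single items that pass both thresholds."""
    items = sorted(set().union(*transactions))
    found = []
    for a, b in combinations(items, 2):
        for lhs, rhs in ((a, b), (b, a)):
            s_both = support({lhs, rhs}, transactions)
            s_lhs = support({lhs}, transactions)
            if s_lhs == 0:
                continue
            confidence = s_both / s_lhs
            if s_both >= min_support and confidence >= min_confidence:
                found.append((lhs, rhs, s_both, confidence))
    return found

# Hypothetical learner records: each set lists what one learner did.
records = [
    {"quiz_pass", "video"},
    {"quiz_pass", "video", "forum"},
    {"video"},
    {"forum"},
]
print(mine_rules(records, min_support=0.5, min_confidence=0.8))
# [('quiz_pass', 'video', 0.5, 1.0)]
```

    Lowering either threshold admits more rules, which is the explosion the abstract refers to; heuristic searches such as genetic or annealing methods explore the rule space without enumerating it exhaustively.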

  17. Motif-Based Text Mining of Microbial Metagenome Redundancy Profiling Data for Disease Classification.

    PubMed

    Wang, Yin; Li, Rudong; Zhou, Yuhua; Ling, Zongxin; Guo, Xiaokui; Xie, Lu; Liu, Lei

    2016-01-01

    Text data of 16S rRNA are informative for classifications of microbiota-associated diseases. However, the raw text data need to be systematically processed so that features for classification can be defined/extracted; moreover, the high-dimension feature spaces generated by the text data also pose an additional difficulty. Here we present a Phylogenetic Tree-Based Motif Finding algorithm (PMF) to analyze 16S rRNA text data. By integrating phylogenetic rules and other statistical indexes for classification, we can effectively reduce the dimension of the large feature spaces generated by the text datasets. Using the retrieved motifs in combination with common classification methods, we can discriminate different samples of both pneumonia and dental caries better than other existing methods. We extend the phylogenetic approaches to perform supervised learning on microbiota text data to discriminate the pathological states for pneumonia and dental caries. The results have shown that PMF may enhance the efficiency and reliability in analyzing high-dimension text data.

  18. CLIP-seq analysis of multi-mapped reads discovers novel functional RNA regulatory sites in the human transcriptome.

    PubMed

    Zhang, Zijun; Xing, Yi

    2017-09-19

    Crosslinking or RNA immunoprecipitation followed by sequencing (CLIP-seq or RIP-seq) allows transcriptome-wide discovery of RNA regulatory sites. As CLIP-seq/RIP-seq reads are short, existing computational tools focus on uniquely mapped reads, while reads mapped to multiple loci are discarded. We present CLAM (CLIP-seq Analysis of Multi-mapped reads). CLAM uses an expectation-maximization algorithm to assign multi-mapped reads and calls peaks combining uniquely and multi-mapped reads. To demonstrate the utility of CLAM, we applied it to a wide range of public CLIP-seq/RIP-seq datasets involving numerous splicing factors, microRNAs and m6A RNA methylation. CLAM recovered a large number of novel RNA regulatory sites inaccessible by uniquely mapped reads. The functional significance of these sites was demonstrated by consensus motif patterns and association with alternative splicing (splicing factors), transcript abundance (AGO2) and mRNA half-life (m6A). CLAM provides a useful tool to discover novel protein-RNA interactions and RNA modification sites from CLIP-seq and RIP-seq data, and reveals the significant contribution of repetitive elements to the RNA regulatory landscape of the human transcriptome. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
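
    The expectation-maximization idea in the abstract above can be illustrated with a toy model: each multi-mapped read is split across its candidate loci in proportion to the current abundance estimates, and the abundances are then re-estimated from unique plus fractional counts. This is a generic sketch under that assumption, not CLAM's actual model; the function name, locus labels and counts are invented.

```python
def em_assign(unique_counts, multi_reads, n_iter=50):
    """Fractionally assign multi-mapped reads with a simple EM loop.

    unique_counts: dict mapping locus -> number of uniquely mapped reads.
    multi_reads:   list of candidate-locus lists, one per multi-mapped read.
    Returns (abundance, assignments), where assignments[i] maps each
    candidate locus of read i to its fractional weight.
    """
    loci = set(unique_counts)
    for candidates in multi_reads:
        loci.update(candidates)
    # Start from a uniform abundance over all loci.
    abundance = {locus: 1.0 / len(loci) for locus in loci}
    assignments = []
    for _ in range(n_iter):
        # E-step: split each read in proportion to current abundances.
        assignments = []
        for candidates in multi_reads:
            total = sum(abundance[locus] for locus in candidates)
            assignments.append({locus: abundance[locus] / total for locus in candidates})
        # M-step: abundance = unique counts + fractional counts, normalized.
        counts = {locus: float(unique_counts.get(locus, 0)) for locus in loci}
        for weights in assignments:
            for locus, w in weights.items():
                counts[locus] += w
        grand_total = sum(counts.values())
        abundance = {locus: c / grand_total for locus, c in counts.items()}
    return abundance, assignments

# Toy input: locus A has 9 unique reads, B has 1, and one read maps to both.
abundance, assignments = em_assign({"A": 9, "B": 1}, [["A", "B"]])
print(round(assignments[0]["A"], 3))  # 0.9
```

    In this toy case the shared read ends up weighted 0.9 toward A, the fixed point of the update w = (9 + w) / 11.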

  19. The center for causal discovery of biomedical knowledge from big data

    PubMed Central

    Bahar, Ivet; Becich, Michael J; Benos, Panayiotis V; Berg, Jeremy; Espino, Jeremy U; Glymour, Clark; Jacobson, Rebecca Crowley; Kienholz, Michelle; Lee, Adrian V; Lu, Xinghua; Scheines, Richard

    2015-01-01

    The Big Data to Knowledge (BD2K) Center for Causal Discovery is developing and disseminating an integrated set of open source tools that support causal modeling and discovery of biomedical knowledge from large and complex biomedical datasets. The Center integrates teams of biomedical and data scientists focused on the refinement of existing and the development of new constraint-based and Bayesian algorithms based on causal Bayesian networks, the optimization of software for efficient operation in a supercomputing environment, and the testing of algorithms and software developed using real data from 3 representative driving biomedical projects: cancer driver mutations, lung disease, and the functional connectome of the human brain. Associated training activities provide both biomedical and data scientists with the knowledge and skills needed to apply and extend these tools. Collaborative activities with the BD2K Consortium further advance causal discovery tools and integrate tools and resources developed by other centers. PMID:26138794

  20. In silico discovery of metal-organic frameworks for precombustion CO2 capture using a genetic algorithm

    PubMed Central

    Chung, Yongchul G.; Gómez-Gualdrón, Diego A.; Li, Peng; Leperi, Karson T.; Deria, Pravas; Zhang, Hongda; Vermeulen, Nicolaas A.; Stoddart, J. Fraser; You, Fengqi; Hupp, Joseph T.; Farha, Omar K.; Snurr, Randall Q.

    2016-01-01

    Discovery of new adsorbent materials with a high CO2 working capacity could help reduce CO2 emissions from newly commissioned power plants using precombustion carbon capture. High-throughput computational screening efforts can accelerate the discovery of new adsorbents but sometimes require significant computational resources to explore the large space of possible materials. We report the in silico discovery of high-performing adsorbents for precombustion CO2 capture by applying a genetic algorithm to efficiently search a large database of metal-organic frameworks (MOFs) for top candidates. High-performing MOFs identified from the in silico search were synthesized and activated and show a high CO2 working capacity and a high CO2/H2 selectivity. One of the synthesized MOFs shows a higher CO2 working capacity than any MOF reported in the literature under the operating conditions investigated here. PMID:27757420

  1. Algorithms for Discovery of Multiple Markov Boundaries

    PubMed Central

    Statnikov, Alexander; Lytkin, Nikita I.; Lemeire, Jan; Aliferis, Constantin F.

    2013-01-01

    Algorithms for Markov boundary discovery from data constitute an important recent development in machine learning, primarily because they offer a principled solution to the variable/feature selection problem and give insight on local causal structure. Over the last decade many sound algorithms have been proposed to identify a single Markov boundary of the response variable. Even though faithful distributions and, more broadly, distributions that satisfy the intersection property always have a single Markov boundary, other distributions/data sets may have multiple Markov boundaries of the response variable. The latter distributions/data sets are common in practical data-analytic applications, and there are several reasons why it is important to induce multiple Markov boundaries from such data. However, there are currently no sound and efficient algorithms that can accomplish this task. This paper describes a family of algorithms TIE* that can discover all Markov boundaries in a distribution. The broad applicability as well as efficiency of the new algorithmic family is demonstrated in an extensive benchmarking study that involved comparison with 26 state-of-the-art algorithms/variants in 15 data sets from a diversity of application domains. PMID:25285052

  2. Study on online community user motif using web usage mining

    NASA Astrophysics Data System (ADS)

    Alphy, Meera; Sharma, Ajay

    2016-04-01

    Web usage mining is an application of data mining used to extract useful information from the online community. On Thursday, 6 August 2015, the World Wide Web contained at least 4.73 billion pages according to the Indexed Web and at least 228.52 million pages according to the Dutch Indexed Web. It is difficult to get needed data from these billions of web pages, and this is where web usage mining becomes important. Personalizing the search engine helps the web user to identify the most used data in an easy way. It reduces time consumption and enables automatic site search and automatic restoration of useful sites. This study reviews the techniques, from the oldest to the latest, used in pattern discovery and analysis in web usage mining from 1996 to 2015. Analyzing user motifs helps in the improvement of business, e-commerce, personalisation and websites.

  3. G = MAT: linking transcription factor expression and DNA binding data.

    PubMed

    Tretyakov, Konstantin; Laur, Sven; Vilo, Jaak

    2011-01-31

    Transcription factors are proteins that bind to motifs on the DNA and thus affect gene expression regulation. The qualitative description of the corresponding processes is therefore important for a better understanding of essential biological mechanisms. However, wet lab experiments targeted at the discovery of the regulatory interplay between transcription factors and binding sites are expensive. We propose a new, purely computational method for finding putative associations between transcription factors and motifs. This method is based on a linear model that combines sequence information with expression data. We present various methods for model parameter estimation and show, via experiments on simulated data, that these methods are reliable. Finally, we examine the performance of this model on biological data and conclude that it can indeed be used to discover meaningful associations. The developed software is available as a web tool and Scilab source code at http://biit.cs.ut.ee/gmat/.

  4. G = MAT: Linking Transcription Factor Expression and DNA Binding Data

    PubMed Central

    Tretyakov, Konstantin; Laur, Sven; Vilo, Jaak

    2011-01-01

    Transcription factors are proteins that bind to motifs on the DNA and thus affect gene expression regulation. The qualitative description of the corresponding processes is therefore important for a better understanding of essential biological mechanisms. However, wet lab experiments targeted at the discovery of the regulatory interplay between transcription factors and binding sites are expensive. We propose a new, purely computational method for finding putative associations between transcription factors and motifs. This method is based on a linear model that combines sequence information with expression data. We present various methods for model parameter estimation and show, via experiments on simulated data, that these methods are reliable. Finally, we examine the performance of this model on biological data and conclude that it can indeed be used to discover meaningful associations. The developed software is available as a web tool and Scilab source code at http://biit.cs.ut.ee/gmat/. PMID:21297945

  5. Discovery and characterization of new O-methyltransferase from the genome of the lignin-degrading fungus Phanerochaete chrysosporium for enhanced lignin degradation.

    PubMed

    Thanh Mai Pham, Le; Kim, Yong Hwan

    2016-01-01

    Using bioinformatic homology search tools, this study utilized sequence phylogeny, gene organization and conserved motifs to identify members of the family of O-methyltransferases from the lignin-degrading fungus Phanerochaete chrysosporium. The heterologous expression and characterization of O-methyltransferases from P. chrysosporium were studied. The expressed protein utilized S-(5'-adenosyl)-L-methionine p-toluenesulfonate salt (SAM) and methylated various free-hydroxyl phenolic compounds at both meta and para sites. In the same motif, O-methyltransferases were also identified in other white-rot fungi including Bjerkandera adusta, Ceriporiopsis (Gelatoporia) subvermispora B, and Trametes versicolor. As free-hydroxyl phenolic compounds have been known as inhibitors for lignin peroxidase, the presence of O-methyltransferases in white-rot fungi suggested their biological functions in accelerating lignin degradation in white-rot basidiomycetes by converting those inhibitory groups into non-toxic methylated phenolic ones. Copyright © 2015 Elsevier Inc. All rights reserved.

  6. Discovery of novel antimicrobial peptides with unusual cysteine motifs in dandelion Taraxacum officinale Wigg. flowers.

    PubMed

    Astafieva, A A; Rogozhin, E A; Odintsova, T I; Khadeeva, N V; Grishin, E V; Egorov, Ts A

    2012-08-01

    Three novel antimicrobial peptides designated ToAMP1, ToAMP2 and ToAMP3 were purified from Taraxacum officinale flowers. Their amino acid sequences were determined. The peptides are cationic and cysteine-rich and consist of 38, 44 and 42 amino acid residues for ToAMP1, ToAMP2 and ToAMP3, respectively. Importantly, according to cysteine motifs, the peptides are representatives of two novel previously unknown families of plant antimicrobial peptides. ToAMP1 and ToAMP2 share high sequence identity and belong to 6-Cys-containing antimicrobial peptides, while ToAMP3 is a member of a distinct 8-Cys family. The peptides were shown to display high antimicrobial activity both against fungal and bacterial pathogens, and therefore represent new promising molecules for biotechnological and medicinal applications. Crown Copyright © 2012. Published by Elsevier Inc. All rights reserved.

  7. New partner proteins containing novel internal recognition motif for human Glutaminase Interacting Protein (hGIP)

    PubMed Central

    Zencir, Sevil; Banerjee, Monimoy; Dobson, Melanie J.; Ayaydin, Ferhan; Fodor, Elfrieda Ayaydin; Topcu, Zeki; Mohanty, Smita

    2013-01-01

    Regulation of gene expression in cells is mediated by protein-protein, DNA-protein and receptor-ligand interactions. PDZ (PSD-95/Discs-large/ZO-1) domains are protein–protein interaction modules. PDZ-containing proteins function in the organization of multi-protein complexes controlling spatial and temporal fidelity of intracellular signaling pathways. In general, PDZ proteins possess multiple domains facilitating distinct interactions. The human Glutaminase Interacting Protein (hGIP) is an unusual PDZ protein composed entirely of a single PDZ domain and plays pivotal roles in many cellular processes through its interaction with the C-terminus of partner proteins. Here, we report the identification by yeast two-hybrid screening of two new hGIP-interacting partners, DTX1 and STAU1. Both proteins lack the typical C-terminal PDZ recognition motif but contain a novel internal hGIP recognition motif recently identified in a phage display library screen. Fluorescence resonance energy transfer and confocal microscopy analysis confirmed the in vivo association of hGIP with DTX1 and STAU1 in mammalian cells validating the previous discovery of S/T-X-V/L-D as a consensus internal motif for hGIP recognition. Similar to hGIP, DTX1 and STAU1 have been implicated in neuronal function. Identification of these new interacting partners furthers our understanding of GIP-regulated signaling cascades and these interactions may represent potential new drug targets in humans. PMID:23395680
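
    The consensus S/T-X-V/L-D reads as: serine or threonine, any residue, valine or leucine, then aspartate. A short sketch of scanning a protein sequence for internal occurrences of that pattern is given here. The peptide is invented and is not taken from DTX1 or STAU1, and a pattern hit alone says nothing about actual binding.

```python
import re

# S/T - X - V/L - D, with X standing for any amino acid (one-letter codes).
# The lookahead keeps the scan position-by-position so overlaps are not skipped.
INTERNAL_MOTIF = re.compile(r"(?=([ST][A-Z][VL]D))")

def scan_protein(sequence):
    """Return (0-based start, matched tetrapeptide) for every motif hit."""
    return [(m.start(), m.group(1)) for m in INTERNAL_MOTIF.finditer(sequence.upper())]

# Hypothetical peptide used only to exercise the pattern.
print(scan_protein("MKSAVDGGTPLDAA"))  # [(2, 'SAVD'), (8, 'TPLD')]
```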

  8. Open reading frames associated with cancer in the dark matter of the human genome.

    PubMed

    Delgado, Ana Paula; Brandao, Pamela; Chapado, Maria Julia; Hamid, Sheilin; Narayanan, Ramaswamy

    2014-01-01

    The uncharacterized proteins (open reading frames, ORFs) in the human genome offer an opportunity to discover novel targets for cancer. A systematic analysis of the dark matter of the human proteome for druggability and biomarker discovery is crucial to mining the genome. Numerous data mining tools are available to mine these ORFs to develop a comprehensive knowledge base for future target discovery and validation. Using the Genetic Association Database, the ORFs of the human dark matter proteome were screened for evidence of association with neoplasms. The Phenome-Genome Integrator tool was used to establish phenotypic association with disease traits including cancer. Batch analysis of the tools for protein expression analysis, gene ontology and motifs and domains was used to characterize the ORFs. Sixty-two ORFs were identified for neoplasm association. The expression Quantitative Trait Loci (eQTL) analysis identified thirteen ORFs related to cancer traits. Protein expression, motifs and domain analysis and genome-wide association studies verified the relevance of these OncoORFs in diverse tumors. The OncoORFs are also associated with a wide variety of human diseases and disorders. Our results link the OncoORFs to diverse diseases and disorders. This suggests a complex landscape of the uncharacterized proteome in human diseases. These results open the dark matter of the proteome to novel cancer target research. Copyright© 2014, International Institute of Anticancer Research (Dr. John G. Delinasios), All rights reserved.

  9. A cell-based MHC stabilization assay for the detection of peptide binding to the canine classical class I molecule, DLA-88.

    PubMed

    Ross, Peter; Holmes, Jennifer C; Gojanovich, Gregory S; Hess, Paul R

    2012-12-15

    Identifying immunodominant CTL epitopes is essential for studying CD8+ T-cell responses in populations, but remains difficult, as peptides within antigens typically are too numerous for all to be synthesized and screened. Instead, to facilitate discovery, in silico scanning of proteins for sequences that match the motif, or binding preferences, of the restricting MHC class I allele - the largest determinant of immunodominance - can be used to predict likely candidates. The high false positive rate with this analysis ideally requires binding confirmation, which is obtained routinely by an assay using cell lines such as RMA-S that have defective transporter associated with antigen processing (TAP) machinery, and consequently, few surface class I molecules. The stabilization and resultant increased life-span of peptide-MHC complexes on the cell surface by the addition of true binders validates their identity. To determine whether a similar assay could be developed for dogs, we transfected a prevalent class I allele, DLA-88*50801, into RMA-S. In the BARC3 clone, the recombinant heavy chain was associated with murine β2-microglobulin, and importantly, could differentiate motif-matched and -mismatched peptides by surface MHC stabilization. This work demonstrates the potential to use RMA-S cells transfected with canine alleles as a tool for CTL epitope discovery in this species. Copyright © 2012 Elsevier B.V. All rights reserved.

  10. The utility and limitations of current web-available algorithms to predict peptides recognized by CD4 T cells in response to pathogen infection

    PubMed Central

    Chaves, Francisco A.; Lee, Alvin H.; Nayak, Jennifer; Richards, Katherine A.; Sant, Andrea J.

    2012-01-01

    The ability to track CD4 T cells elicited in response to pathogen infection or vaccination is critical because of the role these cells play in protective immunity. Coupled with advances in genome sequencing of pathogenic organisms, there is considerable appeal for implementation of computer-based algorithms to predict peptides that bind to the class II molecules, forming the complex recognized by CD4 T cells. Despite recent progress in this area, there is a paucity of data regarding their success in identifying actual pathogen-derived epitopes. In this study, we sought to rigorously evaluate the performance of multiple web-available algorithms by comparing their predictions and our results using purely empirical methods for epitope discovery in influenza that utilized overlapping peptides and cytokine Elispots, for three independent class II molecules. We analyzed the data in different ways, trying to anticipate how an investigator might use these computational tools for epitope discovery. We come to the conclusion that currently available algorithms can indeed facilitate epitope discovery, but all shared a high degree of false positive and false negative predictions. Therefore, efficiencies were low. We also found dramatic disparities among algorithms and between predicted IC50 values and true dissociation rates of peptide:MHC class II complexes. We suggest that improved success of predictive algorithms will depend less on changes in computational methods or increased data sets and more on changes in parameters used to “train” the algorithms that factor in elements of T cell repertoire and peptide acquisition by class II molecules. PMID:22467652

  11. The Roles of Water in the Protein Matrix: A Largely Untapped Resource for Drug Discovery.

    PubMed

    Spyrakis, Francesca; Ahmed, Mostafa H; Bayden, Alexander S; Cozzini, Pietro; Mozzarelli, Andrea; Kellogg, Glen E

    2017-08-24

    The value of thoroughly understanding the thermodynamics specific to a drug discovery/design study is well known. Over the past decade, the crucial roles of water molecules in protein structure, function, and dynamics have also become increasingly appreciated. This Perspective explores water in the biological environment by adopting its point of view in such phenomena. The prevailing thermodynamic models of the past, where water was seen largely in terms of an entropic gain after its displacement by a ligand, are now known to be much too simplistic. We adopt a set of terminology that describes water molecules as being "hot" and "cold", which we have defined as being easy and difficult to displace, respectively. The basis of these designations, which involve both enthalpic and entropic water contributions, is explored in several classes of biomolecules and structural motifs. The hallmarks for characterizing water molecules are examined, and computational tools for evaluating water-centric thermodynamics are reviewed. This Perspective's summary features guidelines for exploiting water molecules in drug discovery.

  12. Use of an Improved Matching Algorithm to Select Scaffolds for Enzyme Design Based on a Complex Active Site Model.

    PubMed

    Huang, Xiaoqiang; Xue, Jing; Lin, Min; Zhu, Yushan

    2016-01-01

    Active site preorganization helps native enzymes electrostatically stabilize the transition state better than the ground state for their primary substrates and achieve significant rate enhancement. In this report, we hypothesize that a complex active site model for active site preorganization modeling should help to create preorganized active site design and afford higher starting activities towards target reactions. Our matching algorithm ProdaMatch was improved by invoking effective pruning strategies and the native active sites for ten scaffolds in a benchmark test set were reproduced. The root-mean squared deviations between the matched transition states and those in the crystal structures were < 1.0 Å for the ten scaffolds, and the repacking calculation results showed that 91% of the hydrogen bonds within the active sites are recovered, indicating that the active sites can be preorganized based on the predicted positions of transition states. The application of the complex active site model for de novo enzyme design was evaluated by scaffold selection using a classic catalytic triad motif for the hydrolysis of p-nitrophenyl acetate. Eighty scaffolds were identified from a scaffold library with 1,491 proteins and four scaffolds were native esterase. Furthermore, enzyme design for complicated substrates was investigated for the hydrolysis of cephalexin using scaffold selection based on two different catalytic motifs. Only three scaffolds were identified from the scaffold library by virtue of the classic catalytic triad-based motif. In contrast, 40 scaffolds were identified using a more flexible, but still preorganized catalytic motif, where one scaffold corresponded to the α-amino acid ester hydrolase that catalyzes the hydrolysis and synthesis of cephalexin. Thus, the complex active site modeling approach for de novo enzyme design with the aid of the improved ProdaMatch program is a promising approach for the creation of active sites with high catalytic efficiencies towards target reactions.

  13. Use of an Improved Matching Algorithm to Select Scaffolds for Enzyme Design Based on a Complex Active Site Model

    PubMed Central

    Huang, Xiaoqiang; Xue, Jing; Lin, Min; Zhu, Yushan

    2016-01-01

    Active site preorganization helps native enzymes electrostatically stabilize the transition state better than the ground state for their primary substrates and achieve significant rate enhancement. In this report, we hypothesize that a complex active site model for active site preorganization modeling should help to create preorganized active site design and afford higher starting activities towards target reactions. Our matching algorithm ProdaMatch was improved by invoking effective pruning strategies and the native active sites for ten scaffolds in a benchmark test set were reproduced. The root-mean squared deviations between the matched transition states and those in the crystal structures were < 1.0 Å for the ten scaffolds, and the repacking calculation results showed that 91% of the hydrogen bonds within the active sites are recovered, indicating that the active sites can be preorganized based on the predicted positions of transition states. The application of the complex active site model for de novo enzyme design was evaluated by scaffold selection using a classic catalytic triad motif for the hydrolysis of p-nitrophenyl acetate. Eighty scaffolds were identified from a scaffold library with 1,491 proteins and four scaffolds were native esterase. Furthermore, enzyme design for complicated substrates was investigated for the hydrolysis of cephalexin using scaffold selection based on two different catalytic motifs. Only three scaffolds were identified from the scaffold library by virtue of the classic catalytic triad-based motif. In contrast, 40 scaffolds were identified using a more flexible, but still preorganized catalytic motif, where one scaffold corresponded to the α-amino acid ester hydrolase that catalyzes the hydrolysis and synthesis of cephalexin. 
    Thus, the complex active site modeling approach for de novo enzyme design with the aid of the improved ProdaMatch program is a promising approach for the creation of active sites with high catalytic efficiencies towards target reactions. PMID:27243223

  14. Parameter discovery in stochastic biological models using simulated annealing and statistical model checking.

    PubMed

    Hussain, Faraz; Jha, Sumit K; Jha, Susmit; Langmead, Christopher J

    2014-01-01

    Stochastic models are increasingly used to study the behaviour of biochemical systems. While the structure of such models is often readily available from first principles, unknown quantitative features of the model are incorporated into the model as parameters. Algorithmic discovery of parameter values from experimentally observed facts remains a challenge for the computational systems biology community. We present a new parameter discovery algorithm that uses simulated annealing, sequential hypothesis testing, and statistical model checking to learn the parameters in a stochastic model. We apply our technique to a model of glucose and insulin metabolism used for in-silico validation of artificial pancreata and demonstrate its effectiveness by developing a parallel CUDA-based implementation for parameter synthesis in this model.

  15. A proximity-based graph clustering method for the identification and application of transcription factor clusters.

    PubMed

    Spadafore, Maxwell; Najarian, Kayvan; Boyle, Alan P

    2017-11-29

    Transcription factors (TFs) form a complex regulatory network within the cell that is crucial to cell functioning and human health. While methods to establish where a TF binds to DNA are well established, these methods provide no information describing how TFs interact with one another when they do bind. TFs tend to bind the genome in clusters, and current methods to identify these clusters are either limited in scope, unable to detect relationships beyond motif similarity, or not applied to TF-TF interactions. Here, we present a proximity-based graph clustering approach to identify TF clusters using either ChIP-seq or motif search data. We use TF co-occurrence to construct a filtered, normalized adjacency matrix and use the Markov Clustering Algorithm to partition the graph while maintaining TF-cluster and cluster-cluster interactions. We then apply our graph structure beyond clustering, using it to increase the accuracy of motif-based TFBS searching for an example TF. We show that our method produces small, manageable clusters that encapsulate many known, experimentally validated transcription factor interactions and that our method is capable of capturing interactions that motif similarity methods might miss. Our graph structure is able to significantly increase the accuracy of motif TFBS searching, demonstrating that the TF-TF connections within the graph correlate with biological TF-TF interactions. The interactions identified by our method correspond to biological reality and allow for fast exploration of TF clustering and regulatory dynamics.
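
    The clustering step described above can be illustrated with a minimal sketch of Markov Clustering (expansion, inflation, column normalization) on a small co-occurrence matrix. This is not the authors' implementation: the function name, the parameter defaults, the added self-loops, and the six-node matrix are all assumptions made for illustration, and the filtering and normalization the abstract mentions are omitted.

```python
import numpy as np

def markov_cluster(adjacency, expansion=2, inflation=2.0, iterations=50, tol=1e-6):
    """Toy Markov Clustering (MCL) on a symmetric, non-negative adjacency matrix."""
    m = np.asarray(adjacency, dtype=float)
    m = m + np.eye(len(m))              # add self-loops (a common MCL convention)
    m = m / m.sum(axis=0)               # make every column sum to 1
    for _ in range(iterations):
        previous = m
        m = np.linalg.matrix_power(m, expansion)  # expansion: let flow spread
        m = m ** inflation                        # inflation: reward strong flow
        m = m / m.sum(axis=0)                     # re-normalize the columns
        if np.abs(m - previous).max() < tol:
            break
    # Rows that keep mass act as attractors; their non-zero columns form a cluster.
    clusters = set()
    for row in m:
        members = frozenset(np.flatnonzero(row > 1e-6).tolist())
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)

# Hypothetical co-occurrence counts among six TFs: two tight groups joined by one weak link.
cooccurrence = np.array([
    [0, 5, 4, 0, 0, 0],
    [5, 0, 6, 0, 0, 0],
    [4, 6, 0, 1, 0, 0],
    [0, 0, 1, 0, 7, 5],
    [0, 0, 0, 7, 0, 6],
    [0, 0, 0, 5, 6, 0],
])
clusters = markov_cluster(cooccurrence)
print(clusters)
```

    A higher inflation value tends to give smaller, more numerous clusters, which is the usual knob for tuning cluster granularity in MCL.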

  16. Genes@Work: an efficient algorithm for pattern discovery and multivariate feature selection in gene expression data.

    PubMed

    Lepre, Jorge; Rice, J Jeremy; Tu, Yuhai; Stolovitzky, Gustavo

    2004-05-01

    Despite the growing literature devoted to finding differentially expressed genes in assays probing different tissue types, little attention has been paid to the combinatorial nature of feature selection inherent to large, high-dimensional gene expression datasets. New flexible data analysis approaches capable of searching relevant subgroups of genes and experiments are needed to understand multivariate associations of gene expression patterns with observed phenotypes. We present in detail a deterministic algorithm to discover patterns of multivariate gene associations in gene expression data. The patterns discovered are differential with respect to a control dataset. The algorithm is exhaustive and efficient, reporting all existent patterns that fit a given input parameter set while avoiding enumeration of the entire pattern space. The value of the pattern discovery approach is demonstrated by finding a set of genes that differentiate between two types of lymphoma. Moreover, these genes are found to behave consistently in an independent dataset produced in a different laboratory using different arrays, thus validating the genes selected using our algorithm. We show that the genes deemed significant in terms of their multivariate statistics will be missed using other methods. Our set of pattern discovery algorithms including a user interface is distributed as a package called Genes@Work. This package is freely available to non-commercial users and can be downloaded from our website (http://www.research.ibm.com/FunGen).

  17. Simple Shared Motifs (SSM) in conserved region of promoters: a new approach to identify co-regulation patterns.

    PubMed

    Gruel, Jérémy; LeBorgne, Michel; LeMeur, Nolwenn; Théret, Nathalie

    2011-09-12

    Regulation of gene expression plays a pivotal role in cellular functions. However, understanding the dynamics of transcription remains a challenging task. A host of computational approaches have been developed to identify regulatory motifs, mainly based on the recognition of DNA sequences for transcription factor binding sites. Recent integration of additional data from genomic analyses or phylogenetic footprinting has significantly improved these methods. Here, we propose a different approach based on the compilation of Simple Shared Motifs (SSM), groups of sequences defined by their length and similarity and present in conserved sequences of gene promoters. We developed an original algorithm to search and count SSM in pairs of genes. An exceptional number of SSM is considered as a common regulatory pattern. The SSM approach is applied to a sample set of genes and validated using functional gene-set enrichment analyses. We demonstrate that the SSM approach selects genes that are over-represented in specific biological categories (Ontology and Pathways) and are enriched in co-expressed genes. Finally we show that genes co-expressed in the same tissue or involved in the same biological pathway have increased SSM values. Using unbiased clustering of genes, Simple Shared Motifs analysis constitutes an original contribution to provide a clearer definition of expression networks.

  18. Simple Shared Motifs (SSM) in conserved region of promoters: a new approach to identify co-regulation patterns

    PubMed Central

    2011-01-01

    Background Regulation of gene expression plays a pivotal role in cellular functions. However, understanding the dynamics of transcription remains a challenging task. A host of computational approaches have been developed to identify regulatory motifs, mainly based on the recognition of DNA sequences for transcription factor binding sites. Recent integration of additional data from genomic analyses or phylogenetic footprinting has significantly improved these methods. Results Here, we propose a different approach based on the compilation of Simple Shared Motifs (SSM), groups of sequences defined by their length and similarity and present in conserved sequences of gene promoters. We developed an original algorithm to search and count SSM in pairs of genes. An exceptional number of SSM is considered as a common regulatory pattern. The SSM approach is applied to a sample set of genes and validated using functional gene-set enrichment analyses. We demonstrate that the SSM approach selects genes that are over-represented in specific biological categories (Ontology and Pathways) and are enriched in co-expressed genes. Finally we show that genes co-expressed in the same tissue or involved in the same biological pathway have increased SSM values. Conclusions Using unbiased clustering of genes, Simple Shared Motifs analysis constitutes an original contribution to provide a clearer definition of expression networks. PMID:21910886

  19. Large scale structural optimization of trimetallic Cu-Au-Pt clusters up to 147 atoms

    NASA Astrophysics Data System (ADS)

    Wu, Genhua; Sun, Yan; Wu, Xia; Chen, Run; Wang, Yan

    2017-10-01

    The stable structures of Cu-Au-Pt clusters up to 147 atoms are optimized by using an improved adaptive immune optimization algorithm (AIOA-IC method), in which several motifs, such as decahedron, icosahedron, face centered cubic, sixfold pancake, and Leary tetrahedron, are randomly selected as the inner cores of the starting structures. The structures of Cu8AunPt30-n (n = 1-29), Cu8AunPt47-n (n = 1-46), and partial 75-, 79-, 100-, and 147-atom clusters are analyzed. Cu12Au93Pt42 cluster has onion-like Mackay icosahedral motif. The segregation phenomena of Cu, Au and Pt in clusters are explained by the atomic radius, surface energy, and cohesive energy.

  20. Biclustering sparse binary genomic data.

    PubMed

    van Uitert, Miranda; Meuleman, Wouter; Wessels, Lodewyk

    2008-12-01

    Genomic datasets often consist of large, binary, sparse data matrices. In such a dataset, one is often interested in finding contiguous blocks that (mostly) contain ones. This is a biclustering problem, and while many algorithms have been proposed to deal with gene expression data, only two algorithms have been proposed that specifically deal with binary matrices. None of the gene expression biclustering algorithms can handle the large number of zeros in sparse binary matrices. The two proposed binary algorithms failed to produce meaningful results. In this article, we present a new algorithm that is able to extract biclusters from sparse, binary datasets. A powerful feature is that biclusters with different numbers of rows and columns can be detected, varying from many rows to few columns and few rows to many columns. It allows the user to guide the search towards biclusters of specific dimensions. When applying our algorithm to an input matrix derived from TRANSFAC, we find transcription factors with distinctly dissimilar binding motifs, but a clear set of common targets that are significantly enriched for GO categories.

  1. Stem Cell Hydrogel, Jump-Starting Zika Drug Discovery, and Engineering RNA Recognition.

    PubMed

    Kostic, Milka

    2016-08-18

    Every month the editors of Cell Chemical Biology bring you highlights of the most recent chemical biology literature that impressed them with creativity and potential for follow-up work. Our August 2016 selection includes a description of hydrogels with self-tunable stiffness that are used to profile lipid metabolites during stem cell differentiation, a look at whether we can find a drug repurposing solution to Zika virus infection, and an engineered RNA recognition motif (RRM). Copyright © 2016. Published by Elsevier Ltd.

  2. Data Mining.

    ERIC Educational Resources Information Center

    Benoit, Gerald

    2002-01-01

    Discusses data mining (DM) and knowledge discovery in databases (KDD), taking the view that KDD is the larger view of the entire process, with DM emphasizing the cleaning, warehousing, mining, and visualization of knowledge discovery in databases. Highlights include algorithms; users; the Internet; text mining; and information extraction.…

  3. Knowledge Discovery in Databases.

    ERIC Educational Resources Information Center

    Norton, M. Jay

    1999-01-01

    Knowledge discovery in databases (KDD) revolves around the investigation and creation of knowledge, processes, algorithms, and mechanisms for retrieving knowledge from data collections. The article is an introductory overview of KDD. The rationale and environment of its development and applications are discussed. Issues related to database design…

  4. Automated Discovery of Elementary Chemical Reaction Steps Using Freezing String and Berny Optimization Methods.

    PubMed

    Suleimanov, Yury V; Green, William H

    2015-09-08

    We present a simple protocol which allows fully automated discovery of elementary chemical reaction steps using in cooperation double- and single-ended transition-state optimization algorithms--the freezing string and Berny optimization methods, respectively. To demonstrate the utility of the proposed approach, the reactivity of several single-molecule systems of combustion and atmospheric chemistry importance is investigated. The proposed algorithm allowed us to detect without any human intervention not only "known" reaction pathways, manually detected in the previous studies, but also new, previously "unknown", reaction pathways which involve significant atom rearrangements. We believe that applying such a systematic approach to elementary reaction path finding will greatly accelerate the discovery of new chemistry and will lead to more accurate computer simulations of various chemical processes.

  5. Promoter Motifs in NCLDVs: An Evolutionary Perspective

    PubMed Central

    Oliveira, Graziele Pereira; Andrade, Ana Cláudia dos Santos Pereira; Rodrigues, Rodrigo Araújo Lima; Arantes, Thalita Souza; Boratto, Paulo Victor Miranda; Silva, Ludmila Karen dos Santos; Dornas, Fábio Pio; Trindade, Giliane de Souza; Drumond, Betânia Paiva; La Scola, Bernard; Kroon, Erna Geessien; Abrahão, Jônatas Santos

    2017-01-01

    For many years, gene expression in the three cellular domains has been studied in an attempt to discover sequences associated with the regulation of the transcription process. Some specific transcriptional features were described in viruses, although few studies have been devoted to understanding the evolutionary aspects related to the spread of promoter motifs through related viral families. The discovery of giant viruses and the proposition of the new viral order Megavirales that comprise a monophyletic group, named nucleo-cytoplasmic large DNA viruses (NCLDV), raised new questions in the field. Some putative promoter sequences have already been described for some NCLDV members, bringing new insights into the evolutionary history of these complex microorganisms. In this review, we summarize the main aspects of the transcription regulation process in the three domains of life, followed by a systematic description of what is currently known about promoter regions in several NCLDVs. We also discuss how the analysis of the promoter sequences could bring new ideas about the giant viruses’ evolution. Finally, considering a possible common ancestor for the NCLDV group, we discussed possible promoters’ evolutionary scenarios and propose the term “MEGA-box” to designate an ancestor promoter motif (‘TATATAAAATTGA’) that could be evolved gradually by nucleotides’ gain and loss and point mutations. PMID:28117683

  6. Lessons from a tarantula: new insights into myosin interacting-heads motif evolution and its implications on disease.

    PubMed

    Alamo, Lorenzo; Pinto, Antonio; Sulbarán, Guidenn; Mavárez, Jesús; Padrón, Raúl

    2017-09-04

    Tarantula's leg muscle thick filament is the ideal model for the study of the structure and function of skeletal muscle thick filaments. Its analysis has given rise to a series of structural and functional studies, leading, among other things, to the discovery of the myosin interacting-heads motif (IHM). Further electron microscopy (EM) studies have shown the presence of IHM in frozen-hydrated and negatively stained thick filaments of striated, cardiac, and smooth muscle of bilaterians, most showing the IHM parallel to the filament axis. EM studies on negatively stained heavy meromyosin of different species have shown the presence of IHM on sponges, animals that lack muscle, extending the presence of IHM to metazoans. The IHM evolved about 800 MY ago in the ancestor of Metazoa, and independently with functional differences in the lineage leading to the slime mold Dictyostelium discoideum (Mycetozoa). This motif conveys important functional advantages, such as Ca2+ regulation and ATP energy-saving mechanisms. Recent interest has focused on human IHM structure in order to understand the structural basis underlying various conditions and situations of scientific and medical interest: the hypertrophic and dilated cardiomyopathies, overfeeding control, aging and hormone deprival muscle weakness, drug design for schistosomiasis control, and conditioning exercise physiology for the training of power athletes.

  7. The automated design of materials far from equilibrium

    NASA Astrophysics Data System (ADS)

    Miskin, Marc Z.

    Automated design is emerging as a powerful concept in materials science. By combining computer algorithms, simulations, and experimental data, new techniques are being developed that start with high level functional requirements and identify the ideal materials that achieve them. This represents a radically different picture of how materials become functional in which technological demand drives material discovery, rather than the other way around. At the frontiers of this field, materials systems previously considered too complicated can start to be controlled and understood. Particularly promising are materials far from equilibrium. Material robustness, high strength, self-healing and memory are properties displayed by several materials systems that are intrinsically out of equilibrium. These and other properties could be revolutionary, provided they can first be controlled. This thesis conceptualizes and implements a framework for designing materials that are far from equilibrium. We show how, even in the absence of a complete physical theory, design from the top down is possible and lends itself to producing physical insight. As a prototype system, we work with granular materials: collections of athermal, macroscopic identical objects, since these materials function both as an essential component of industrial processes as well as a model system for many non-equilibrium states of matter. We show that by placing granular materials in the context of design, benefits emerge simultaneously for fundamental and applied interests. As first steps, we use our framework to design granular aggregates with extreme properties like high stiffness, and softness. We demonstrate control over nonlinear effects by producing exotic aggregates that stiffen under compression. Expanding on our framework, we conceptualize new ways of thinking about material design when automatic discovery is possible. 
    We show how to build rules that link particle shapes to arbitrary granular packing density. We examine how the results of a design process are contingent upon operating conditions by studying which shapes dissipate energy fastest in a granular gas. We even move to create optimization algorithms for the expressed purpose of material design, by integrating them with statistical mechanics. In all of these cases, we show that turning to machines puts a fresh perspective on materials far from equilibrium. By matching forms to functions, complexities become possibilities, motifs emerge that describe new physics, and the door opens to rational design.

  8. Complete Genome Analysis of Three Novel Picornaviruses from Diverse Bat Species

    PubMed Central

    Lau, Susanna K. P.; Woo, Patrick C. Y.; Lai, Kenneth K. Y.; Huang, Yi; Yip, Cyril C. Y.; Shek, Chung-Tong; Lee, Paul; Lam, Carol S. F.; Chan, Kwok-Hung; Yuen, Kwok-Yung

    2011-01-01

    Although bats are important reservoirs of diverse viruses that can cause human epidemics, little is known about the presence of picornaviruses in these flying mammals. Among 1,108 bats of 18 species studied, three novel picornaviruses (groups 1, 2, and 3) were identified from alimentary specimens of 12 bats from five species and four genera. Two complete genomes, each from the three picornaviruses, were sequenced. Phylogenetic analysis showed that they fell into three distinct clusters in the Picornaviridae family, with low homologies to known picornaviruses, especially in leader and 2A proteins. Moreover, group 1 and 2 viruses are more closely related to each other than to group 3 viruses, which exhibit genome features distinct from those of the former two virus groups. In particular, the group 3 virus genome contains the shortest leader protein within Picornaviridae, a putative type I internal ribosome entry site (IRES) in the 5′-untranslated region instead of the type IV IRES found in group 1 and 2 viruses, one instead of two GXCG motifs in 2A, an L→V substitution in the DDLXQ motif in 2C helicase, and a conserved GXH motif in 3C protease. Group 1 and 2 viruses are unique among picornaviruses in having AMH instead of the GXH motif in 3Cpro. These findings suggest that the three picornaviruses belong to two novel genera in the Picornaviridae family. This report describes the discovery and complete genome analysis of three picornaviruses in bats, and their presence in diverse bat genera/species suggests the ability to cross the species barrier. PMID:21697464

  9. A Relevancy Algorithm for Curating Earth Science Data Around Phenomenon

    NASA Technical Reports Server (NTRS)

    Maskey, Manil; Ramachandran, Rahul; Li, Xiang; Weigel, Amanda; Bugbee, Kaylin; Gatlin, Patrick; Miller, J. J.

    2017-01-01

    Earth science data are being collected for various science needs and applications, processed using different algorithms at multiple resolutions and coverages, and then archived at different archiving centers for distribution and stewardship causing difficulty in data discovery. Curation, which typically occurs in museums, art galleries, and libraries, is traditionally defined as the process of collecting and organizing information around a common subject matter or a topic of interest. Curating data sets around topics or areas of interest addresses some of the data discovery needs in the field of Earth science, especially for unanticipated users of data. This paper describes a methodology to automate search and selection of data around specific phenomena. Different components of the methodology including the assumptions, the process, and the relevancy ranking algorithm are described. The paper makes two unique contributions to improving data search and discovery capabilities. First, the paper describes a novel methodology developed for automatically curating data around a topic using Earth science metadata records. Second, the methodology has been implemented as a standalone web service that is utilized to augment search and usability of data in a variety of tools.

  10. A relevancy algorithm for curating earth science data around phenomenon

    NASA Astrophysics Data System (ADS)

    Maskey, Manil; Ramachandran, Rahul; Li, Xiang; Weigel, Amanda; Bugbee, Kaylin; Gatlin, Patrick; Miller, J. J.

    2017-09-01

    Earth science data are being collected for various science needs and applications, processed using different algorithms at multiple resolutions and coverages, and then archived at different archiving centers for distribution and stewardship causing difficulty in data discovery. Curation, which typically occurs in museums, art galleries, and libraries, is traditionally defined as the process of collecting and organizing information around a common subject matter or a topic of interest. Curating data sets around topics or areas of interest addresses some of the data discovery needs in the field of Earth science, especially for unanticipated users of data. This paper describes a methodology to automate search and selection of data around specific phenomena. Different components of the methodology including the assumptions, the process, and the relevancy ranking algorithm are described. The paper makes two unique contributions to improving data search and discovery capabilities. First, the paper describes a novel methodology developed for automatically curating data around a topic using Earth science metadata records. Second, the methodology has been implemented as a stand-alone web service that is utilized to augment search and usability of data in a variety of tools.

  11. MLViS: A Web Tool for Machine Learning-Based Virtual Screening in Early-Phase of Drug Discovery and Development

    PubMed Central

    Korkmaz, Selcuk; Zararsiz, Gokmen; Goksuluk, Dincer

    2015-01-01

    Virtual screening is an important step in the early phase of the drug discovery process. Since there are thousands of compounds, this step should be both fast and effective in order to distinguish drug-like and nondrug-like molecules. Statistical machine learning methods are widely used in drug discovery studies for classification purposes. Here, we aim to develop a new tool that can classify molecules as drug-like or nondrug-like based on various machine learning methods, including discriminant, tree-based, kernel-based, ensemble and other algorithms. To construct this tool, the performances of twenty-three different machine learning algorithms were first compared by ten different measures; the ten best performing algorithms were then selected based on principal component and hierarchical cluster analysis results. Besides classification, this application can also create heat maps and dendrograms for visual inspection of the molecules through hierarchical cluster analysis. Moreover, users can connect to the PubChem database to download molecular information and to create two-dimensional structures of compounds. This application is freely available through www.biosoft.hacettepe.edu.tr/MLViS/. PMID:25928885

  12. Blind prediction of noncanonical RNA structure at atomic accuracy.

    PubMed

    Watkins, Andrew M; Geniesse, Caleb; Kladwang, Wipapat; Zakrevsky, Paul; Jaeger, Luc; Das, Rhiju

    2018-05-01

    Prediction of RNA structure from nucleotide sequence remains an unsolved grand challenge of biochemistry and requires distinct concepts from protein structure prediction. Despite extensive algorithmic development in recent years, modeling of noncanonical base pairs of new RNA structural motifs has not been achieved in blind challenges. We report a stepwise Monte Carlo (SWM) method with a unique add-and-delete move set that enables predictions of noncanonical base pairs of complex RNA structures. A benchmark of 82 diverse motifs establishes the method's general ability to recover noncanonical pairs ab initio, including multistrand motifs that have been refractory to prior approaches. In a blind challenge, SWM models predicted nucleotide-resolution chemical mapping and compensatory mutagenesis experiments for three in vitro selected tetraloop/receptors with previously unsolved structures (C7.2, C7.10, and R1). As a final test, SWM blindly and correctly predicted all noncanonical pairs of a Zika virus double pseudoknot during a recent community-wide RNA-Puzzle. Stepwise structure formation, as encoded in the SWM method, enables modeling of noncanonical RNA structure in a variety of previously intractable problems.

  13. Exact calculation of distributions on integers, with application to sequence alignment.

    PubMed

    Newberg, Lee A; Lawrence, Charles E

    2009-01-01

    Computational biology is replete with high-dimensional discrete prediction and inference problems. Dynamic programming recursions can be applied to several of the most important of these, including sequence alignment, RNA secondary-structure prediction, phylogenetic inference, and motif finding. In these problems, attention is frequently focused on some scalar quantity of interest, a score, such as an alignment score or the free energy of an RNA secondary structure. In many cases, score is naturally defined on integers, such as a count of the number of pairing differences between two sequence alignments, or else an integer score has been adopted for computational reasons, such as in the test of significance of motif scores. The probability distribution of the score under an appropriate probabilistic model is of interest, such as in tests of significance of motif scores, or in calculation of Bayesian confidence limits around an alignment. Here we present three algorithms for calculating the exact distribution of a score of this type; then, in the context of pairwise local sequence alignments, we apply the approach so as to find the alignment score distribution and Bayesian confidence limits.
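
    As a small illustration of what an exact distribution of an integer score looks like, the sketch below computes the distribution of a summed per-position integer score by dynamic programming (repeated convolution) with exact rational arithmetic. It is a simplified stand-in, not one of the three algorithms from the paper: the function name, the uniform background, and the three-position score matrix are assumptions, and positions are assumed independent.

```python
from collections import defaultdict
from fractions import Fraction

def score_distribution(position_scores, background):
    """Exact distribution of the summed integer score of a random sequence.

    position_scores: one dict per position, mapping letter -> integer score.
    background: dict mapping letter -> probability (positions assumed independent).
    """
    dist = {0: Fraction(1)}                 # before any position, the score is 0
    for scores in position_scores:
        nxt = defaultdict(Fraction)         # Fraction() is an exact zero
        for total, p in dist.items():
            for letter, q in background.items():
                nxt[total + scores[letter]] += p * q
        dist = dict(nxt)
    return dist

# Hypothetical uniform background and a toy 3-position integer score matrix.
background = {c: Fraction(1, 4) for c in "ACGT"}
matrix = [
    {"A": 2, "C": -1, "G": -1, "T": 0},
    {"A": -1, "C": 2, "G": 0, "T": -1},
    {"A": 0, "C": -1, "G": 2, "T": -1},
]
dist = score_distribution(matrix, background)
p_value = sum(p for s, p in dist.items() if s >= 4)   # exact tail probability
print(sorted(dist.items()))
print("P(score >= 4) =", p_value)
```

    Because the scores are integers, the table stays small (one entry per reachable total), which is what makes an exact tail probability cheap to compute in this toy setting.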

  14. Function-based classification of carbohydrate-active enzymes by recognition of short, conserved peptide motifs.

    PubMed

    Busk, Peter Kamp; Lange, Lene

    2013-06-01

    Functional prediction of carbohydrate-active enzymes is difficult due to low sequence identity. However, similar enzymes often share a few short motifs, e.g., around the active site, even when the overall sequences are very different. To exploit this notion for functional prediction of carbohydrate-active enzymes, we developed a simple algorithm, peptide pattern recognition (PPR), that can divide proteins into groups of sequences that share a set of short conserved sequences. When this method was used on 118 glycoside hydrolase 5 proteins with 9% average pairwise identity and representing four characterized enzymatic functions, 97% of the proteins were sorted into groups correlating with their enzymatic activity. Furthermore, we analyzed 8,138 glycoside hydrolase 13 proteins including 204 experimentally characterized enzymes with 28 different functions. There was a 91% correlation between group and enzyme activity. These results indicate that the function of carbohydrate-active enzymes can be predicted with high precision by finding short, conserved motifs in their sequences. The glycoside hydrolase 61 family is important for fungal biomass conversion, but only a few proteins of this family have been functionally characterized. Interestingly, PPR divided 743 glycoside hydrolase 61 proteins into 16 subfamilies useful for targeted investigation of the function of these proteins and pinpointed three conserved motifs with putative importance for enzyme activity. Furthermore, the conserved sequences were useful for cloning of new, subfamily-specific glycoside hydrolase 61 proteins from 14 fungi. In conclusion, identification of conserved sequence motifs is a new approach to sequence analysis that can predict carbohydrate-active enzyme functions with high precision.

  15. Discovery of Mixed Pharmacology Melanocortin-3 Agonists and Melanocortin-4 Receptor Tetrapeptide Antagonist Compounds (TACOs) Based on the Sequence Ac-Xaa1-Arg-(pI)DPhe-Xaa4-NH2.

    PubMed

    Doering, Skye R; Freeman, Katie T; Schnell, Sathya M; Haslach, Erica M; Dirain, Marvin; Debevec, Ginamarie; Geer, Phaedra; Santos, Radleigh G; Giulianotti, Marc A; Pinilla, Clemencia; Appel, Jon R; Speth, Robert C; Houghten, Richard A; Haskell-Luevano, Carrie

    2017-05-25

    The centrally expressed melanocortin-3 and -4 receptors (MC3R/MC4R) have been studied as possible targets for weight management therapies, with a preponderance of studies focusing on the MC4R. Herein, a novel tetrapeptide scaffold [Ac-Xaa1-Arg-(pI)DPhe-Xaa4-NH2] is reported. The scaffold was derived from results obtained from a MC3R mixture-based positional scanning campaign. From these results, a set of 48 tetrapeptides were designed and pharmacologically characterized at the mouse melanocortin-1, -3, -4, and -5 receptors. This resulted in the serendipitous discovery of nine compounds that were MC3R agonists (EC50 < 1000 nM) and MC4R antagonists (5.7 < pA2 < 7.8). The three most potent MC3R agonists, 18 [Ac-Arg-Arg-(pI)DPhe-Tic-NH2], 1 [Ac-His-Arg-(pI)DPhe-Tic-NH2], and 41 [Ac-Arg-Arg-(pI)DPhe-DNal(2')-NH2] were more potent (EC50 < 73 nM) than the melanocortin tetrapeptide Ac-His-DPhe-Arg-Trp-NH2. This template contains a sequentially reversed "Arg-(pI)DPhe" motif with respect to the classical "Phe-Arg" melanocortin signaling motif, which results in pharmacology that is first-in-class for the central melanocortin receptors.

  16. Distributed Load Shedding over Directed Communication Networks with Time Delays

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Yang, Tao; Wu, Di

    When generation is insufficient to support all loads under emergencies, effective and efficient load shedding needs to be deployed in order to maintain the supply-demand balance. This paper presents a distributed load shedding algorithm, which makes efficient decisions based on the discovered global information. In the global information discovery process, each load communicates only with its neighboring loads via directed communication links, possibly with arbitrarily large but bounded time-varying communication delays. We propose a novel distributed information discovery algorithm based on ratio consensus. Simulation results are used to validate the proposed method.
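    Ratio consensus on a directed graph can be sketched as follows. This is a minimal, delay-free illustration and not the authors' algorithm: the function name, the graph encoding, and the equal-split weights are assumptions. Each node keeps two running sums and splits both equally between itself and its out-neighbors, and the ratio of the two sums at every node approaches the network-wide average of the initial values when the graph is strongly connected.

```python
def ratio_consensus(out_neighbors, values, iterations=200):
    """Estimate the average of `values` at every node of a directed graph.

    out_neighbors[i] lists the nodes that node i sends to (hypothetical encoding).
    Delays are ignored here; the abstract's setting allows bounded delays.
    """
    n = len(values)
    y = [float(v) for v in values]  # numerator state, starts at the local value
    z = [1.0] * n                   # denominator state, starts at 1
    for _ in range(iterations):
        new_y = [0.0] * n
        new_z = [0.0] * n
        for i in range(n):
            targets = [i] + list(out_neighbors[i])  # node keeps one share itself
            share = 1.0 / len(targets)
            for j in targets:
                new_y[j] += share * y[i]
                new_z[j] += share * z[i]
        y, z = new_y, new_z
    return [yi / zi for yi, zi in zip(y, z)]


# Three loads on a strongly connected directed graph (made-up numbers).
graph = {0: [1], 1: [2], 2: [0, 1]}
estimates = ratio_consensus(graph, [10.0, 20.0, 60.0])
```

    Every node ends up with an estimate close to 30, the average of the three initial values, without any node seeing the whole network.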

  17. Extraction of consensus protein patterns in regions containing non-proline cis peptide bonds and their functional assessment.

    PubMed

    Exarchos, Konstantinos P; Exarchos, Themis P; Rigas, Georgios; Papaloukas, Costas; Fotiadis, Dimitrios I

    2011-05-10

    In peptides and proteins, only a small percentile of peptide bonds adopts the cis configuration. Especially in the case of amide peptide bonds, the amount of cis conformations is quite limited thus hampering systematic studies, until recently. However, lately the emerging population of databases with more 3D structures of proteins has produced a considerable number of sequences containing non-proline cis formations (cis-nonPro). In our work, we extract regular expression-type patterns that are descriptive of regions surrounding the cis-nonPro formations. For this purpose, three types of pattern discovery are performed: i) exact pattern discovery, ii) pattern discovery using a chemical equivalency set, and iii) pattern discovery using a structural equivalency set. Afterwards, using each pattern as predicate, we search the Eukaryotic Linear Motif (ELM) resource to identify potential functional implications of regions with cis-nonPro peptide bonds. The patterns extracted from each type of pattern discovery are further employed, in order to formulate a pattern-based classifier, which is used to discriminate between cis-nonPro and trans-nonPro formations. In terms of functional implications, we observe a significant association of cis-nonPro peptide bonds towards ligand/binding functionalities. As for the pattern-based classification scheme, the highest results were obtained using the structural equivalency set, which yielded 70% accuracy, 77% sensitivity and 63% specificity.
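    Pattern discovery with an equivalency set can be illustrated with a small sketch that turns aligned sequence windows into a regular expression. The residue groups below are a hypothetical chemical equivalency set, not the one used in the study, and the function names are illustrative.

```python
import re

# Hypothetical equivalency set: residues in one group are treated as interchangeable.
GROUPS = ["ILVM", "DE", "KR", "ST", "FWY", "NQ"]


def column_token(residues):
    """Regex token for one alignment column: literal, residue class, or wildcard."""
    unique = set(residues)
    if len(unique) == 1:
        return next(iter(unique))
    for group in GROUPS:
        if unique <= set(group):
            return "[" + group + "]"
    return "."


def windows_to_pattern(windows):
    """Build a regex from equal-length windows centred on the site of interest."""
    length = len(windows[0])
    if any(len(w) != length for w in windows):
        raise ValueError("all windows must have the same length")
    return "".join(column_token(column) for column in zip(*windows))


pattern = windows_to_pattern(["AKDLS", "ARELT", "AKDVS"])  # made-up windows
matches_new = re.fullmatch(pattern, "ARDIT") is not None
```

    With an empty `GROUPS` list the same code performs exact pattern discovery, since every variable column then falls back to the wildcard.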

  18. NNAlign: a platform to construct and evaluate artificial neural network models of receptor-ligand interactions.

    PubMed

    Nielsen, Morten; Andreatta, Massimo

    2017-07-03

    Peptides are extensively used to characterize functional or (linear) structural aspects of receptor-ligand interactions in biological systems, e.g. SH2, SH3, PDZ peptide-recognition domains, the MHC membrane receptors and enzymes such as kinases and phosphatases. NNAlign is a method for the identification of such linear motifs in biological sequences. The algorithm aligns the amino acid or nucleotide sequences provided as training set, and generates a model of the sequence motif detected in the data. The webserver allows setting up cross-validation experiments to estimate the performance of the model, as well as evaluations on independent data. Many features of the training sequences can be encoded as input, and the network architecture is highly customizable. The results returned by the server include a graphical representation of the motif identified by the method, performance values and a downloadable model that can be applied to scan protein sequences for occurrence of the motif. While its performance for the characterization of peptide-MHC interactions is widely documented, we extended NNAlign to be applicable to other receptor-ligand systems as well. Version 2.0 supports alignments with insertions and deletions, encoding of receptor pseudo-sequences, and custom alphabets for the training sequences. The server is available at http://www.cbs.dtu.dk/services/NNAlign-2.0. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  19. iELM—a web server to explore short linear motif-mediated interactions

    PubMed Central

    Weatheritt, Robert J.; Jehl, Peter; Dinkel, Holger; Gibson, Toby J.

    2012-01-01

    The recent expansion in our knowledge of protein–protein interactions (PPIs) has allowed the annotation and prediction of hundreds of thousands of interactions. However, the function of many of these interactions remains elusive. The interactions of Eukaryotic Linear Motif (iELM) web server provides a resource for predicting the function and positional interface for a subset of interactions mediated by short linear motifs (SLiMs). The iELM prediction algorithm is based on the annotated SLiM classes from the Eukaryotic Linear Motif (ELM) resource and allows users to explore both annotated and user-generated PPI networks for SLiM-mediated interactions. By incorporating the annotated information from the ELM resource, iELM provides functional details of PPIs. This can be used in proteomic analysis, for example, to infer whether an interaction promotes complex formation or degradation. Furthermore, details of the molecular interface of the SLiM-mediated interactions are also predicted. This information is displayed in a fully searchable table, as well as graphically with the modular architecture of the participating proteins extracted from the UniProt and Phospho.ELM resources. A network figure is also presented to aid the interpretation of results. The iELM server supports single protein queries as well as large-scale proteomic submissions and is freely available at http://i.elm.eu.org. PMID:22638578

  20. Matching 4.7-Å XRD Spacing in Amelogenin Nanoribbons and Enamel Matrix

    PubMed Central

    Sanii, B.; Martinez-Avila, O.; Simpliciano, C.; Zuckermann, R.N.; Habelitz, S.

    2014-01-01

    The recent discovery of conditions that induce nanoribbon structures of amelogenin protein in vitro raises questions about their role in enamel formation. Nanoribbons of recombinant human full-length amelogenin (rH174) are about 17 nm wide and self-align into parallel bundles; thus, they could act as templates for crystallization of nanofibrous apatite comprising dental enamel. Here we analyzed the secondary structures of nanoribbon amelogenin by x-ray diffraction (XRD) and Fourier transform infrared spectroscopy (FTIR) and tested if the structural motif matches previous data on the organic matrix of enamel. XRD analysis showed that a peak corresponding to 4.7 Å is present in nanoribbons of amelogenin. In addition, FTIR analysis showed that amelogenin in the form of nanoribbons was comprised of β-sheets by up to 75%, while amelogenin nanospheres had predominantly random-coil structure. The observation of a 4.7-Å XRD spacing confirms the presence of β-sheets and illustrates structural parallels between the in vitro assemblies and structural motifs in developing enamel. PMID:25048248

  1. Identifying transcription factor functions and targets by phenotypic activation

    PubMed Central

    Chua, Gordon; Morris, Quaid D.; Sopko, Richelle; Robinson, Mark D.; Ryan, Owen; Chan, Esther T.; Frey, Brendan J.; Andrews, Brenda J.; Boone, Charles; Hughes, Timothy R.

    2006-01-01

    Mapping transcriptional regulatory networks is difficult because many transcription factors (TFs) are activated only under specific conditions. We describe a generic strategy for identifying genes and pathways induced by individual TFs that does not require knowledge of their normal activation cues. Microarray analysis of 55 yeast TFs that caused a growth phenotype when overexpressed showed that the majority caused increased transcript levels of genes in specific physiological categories, suggesting a mechanism for growth inhibition. Induced genes typically included established targets and genes with consensus promoter motifs, if known, indicating that these data are useful for identifying potential new target genes and binding sites. We identified the sequence 5′-TCACGCAA as a binding sequence for Hms1p, a TF that positively regulates pseudohyphal growth and previously had no known motif. The general strategy outlined here presents a straightforward approach to discovery of TF activities and mapping targets that could be adapted to any organism with transgenic technology. PMID:16880382

  2. Trellis Tone Modulation Multiple-Access for Peer Discovery in D2D Networks

    PubMed Central

    Lim, Chiwoo; Kim, Sang-Hyo

    2018-01-01

    In this paper, a new non-orthogonal multiple-access scheme, trellis tone modulation multiple-access (TTMMA), is proposed for peer discovery of distributed device-to-device (D2D) communication. The range and capacity of discovery are important performance metrics in peer discovery. The proposed trellis tone modulation uses single-tone transmission and achieves a long discovery range due to its low Peak-to-Average Power Ratio (PAPR). The TTMMA also exploits non-orthogonal resource assignment to increase the discovery capacity. For the multi-user detection of superposed multiple-access signals, a message-passing algorithm with supplementary schemes is proposed. With TTMMA and its message-passing demodulation, approximately 1.5 times the number of devices are discovered compared to the conventional frequency division multiple-access (FDMA)-based discovery. PMID:29673167

  3. Trellis Tone Modulation Multiple-Access for Peer Discovery in D2D Networks.

    PubMed

    Lim, Chiwoo; Jang, Min; Kim, Sang-Hyo

    2018-04-17

    In this paper, a new non-orthogonal multiple-access scheme, trellis tone modulation multiple-access (TTMMA), is proposed for peer discovery of distributed device-to-device (D2D) communication. The range and capacity of discovery are important performance metrics in peer discovery. The proposed trellis tone modulation uses single-tone transmission and achieves a long discovery range due to its low Peak-to-Average Power Ratio (PAPR). The TTMMA also exploits non-orthogonal resource assignment to increase the discovery capacity. For the multi-user detection of superposed multiple-access signals, a message-passing algorithm with supplementary schemes is proposed. With TTMMA and its message-passing demodulation, approximately 1.5 times the number of devices are discovered compared to the conventional frequency division multiple-access (FDMA)-based discovery.

  4. Hidden Markov models of biological primary sequence information.

    PubMed Central

    Baldi, P; Chauvin, Y; Hunkapiller, T; McClure, M A

    1994-01-01

    Hidden Markov model (HMM) techniques are used to model families of biological sequences. A smooth and convergent algorithm is introduced to iteratively adapt the transition and emission parameters of the models from the examples in a given family. The HMM approach is applied to three protein families: globins, immunoglobulins, and kinases. In all cases, the models derived capture the important statistical characteristics of the family and can be used for a number of tasks, including multiple alignments, motif detection, and classification. For K sequences of average length N, this approach yields an effective multiple-alignment algorithm which requires O(KN^2) operations, linear in the number of sequences. PMID:8302831
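    Aligning a sequence to an HMM amounts to finding its most probable state path, which can be sketched with a generic Viterbi decoder. The two-state model below (background "B" and motif "M") and all of its probabilities are made up for illustration and are not the models from the paper. The quadratic cost per sequence plausibly arises because a family model has on the order of N states, each with only a few incoming transitions, so decoding one sequence of length N takes roughly N times N steps.

```python
import math


def viterbi(seq, states, start, trans, emit):
    """Most probable state path for seq, computed in log space.

    Assumes every start, transition, and emission probability used is nonzero.
    """
    score = {s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for symbol in seq[1:]:
        new_score, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: score[p] + math.log(trans[p][s]))
            new_score[s] = (score[prev] + math.log(trans[prev][s])
                            + math.log(emit[s][symbol]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):  # trace back from the last position
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy model: background favours "A", motif favours "C" (hypothetical numbers).
states = ["B", "M"]
start = {"B": 0.5, "M": 0.5}
trans = {"B": {"B": 0.5, "M": 0.5}, "M": {"B": 0.5, "M": 0.5}}
emit = {"B": {"A": 0.9, "C": 0.1}, "M": {"A": 0.1, "C": 0.9}}
path = viterbi("AACCA", states, start, trans, emit)
```

    Because the toy transitions are uniform, the decoded path simply follows the more likely emitter at each position, labelling the two C residues as motif.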

  5. Software engineering capability for Ada (GRASP/Ada Tool)

    NASA Technical Reports Server (NTRS)

    Cross, James H., II

    1995-01-01

    The GRASP/Ada project (Graphical Representations of Algorithms, Structures, and Processes for Ada) has successfully created and prototyped a new algorithmic level graphical representation for Ada software, the Control Structure Diagram (CSD). The primary impetus for creation of the CSD was to improve the comprehension efficiency of Ada software and, as a result, improve reliability and reduce costs. The emphasis has been on the automatic generation of the CSD from Ada PDL or source code to support reverse engineering and maintenance. The CSD has the potential to replace traditional prettyprinted Ada Source code. A new Motif compliant graphical user interface has been developed for the GRASP/Ada prototype.

  6. Sulfonamidation of Aryl and Heteroaryl Halides through Photosensitized Nickel Catalysis.

    PubMed

    Kim, Taehoon; McCarver, Stefan J; Lee, Chulbom; MacMillan, David W C

    2018-03-19

    Herein we report a highly efficient method for nickel-catalyzed C-N bond formation between sulfonamides and aryl electrophiles. This technology provides generic access to a broad range of N-aryl and N-heteroaryl sulfonamide motifs, which are widely represented in drug discovery. Initial mechanistic studies suggest an energy-transfer mechanism wherein C-N bond reductive elimination occurs from a triplet excited Ni(II) complex. Late-stage sulfonamidation in the synthesis of a pharmacologically relevant structure is also demonstrated. © 2018 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim.

  7. Improved genome-scale multi-target virtual screening via a novel collaborative filtering approach to cold-start problem

    PubMed Central

    Lim, Hansaim; Gray, Paul; Xie, Lei; Poleksic, Aleksandar

    2016-01-01

    Conventional one-drug-one-gene approach has been of limited success in modern drug discovery. Polypharmacology, which focuses on searching for multi-targeted drugs to perturb disease-causing networks instead of designing selective ligands to target individual proteins, has emerged as a new drug discovery paradigm. Although many methods for single-target virtual screening have been developed to improve the efficiency of drug discovery, few of these algorithms are designed for polypharmacology. Here, we present a novel theoretical framework and a corresponding algorithm for genome-scale multi-target virtual screening based on the one-class collaborative filtering technique. Our method overcomes the sparseness of the protein-chemical interaction data by means of interaction matrix weighting and dual regularization from both chemicals and proteins. While the statistical foundation behind our method is general enough to encompass genome-wide drug off-target prediction, the program is specifically tailored to find protein targets for new chemicals with little to no available interaction data. We extensively evaluate our method using a number of the most widely accepted gene-specific and cross-gene family benchmarks and demonstrate that our method outperforms other state-of-the-art algorithms for predicting the interaction of new chemicals with multiple proteins. Thus, the proposed algorithm may provide a powerful tool for multi-target drug design. PMID:27958331

  8. Improved genome-scale multi-target virtual screening via a novel collaborative filtering approach to cold-start problem.

    PubMed

    Lim, Hansaim; Gray, Paul; Xie, Lei; Poleksic, Aleksandar

    2016-12-13

    Conventional one-drug-one-gene approach has been of limited success in modern drug discovery. Polypharmacology, which focuses on searching for multi-targeted drugs to perturb disease-causing networks instead of designing selective ligands to target individual proteins, has emerged as a new drug discovery paradigm. Although many methods for single-target virtual screening have been developed to improve the efficiency of drug discovery, few of these algorithms are designed for polypharmacology. Here, we present a novel theoretical framework and a corresponding algorithm for genome-scale multi-target virtual screening based on the one-class collaborative filtering technique. Our method overcomes the sparseness of the protein-chemical interaction data by means of interaction matrix weighting and dual regularization from both chemicals and proteins. While the statistical foundation behind our method is general enough to encompass genome-wide drug off-target prediction, the program is specifically tailored to find protein targets for new chemicals with little to no available interaction data. We extensively evaluate our method using a number of the most widely accepted gene-specific and cross-gene family benchmarks and demonstrate that our method outperforms other state-of-the-art algorithms for predicting the interaction of new chemicals with multiple proteins. Thus, the proposed algorithm may provide a powerful tool for multi-target drug design.

  9. A biological compression model and its applications.

    PubMed

    Cao, Minh Duc; Dix, Trevor I; Allison, Lloyd

    2011-01-01

    A biological compression model, expert model, is presented which is superior to existing compression algorithms in both compression performance and speed. The model is able to compress whole eukaryotic genomes. Most importantly, the model provides a framework for knowledge discovery from biological data. It can be used for repeat element discovery, sequence alignment and phylogenetic analysis. We demonstrate that the model can handle statistically biased sequences and distantly related sequences where conventional knowledge discovery tools often fail.

  10. The center for causal discovery of biomedical knowledge from big data.

    PubMed

    Cooper, Gregory F; Bahar, Ivet; Becich, Michael J; Benos, Panayiotis V; Berg, Jeremy; Espino, Jeremy U; Glymour, Clark; Jacobson, Rebecca Crowley; Kienholz, Michelle; Lee, Adrian V; Lu, Xinghua; Scheines, Richard

    2015-11-01

    The Big Data to Knowledge (BD2K) Center for Causal Discovery is developing and disseminating an integrated set of open source tools that support causal modeling and discovery of biomedical knowledge from large and complex biomedical datasets. The Center integrates teams of biomedical and data scientists focused on the refinement of existing and the development of new constraint-based and Bayesian algorithms based on causal Bayesian networks, the optimization of software for efficient operation in a supercomputing environment, and the testing of algorithms and software developed using real data from 3 representative driving biomedical projects: cancer driver mutations, lung disease, and the functional connectome of the human brain. Associated training activities provide both biomedical and data scientists with the knowledge and skills needed to apply and extend these tools. Collaborative activities with the BD2K Consortium further advance causal discovery tools and integrate tools and resources developed by other centers. © The Author 2015. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com.

  11. Service-based analysis of biological pathways

    PubMed Central

    Zheng, George; Bouguettaya, Athman

    2009-01-01

    Background: Computer-based pathway discovery is concerned with two important objectives: pathway identification and analysis. Conventional mining and modeling approaches aimed at pathway discovery are often effective at achieving either objective, but not both. Such limitations can be effectively tackled by leveraging a Web service-based modeling and mining approach. Results: Inspired by molecular recognition and drug discovery processes, we developed a Web service mining tool, named PathExplorer, to discover potentially interesting biological pathways linking service models of biological processes. The tool uses an innovative approach to identify useful pathways based on graph-based hints and service-based simulation verifying the user's hypotheses. Conclusion: Web service modeling of biological processes allows easy access to and invocation of these processes on the Web. The Web service mining techniques described in this paper enable the discovery of biological pathways linking these process service models. The algorithms presented in this paper for automatically highlighting interesting subgraphs within an identified pathway network enable the user to formulate hypotheses, which can be tested using our simulation algorithm, also described in this paper. PMID:19796403

  12. Dereplication of peptidic natural products through database search of mass spectra

    PubMed Central

    Mohimani, Hosein; Gurevich, Alexey; Mikheenko, Alla; Garg, Neha; Nothias, Louis-Felix; Ninomiya, Akihiro; Takada, Kentaro; Dorrestein, Pieter C.; Pevzner, Pavel A.

    2016-01-01

    Peptidic Natural Products (PNPs) are widely used compounds that include many antibiotics and a variety of other bioactive peptides. While recent breakthroughs in PNP discovery raised the challenge of developing new algorithms for their analysis, identification of PNPs via database search of tandem mass spectra remains an open problem. To address this problem, natural product researchers utilize dereplication strategies that identify known PNPs and lead to the discovery of new ones even in cases when the reference spectra are not present in existing spectral libraries. DEREPLICATOR is a new dereplication algorithm that enabled high-throughput PNP identification and that is compatible with large-scale mass spectrometry-based screening platforms for natural product discovery. After searching nearly one hundred million tandem mass spectra in the Global Natural Products Social (GNPS) molecular networking infrastructure, DEREPLICATOR identified an order of magnitude more PNPs (and their new variants) than any previous dereplication efforts. PMID:27820803

  13. Evidence-Based Diagnostic Algorithm for Glioma: Analysis of the Results of Pathology Panel Review and Molecular Parameters of EORTC 26951 and 26882 Trials.

    PubMed

    Kros, Johan M; Huizer, Karin; Hernández-Laín, Aurelio; Marucci, Gianluca; Michotte, Alex; Pollo, Bianca; Rushing, Elisabeth J; Ribalta, Teresa; French, Pim; Jaminé, David; Bekka, Nawal; Lacombe, Denis; van den Bent, Martin J; Gorlia, Thierry

    2015-06-10

    With the rapid discovery of prognostic and predictive molecular parameters for glioma, the status of histopathology in the diagnostic process should be scrutinized. Our project aimed to construct a diagnostic algorithm for gliomas based on molecular and histologic parameters with independent prognostic values. The pathology slides of 636 patients with gliomas who had been included in EORTC 26951 and 26882 trials were reviewed using virtual microscopy by a panel of six neuropathologists who independently scored 18 histologic features and provided an overall diagnosis. The molecular data for IDH1, 1p/19q loss, EGFR amplification, loss of chromosome 10 and chromosome arm 10q, gain of chromosome 7, and hypermethylation of the promoter of MGMT were available for some of the cases. The slides were divided in discovery (n = 426) and validation sets (n = 210). The diagnostic algorithm resulting from analysis of the discovery set was validated in the latter. In 66% of cases, consensus of overall diagnosis was present. A diagnostic algorithm consisting of two molecular markers and one consensus histologic feature was created by conditional inference tree analysis. The order of prognostic significance was: 1p/19q loss, EGFR amplification, and astrocytic morphology, which resulted in the identification of four diagnostic nodes. Validation of the nodes in the validation set confirmed the prognostic value (P < .001). We succeeded in the creation of a timely diagnostic algorithm for anaplastic glioma based on multivariable analysis of consensus histopathology and molecular parameters. © 2015 by American Society of Clinical Oncology.

  14. Impact of computational structure-based methods on drug discovery.

    PubMed

    Reynolds, Charles H

    2014-01-01

    Structure-based drug design has become an indispensable tool in drug discovery. The emergence of structure-based design is due to gains in structural biology that have provided exponential growth in the number of protein crystal structures, new computational algorithms and approaches for modeling protein-ligand interactions, and the tremendous growth of raw computer power in the last 30 years. Computer modeling and simulation have made major contributions to the discovery of many groundbreaking drugs in recent years. Examples are presented that highlight the evolution of computational structure-based design methodology, and the impact of that methodology on drug discovery.

  15. Distribution and diversity of ribosome binding sites in prokaryotic genomes.

    PubMed

    Omotajo, Damilola; Tate, Travis; Cho, Hyuk; Choudhary, Madhusudan

    2015-08-14

    Prokaryotic translation initiation involves the proper docking, anchoring, and accommodation of mRNA to the 30S ribosomal subunit. Three initiation factors (IF1, IF2, and IF3) and some ribosomal proteins mediate the assembly and activation of the translation initiation complex. Although the interaction between the Shine-Dalgarno (SD) sequence and its complementary sequence in the 16S rRNA is important in initiation, some genes lacking an SD ribosome binding site (RBS) are still well expressed. The objective of this study is to examine the pattern of distribution and diversity of RBS in fully sequenced bacterial genomes. The following three hypotheses were tested: SD motifs are prevalent in bacterial genomes; all previously identified SD motifs are uniformly distributed across prokaryotes; and genes with specific cluster of orthologous gene (COG) functions differ in their use of SD motifs. Data for 2,458 bacterial genomes, previously generated by Prodigal (PROkaryotic DYnamic programming Gene-finding ALgorithm) and currently available at the National Center for Biotechnology Information (NCBI), were analyzed. Of the total genes examined, ~77.0% use an SD RBS, while ~23.0% have no RBS. The majority of the genes with the most common SD motifs are distributed in a manner that is representative of their abundance for each COG functional category, while motifs 13 (5'-GGA-3'/5'-GAG-3'/5'-AGG-3') and 27 (5'-AGGAGG-3') appear to be predominantly used by genes for information storage and processing, and translation and ribosome biogenesis, respectively. These findings suggest that an SD sequence is not obligatory for translation initiation; instead, other signals, such as the RBS spacer, may have an overarching influence on translation of mRNAs. Subsequent analyses of the 5' secondary structure of these mRNAs may provide further insight into the translation initiation mechanism.
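    Counting how many genes carry an SD motif upstream of the start codon can be sketched as a simple scan. This is an illustration only and not how Prodigal scores ribosome binding sites: the function names are hypothetical, the motif list contains just the four sequences quoted in the abstract, and the upstream windows are made-up strings that end immediately before the start codon.

```python
# Motifs quoted in the abstract, longest first so the most specific match wins.
SD_MOTIFS = ["AGGAGG", "GGA", "GAG", "AGG"]


def find_sd(upstream, motifs=SD_MOTIFS):
    """Return (motif, spacer) for the first motif found, or None.

    spacer is the number of bases between the motif's end and the start codon,
    assuming `upstream` ends right before the start codon.
    """
    upstream = upstream.upper()
    for motif in motifs:
        pos = upstream.rfind(motif)  # occurrence closest to the start codon
        if pos != -1:
            return motif, len(upstream) - (pos + len(motif))
    return None


def sd_fraction(upstreams):
    """Fraction of upstream windows that contain any listed SD motif."""
    hits = sum(1 for u in upstreams if find_sd(u) is not None)
    return hits / len(upstreams)


windows = ["TTTAGGAGGTTTTTTT", "TTTTTTTTTT", "ACGGATTTTT", "CCCCCCCCCC"]
hit = find_sd(windows[0])
fraction = sd_fraction(windows)
```

    The spacer length returned alongside the motif is the quantity the abstract points to as a possible additional signal for translation initiation.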

  16. Comprehensive peptidomimetic libraries targeting protein-protein interactions.

    PubMed

    Whitby, Landon R; Boger, Dale L

    2012-10-16

    Transient protein-protein interactions (PPIs) are essential components in cellular signaling pathways as well as in important processes such as viral infection, replication, and immune suppression. The unknown or uncharacterized PPIs involved in such interaction networks often represent compelling therapeutic targets for drug discovery. To date, however, the main strategies for discovery of small molecule modulators of PPIs are typically limited to structurally characterized targets. Recent developments in molecular scaffolds that mimic the side chain display of peptide secondary structures have yielded effective designs, but few screening libraries of such mimetics are available to interrogate PPI targets. We initiated a program to prepare a comprehensive small molecule library designed to mimic the three major recognition motifs that mediate PPIs (α-helix, β-turn, and β-strand). Three libraries would be built around templates designed to mimic each such secondary structure and substituted with all triplet combinations of groups representing the 20 natural amino acid side chains. When combined, the three libraries would contain a member capable of mimicking the key interaction and recognition residues of most targetable PPIs. In this Account, we summarize the results of the design, synthesis, and validation of an 8000 member α-helix mimetic library and a 4200 member β-turn mimetic library. We expect that the screening of these libraries will not only provide lead structures against α-helix- or β-turn-mediated protein-protein or peptide-receptor interactions, even if the nature of the interaction is unknown, but also yield key insights into the recognition motif (α-helix or β-turn) and identify the key residues mediating the interaction. 
    Consistent with this expectation, the screening of the libraries against p53/MDM2 and HIV-1 gp41 (α-helix mimetic library) or the opioid receptors (β-turn mimetic library) led to the discovery of library members expected to mimic the known endogenous ligands. These efforts led to the discovery of high-affinity α-helix mimetics (Ki = 0.7 μM) against HIV-1 gp41 as well as high-affinity and selective β-turn mimetics (Ki = 80 nM) against the κ-opioid receptor. The results suggest that the use of such comprehensive libraries of peptide secondary structure mimetics, built around effective molecular scaffolds, constitutes a powerful method of interrogating PPIs. These structures provide small molecule modulators of PPI networks for therapeutic target validation, lead compound discovery, and the identification of modulators of biological processes for further study.
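    The 8000-member size of the α-helix mimetic library is consistent with taking all ordered triplets of the 20 natural amino acid side chains, since 20 x 20 x 20 = 8000. The short sketch below enumerates those triplets; the one-letter codes and the function name are illustrative, and the sketch says nothing about how the 4200-member β-turn library was composed.

```python
from itertools import product

# One-letter codes for the 20 natural amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def triplet_library(side_chains=AMINO_ACIDS):
    """All ordered triplets of side chains, one per hypothetical library member."""
    return ["".join(t) for t in product(side_chains, repeat=3)]


library = triplet_library()
library_size = len(library)  # 20 ** 3
```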

  17. Viral Protein Inhibits RISC Activity by Argonaute Binding through Conserved WG/GW Motifs

    PubMed Central

    García-Chapa, Meritxell; López-Moya, Juan José; Burgyán, József

    2010-01-01

    RNA silencing is an evolutionarily conserved sequence-specific gene-inactivation system that also functions as an antiviral mechanism in higher plants and insects. To overcome antiviral RNA silencing, viruses express silencing-suppressor proteins. These viral proteins can target one or more key points in the silencing machinery. Here we show that in Sweet potato mild mottle virus (SPMMV, type member of the Ipomovirus genus, family Potyviridae), the role of silencing suppressor is played by the P1 protein (the largest serine protease among all known potyvirids) despite the presence in its genome of an HC-Pro protein, which, in potyviruses, acts as the suppressor. Using in vivo studies we have demonstrated that SPMMV P1 inhibits si/miRNA-programmed RISC activity. Inhibition of RISC activity occurs by binding P1 to mature high molecular weight RISC, as we have shown by immunoprecipitation. Our results revealed that P1 targets Argonaute1 (AGO1), the catalytic unit of RISC, and that suppressor/binding activities are localized at the N-terminal half of P1. In this region three WG/GW motifs were found resembling the AGO-binding linear peptide motif conserved in metazoans and plants. Site-directed mutagenesis proved that these three motifs are absolutely required for both binding and suppression of AGO1 function. In contrast to other viral silencing suppressors analyzed so far P1 inhibits both existing and de novo formed AGO1 containing RISC complexes. Thus P1 represents a novel RNA silencing suppressor mechanism. The discovery of the molecular bases of P1 mediated silencing suppression may help to get better insight into the function and assembly of the poorly explored multiprotein containing RISC. PMID:20657820

  18. RNA-Targeted Therapeutics.

    PubMed

    Crooke, Stanley T; Witztum, Joseph L; Bennett, C Frank; Baker, Brenda F

    2018-04-03

    RNA-targeted therapies represent a platform for drug discovery involving chemically modified oligonucleotides, a wide range of cellular RNAs, and a novel target-binding motif, Watson-Crick base pairing. Numerous hurdles considered by many to be impassable have been overcome. Today, four RNA-targeted therapies are approved for commercial use for indications as diverse as Spinal Muscular Atrophy (SMA) and reduction of low-density lipoprotein cholesterol (LDL-C) and by routes of administration including subcutaneous, intravitreal, and intrathecal delivery. The technology is efficient and supports approaching "undruggable" targets. Three additional agents are progressing through registration, and more are in clinical development, representing several chemical and structural classes. Moreover, progress in understanding the molecular mechanisms by which these drugs work has led to steadily better clinical performance and a wide range of mechanisms that may be exploited for therapeutic purposes. Here we summarize the progress, future challenges, and opportunities for this drug discovery platform. Copyright © 2018 Elsevier Inc. All rights reserved.

  19. PRIM versus CART in subgroup discovery: when patience is harmful.

    PubMed

    Abu-Hanna, Ameen; Nannings, Barry; Dongelmans, Dave; Hasman, Arie

    2010-10-01

    We systematically compare the established algorithms CART (Classification and Regression Trees) and PRIM (Patient Rule Induction Method) in a subgroup discovery task on a large real-world high-dimensional clinical database. Contrary to current conjectures, PRIM's performance was generally inferior to CART's. PRIM often considered "peeling off" a large chunk of data at a value of a relevant discrete ordinal variable unattractive, ultimately missing an important subgroup. This finding has considerable significance in clinical medicine where ordinal scores are ubiquitous. PRIM's utility in clinical databases would increase when global information about (ordinal) variables is better put to use and when the search algorithm keeps track of alternative solutions.

  20. A Brokering Protocol for Agent-Based Grid Resource Discovery

    NASA Astrophysics Data System (ADS)

    Kang, Jaeyong; Sim, Kwang Mong

    Resource discovery is one of the basic and key aspects in grid resource management, which aims at searching for the suitable resources for satisfying the requirement of users' applications. This paper introduces an agent-based brokering protocol which connects users and providers in grid environments. In particular, it focuses on addressing the problem of connecting users and providers. A connection algorithm that matches advertisements of users and requests from providers based on pre-specified multiple criteria is devised and implemented. The connection algorithm mainly consists of four stages: selection, evaluation, filtering, and recommendation. A series of experiments was carried out in executing the protocol, and favorable results were obtained.
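    The four stages named in the abstract (selection, evaluation, filtering, recommendation) can be pictured as a small pipeline. The sketch below is an assumption-laden illustration, not the authors' protocol: the record fields, the scoring rule (CPU speed per unit price), the hard constraints, and the direction of matching (one request against several advertisements) are all hypothetical choices made only to show how such stages might compose.

```python
from dataclasses import dataclass

@dataclass
class Advertisement:          # hypothetical advertisement record
    provider: str
    resource_type: str
    cpu_ghz: float
    price: float

@dataclass
class Request:                # hypothetical request record
    resource_type: str
    min_cpu_ghz: float
    max_price: float

def connect(request, ads, top_k=2):
    # 1. Selection: keep advertisements of the requested resource type.
    selected = [a for a in ads if a.resource_type == request.resource_type]
    # 2. Evaluation: score each candidate (assumed criterion: GHz per unit price).
    scored = [(a.cpu_ghz / a.price, a) for a in selected]
    # 3. Filtering: drop candidates that violate the hard constraints.
    feasible = [(s, a) for s, a in scored
                if a.cpu_ghz >= request.min_cpu_ghz and a.price <= request.max_price]
    # 4. Recommendation: return the best-scoring providers.
    feasible.sort(key=lambda pair: pair[0], reverse=True)
    return [a.provider for _, a in feasible[:top_k]]

ads = [
    Advertisement("A", "cpu", 2.0, 4.0),
    Advertisement("B", "cpu", 3.0, 5.0),
    Advertisement("C", "storage", 0.0, 1.0),
    Advertisement("D", "cpu", 1.0, 1.0),   # cheap but too slow for the request
]
recommended = connect(Request("cpu", min_cpu_ghz=2.0, max_price=5.0), ads)
print(recommended)
```

    Keeping each stage as a separate list transformation makes it easy to swap in different criteria, which is presumably the point of a multi-criteria design.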

  1. Translational bioinformatics: linking the molecular world to the clinical world.

    PubMed

    Altman, R B

    2012-06-01

    Translational bioinformatics represents the union of translational medicine and bioinformatics. Translational medicine moves basic biological discoveries from the research bench into the patient-care setting and uses clinical observations to inform basic biology. It focuses on patient care, including the creation of new diagnostics, prognostics, prevention strategies, and therapies based on biological discoveries. Bioinformatics involves algorithms to represent, store, and analyze basic biological data, including DNA sequence, RNA expression, and protein and small-molecule abundance within cells. Translational bioinformatics spans these two fields; it involves the development of algorithms to analyze basic molecular and cellular data with an explicit goal of affecting clinical care.

  2. NNAlign: A Web-Based Prediction Method Allowing Non-Expert End-User Discovery of Sequence Motifs in Quantitative Peptide Data

    PubMed Central

    Andreatta, Massimo; Schafer-Nielsen, Claus; Lund, Ole; Buus, Søren; Nielsen, Morten

    2011-01-01

    Recent advances in high-throughput technologies have made it possible to generate both gene and protein sequence data at an unprecedented rate and scale thereby enabling entirely new “omics”-based approaches towards the analysis of complex biological processes. However, the amount and complexity of data that even a single experiment can produce seriously challenges researchers with limited bioinformatics expertise, who need to handle, analyze and interpret the data before it can be understood in a biological context. Thus, there is an unmet need for tools allowing non-bioinformatics users to interpret large data sets. We have recently developed a method, NNAlign, which is generally applicable to any biological problem where quantitative peptide data is available. This method efficiently identifies underlying sequence patterns by simultaneously aligning peptide sequences and identifying motifs associated with quantitative readouts. Here, we provide a web-based implementation of NNAlign allowing non-expert end-users to submit their data (optionally adjusting method parameters), and in return receive a trained method (including a visual representation of the identified motif) that subsequently can be used as prediction method and applied to unknown proteins/peptides. We have successfully applied this method to several different data sets including peptide microarray-derived sets containing more than 100,000 data points. NNAlign is available online at http://www.cbs.dtu.dk/services/NNAlign. PMID:22073191

  3. NNAlign: a web-based prediction method allowing non-expert end-user discovery of sequence motifs in quantitative peptide data.

    PubMed

    Andreatta, Massimo; Schafer-Nielsen, Claus; Lund, Ole; Buus, Søren; Nielsen, Morten

    2011-01-01

    Recent advances in high-throughput technologies have made it possible to generate both gene and protein sequence data at an unprecedented rate and scale thereby enabling entirely new "omics"-based approaches towards the analysis of complex biological processes. However, the amount and complexity of data that even a single experiment can produce seriously challenges researchers with limited bioinformatics expertise, who need to handle, analyze and interpret the data before it can be understood in a biological context. Thus, there is an unmet need for tools allowing non-bioinformatics users to interpret large data sets. We have recently developed a method, NNAlign, which is generally applicable to any biological problem where quantitative peptide data is available. This method efficiently identifies underlying sequence patterns by simultaneously aligning peptide sequences and identifying motifs associated with quantitative readouts. Here, we provide a web-based implementation of NNAlign allowing non-expert end-users to submit their data (optionally adjusting method parameters), and in return receive a trained method (including a visual representation of the identified motif) that subsequently can be used as prediction method and applied to unknown proteins/peptides. We have successfully applied this method to several different data sets including peptide microarray-derived sets containing more than 100,000 data points. NNAlign is available online at http://www.cbs.dtu.dk/services/NNAlign.

  4. When Heterotrimeric G Proteins Are Not Activated by G Protein-Coupled Receptors: Structural Insights and Evolutionary Conservation.

    PubMed

    DiGiacomo, Vincent; Marivin, Arthur; Garcia-Marcos, Mikel

    2018-01-23

    Heterotrimeric G proteins are signal-transducing switches conserved across eukaryotes. In humans, they work as critical mediators of intercellular communication in the context of virtually any physiological process. While G protein regulation by G protein-coupled receptors (GPCRs) is well-established and has received much attention, it has become recently evident that heterotrimeric G proteins can also be activated by cytoplasmic proteins. However, this alternative mechanism of G protein regulation remains far less studied than GPCR-mediated signaling. This Viewpoint focuses on recent advances in the characterization of a group of nonreceptor proteins that contain a sequence dubbed the "Gα-binding and -activating (GBA) motif". So far, four proteins present in mammals [GIV (also known as Girdin), DAPLE, CALNUC, and NUCB2] and one protein in Caenorhabditis elegans (GBAS-1) have been described as possessing a functional GBA motif. The GBA motif confers guanine nucleotide exchange factor activity on Gαi subunits in vitro and activates G protein signaling in cells. The importance of this mechanism of signal transduction is highlighted by the fact that its dysregulation underlies human diseases, such as cancer, which has made the proteins attractive new candidates for therapeutic intervention. Here we discuss recent discoveries on the structural basis of GBA-mediated activation of G proteins and its evolutionary conservation and compare them with the better-studied mechanism mediated by GPCRs.

  5. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hamdani, Hazrina Yusof, E-mail: hazrina@mfrlab.org; Advanced Medical and Dental Institute, Universiti Sains Malaysia, Bertam, Kepala Batas; Artymiuk, Peter J., E-mail: p.artymiuk@sheffield.ac.uk

    A fundamental understanding of the atomic level interactions in ribonucleic acid (RNA) and how they contribute towards RNA architecture is an important knowledge platform to develop through the discovery of motifs from simple arrangements of base pairs, to more complex arrangements such as triples and larger patterns involving non-standard interactions. The network of hydrogen bond interactions is important in connecting bases to form potential tertiary motifs. Therefore, there is an urgent need for the development of automated methods for annotating RNA 3D structures based on hydrogen bond interactions. COnnection tables Graphs for Nucleic ACids (COGNAC) is an automated annotation system using graph theoretical approaches that has been developed for the identification of RNA 3D motifs. This program searches for patterns in the unbroken networks of hydrogen bonds for RNA structures and is capable of annotating base pairs and higher-order base interactions, which range from triples to sextuples. COGNAC was able to discover 22 out of 32 quadruple occurrences of the Haloarcula marismortui large ribosomal subunit (PDB ID: 1FFK) and two out of three occurrences of quintuple interaction reported by the non-canonical interactions in RNA (NCIR) database. These and several other interactions of interest will be discussed in this paper. These examples demonstrate that the COGNAC program can serve as an automated annotation system that can be used to annotate conserved base-base interactions and could be added as additional information to established RNA secondary structure prediction methods.

  6. Regulation of gene expression by the BLM helicase correlates with the presence of G-quadruplex DNA motifs

    PubMed Central

    Nguyen, Giang Huong; Tang, Weiliang; Robles, Ana I.; Beyer, Richard P.; Gray, Lucas T.; Welsh, Judith A.; Schetter, Aaron J.; Kumamoto, Kensuke; Wang, Xin Wei; Hickson, Ian D.; Maizels, Nancy; Monnat, Raymond J.; Harris, Curtis C.

    2014-01-01

    Bloom syndrome is a rare autosomal recessive disorder characterized by genetic instability and cancer predisposition, and caused by mutations in the gene encoding the Bloom syndrome, RecQ helicase-like (BLM) protein. To determine whether altered gene expression might be responsible for pathological features of Bloom syndrome, we analyzed mRNA and microRNA (miRNA) expression in fibroblasts from individuals with Bloom syndrome and in BLM-depleted control fibroblasts. We identified mRNA and miRNA expression differences in Bloom syndrome patient and BLM-depleted cells. Differentially expressed mRNAs are connected with cell proliferation, survival, and molecular mechanisms of cancer, and differentially expressed miRNAs target genes involved in cancer and in immune function. These and additional altered functions or pathways may contribute to the proportional dwarfism, elevated cancer risk, immune dysfunction, and other features observed in Bloom syndrome individuals. BLM binds to G-quadruplex (G4) DNA, and G4 motifs were enriched at transcription start sites (TSS) and especially within first introns (false discovery rate ≤ 0.001) of differentially expressed mRNAs in Bloom syndrome compared with normal cells, suggesting that G-quadruplex structures formed at these motifs are physiologic targets for BLM. These results identify a network of mRNAs and miRNAs that may drive the pathogenesis of Bloom syndrome. PMID:24958861
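    To make the notion of a "G4 motif" concrete, the sketch below scans a DNA string with a commonly used sequence pattern: four runs of at least three guanines separated by loops of one to seven bases. This pattern is an assumption for illustration; the abstract does not state which G4 definition the authors used, and the function name and example sequence are hypothetical. Only the given strand is scanned.

```python
import re

# Assumed G4 definition: G{3,} N{1,7} G{3,} N{1,7} G{3,} N{1,7} G{3,}
G4_PATTERN = re.compile(
    r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}"
)

def find_g4_motifs(dna):
    """Return (0-based start, matched text) for non-overlapping G4 candidates."""
    seq = dna.upper()
    return [(m.start(), m.group()) for m in G4_PATTERN.finditer(seq)]

# Toy sequence containing a telomere-like repeat (illustrative only).
g4_hits = find_g4_motifs("ATGGGTTAGGGTTAGGGTTAGGGCA")
print(g4_hits)
```

    An enrichment analysis like the one described would then compare the frequency of such hits near transcription start sites or in first introns against a background set, a step this sketch does not attempt.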

  7. Regulation of gene expression by the BLM helicase correlates with the presence of G-quadruplex DNA motifs.

    PubMed

    Nguyen, Giang Huong; Tang, Weiliang; Robles, Ana I; Beyer, Richard P; Gray, Lucas T; Welsh, Judith A; Schetter, Aaron J; Kumamoto, Kensuke; Wang, Xin Wei; Hickson, Ian D; Maizels, Nancy; Monnat, Raymond J; Harris, Curtis C

    2014-07-08

    Bloom syndrome is a rare autosomal recessive disorder characterized by genetic instability and cancer predisposition, and caused by mutations in the gene encoding the Bloom syndrome, RecQ helicase-like (BLM) protein. To determine whether altered gene expression might be responsible for pathological features of Bloom syndrome, we analyzed mRNA and microRNA (miRNA) expression in fibroblasts from individuals with Bloom syndrome and in BLM-depleted control fibroblasts. We identified mRNA and miRNA expression differences in Bloom syndrome patient and BLM-depleted cells. Differentially expressed mRNAs are connected with cell proliferation, survival, and molecular mechanisms of cancer, and differentially expressed miRNAs target genes involved in cancer and in immune function. These and additional altered functions or pathways may contribute to the proportional dwarfism, elevated cancer risk, immune dysfunction, and other features observed in Bloom syndrome individuals. BLM binds to G-quadruplex (G4) DNA, and G4 motifs were enriched at transcription start sites (TSS) and especially within first introns (false discovery rate ≤ 0.001) of differentially expressed mRNAs in Bloom syndrome compared with normal cells, suggesting that G-quadruplex structures formed at these motifs are physiologic targets for BLM. These results identify a network of mRNAs and miRNAs that may drive the pathogenesis of Bloom syndrome.

  8. Simultaneous alignment and clustering of peptide data using a Gibbs sampling approach.

    PubMed

    Andreatta, Massimo; Lund, Ole; Nielsen, Morten

    2013-01-01

    Proteins recognizing short peptide fragments play a central role in cellular signaling. As a result of high-throughput technologies, peptide-binding protein specificities can be studied using large peptide libraries at dramatically lower cost and time. Interpretation of such large peptide datasets, however, is a complex task, especially when the data contain multiple receptor binding motifs, and/or the motifs are found at different locations within distinct peptides. The algorithm presented in this article, based on Gibbs sampling, identifies multiple specificities in peptide data by performing two essential tasks simultaneously: alignment and clustering of peptide data. We apply the method to de-convolute binding motifs in a panel of peptide datasets with different degrees of complexity spanning from the simplest case of pre-aligned fixed-length peptides to cases of unaligned peptide datasets of variable length. Example applications described in this article include mixtures of binders to different MHC class I and class II alleles, distinct classes of ligands for SH3 domains and sub-specificities of the HLA-A*02:01 molecule. The Gibbs clustering method is available online as a web server at http://www.cbs.dtu.dk/services/GibbsCluster.

  9. Parallel implementation of D-Phylo algorithm for maximum likelihood clusters.

    PubMed

    Malik, Shamita; Sharma, Dolly; Khatri, Sunil Kumar

    2017-03-01

    This study explains a newly developed parallel algorithm for phylogenetic analysis of DNA sequences. The newly designed D-Phylo is a more advanced algorithm for phylogenetic analysis using a maximum likelihood approach. D-Phylo, while exploiting the search capability of k-means, avoids its main limitation of getting stuck at locally conserved motifs. The authors have tested the behaviour of D-Phylo on an Amazon Linux Amazon Machine Image (Hardware Virtual Machine) i2.4xlarge instance, with six central processing units, 122 GiB memory, 8 × 800 Solid-state drive Elastic Block Store volume, and high network performance, on up to 15 processors for several real-life datasets. Distributing the clusters evenly across all the processors makes it possible to achieve a near-linear speedup when a large number of processors is used.

  10. Proof-Term Synthesis on Dependent-Type Systems via Explicit Substitutions

    DTIC Science & Technology

    1999-11-01

    oriented functional language OCaml, in about 50 lines. We have also implemented a higher-order unification algorithm for ground expressions. The soundness... OCaml, and it is electronically available by contacting the author. The underlying theory of the method proposed here is the An^-calculus. We believe... CORNES, Conception d’un langage de haut niveau de representation de preuves: recurrence par filtrage de motifs, unification en presence de types

  11. A prior-based integrative framework for functional transcriptional regulatory network inference

    PubMed Central

    Siahpirani, Alireza F.

    2017-01-01

    Transcriptional regulatory networks specify regulatory proteins controlling the context-specific expression levels of genes. Inference of genome-wide regulatory networks is central to understanding gene regulation, but remains an open challenge. Expression-based network inference is among the most popular methods to infer regulatory networks; however, networks inferred from such methods have low overlap with experimentally derived (e.g. ChIP-chip and transcription factor (TF) knockouts) networks. Currently we have a limited understanding of this discrepancy. To address this gap, we first develop a regulatory network inference algorithm, based on probabilistic graphical models, to integrate expression with auxiliary datasets supporting a regulatory edge. Second, we comprehensively analyze our and other state-of-the-art methods on different expression perturbation datasets. Networks inferred by integrating sequence-specific motifs with expression have substantially greater agreement with experimentally derived networks, while remaining more predictive of expression than motif-based networks. Our analysis suggests natural genetic variation as the most informative perturbation for network inference, and identifies core TFs whose targets are predictable from expression. Multiple reasons make the identification of targets of other TFs difficult, including network architecture and insufficient variation of TF mRNA level. Finally, we demonstrate the utility of our inference algorithm to infer stress-specific regulatory networks and for regulator prioritization. PMID:27794550

  12. Identification of Patients with Family History of Pancreatic Cancer--Investigation of an NLP System Portability.

    PubMed

    Mehrabi, Saeed; Krishnan, Anand; Roch, Alexandra M; Schmidt, Heidi; Li, DingCheng; Kesterson, Joe; Beesley, Chris; Dexter, Paul; Schmidt, Max; Palakal, Mathew; Liu, Hongfang

    2015-01-01

    In this study we have developed a rule-based natural language processing (NLP) system to identify patients with family history of pancreatic cancer. The algorithm was developed in an Unstructured Information Management Architecture (UIMA) framework and consisted of section segmentation, relation discovery, and negation detection. The system was evaluated on data from two institutions. The family history identification precision was consistent across the institutions, shifting from 88.9% on the Indiana University (IU) dataset to 87.8% on the Mayo Clinic dataset. Customizing the algorithm on the Mayo Clinic data increased its precision to 88.1%. The family member relation discovery achieved precision, recall, and F-measure of 75.3%, 91.6% and 82.6% respectively. Negation detection resulted in precision of 99.1%. The results show that rule-based NLP approaches for specific information extraction tasks are portable across institutions; however, customization of the algorithm on the new dataset improves its performance.
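    The reported F-measure can be sanity-checked from the reported precision and recall, assuming the usual balanced F1 definition (the harmonic mean of the two); the abstract does not say which F-measure variant was used, so that is an assumption of this sketch.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values as reported for family member relation discovery, in percent.
f1 = f_measure(75.3, 91.6)
print(f"F1 = {f1:.2f}%")
```

    The result is about 82.65%, which agrees with the reported 82.6% to within rounding of the published precision and recall.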

  13. Closed-Loop Multitarget Optimization for Discovery of New Emulsion Polymerization Recipes

    PubMed Central

    2015-01-01

    Self-optimization of chemical reactions enables faster optimization of reaction conditions or discovery of molecules with required target properties. The technology of self-optimization has been expanded to discovery of new process recipes for manufacture of complex functional products. A new machine-learning algorithm, specifically designed for multiobjective target optimization with an explicit aim to minimize the number of “expensive” experiments, guides the discovery process. This “black-box” approach assumes no a priori knowledge of the chemical system and is hence particularly suited to rapid development of processes to manufacture specialist low-volume, high-value products. The approach was demonstrated in discovery of process recipes for a semibatch emulsion copolymerization, targeting a specific particle size and full conversion. PMID:26435638

  14. Domain-specific Web Service Discovery with Service Class Descriptions

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Rocco, D; Caverlee, J; Liu, L

    2005-02-14

    This paper presents DynaBot, a domain-specific web service discovery system. The core idea of the DynaBot service discovery system is to use domain-specific service class descriptions powered by an intelligent Deep Web crawler. In contrast to current registry-based service discovery systems--like the several available UDDI registries--DynaBot promotes focused crawling of the Deep Web of services and discovers candidate services that are relevant to the domain of interest. It uses intelligent filtering algorithms to match services found by focused crawling with the domain-specific service class descriptions. We demonstrate the capability of DynaBot through the BLAST service discovery scenario and describe our initial experience with DynaBot.

  15. Bioinformatics in translational drug discovery.

    PubMed

    Wooller, Sarah K; Benstead-Hume, Graeme; Chen, Xiangrong; Ali, Yusuf; Pearl, Frances M G

    2017-08-31

    Bioinformatics approaches are becoming ever more essential in translational drug discovery both in academia and within the pharmaceutical industry. Computational exploitation of the increasing volumes of data generated during all phases of drug discovery is enabling key challenges of the process to be addressed. Here, we highlight some of the areas in which bioinformatics resources and methods are being developed to support the drug discovery pipeline. These include the creation of large data warehouses, bioinformatics algorithms to analyse 'big data' that identify novel drug targets and/or biomarkers, programs to assess the tractability of targets, and prediction of repositioning opportunities that use licensed drugs to treat additional indications. © 2017 The Author(s).

  16. [GNU Pattern: open source pattern hunter for biological sequences based on SPLASH algorithm].

    PubMed

    Xu, Ying; Li, Yi-xue; Kong, Xiang-yin

    2005-06-01

    To construct a high performance open source software engine based on IBM SPLASH algorithm for later research on pattern discovery. Gpat, which is based on SPLASH algorithm, was developed by using open source software. GNU Pattern (Gpat) software was developed, which efficiently implemented the core part of SPLASH algorithm. Full source code of Gpat was also available for other researchers to modify the program under the GNU license. Gpat is a successful implementation of SPLASH algorithm and can be used as a basic framework for later research on pattern recognition in biological sequences.

  17. An unsupervised hierarchical dynamic self-organizing approach to cancer class discovery and marker gene identification in microarray data.

    PubMed

    Hsu, Arthur L; Tang, Sen-Lin; Halgamuge, Saman K

    2003-11-01

    Current Self-Organizing Maps (SOMs) approaches to gene expression pattern clustering require the user to predefine the number of clusters likely to be expected. Hierarchical clustering methods used in this area do not provide unique partitioning of data. We describe an unsupervised dynamic hierarchical self-organizing approach, which suggests an appropriate number of clusters, to perform class discovery and marker gene identification in microarray data. In the process of class discovery, the proposed algorithm identifies corresponding sets of predictor genes that best distinguish one class from other classes. The approach integrates merits of hierarchical clustering with robustness against noise known from self-organizing approaches. The proposed algorithm applied to DNA microarray data sets of two types of cancers has demonstrated its ability to produce the most suitable number of clusters. Further, the corresponding marker genes identified through the unsupervised algorithm also have a strong biological relationship to the specific cancer class. The algorithm tested on leukemia microarray data, which contains three leukemia types, was able to determine three major and one minor cluster. Prediction models built for the four clusters indicate that the prediction strength for the smaller cluster is generally low, therefore labelled as uncertain cluster. Further analysis shows that the uncertain cluster can be subdivided further, and the subdivisions are related to two of the original clusters. Another test performed using colon cancer microarray data has automatically derived two clusters, which is consistent with the number of classes in data (cancerous and normal). JAVA software of dynamic SOM tree algorithm is available upon request for academic use. A comparison of rectangular and hexagonal topologies for GSOM is available from http://www.mame.mu.oz.au/mechatronics/journalinfo/Hsu2003supp.pdf

  18. Specificity determinants for the abscisic acid response element.

    PubMed

    Sarkar, Aditya Kumar; Lahiri, Ansuman

    2013-01-01

    Abscisic acid (ABA) response elements (ABREs) are a group of cis-acting DNA elements that have been identified from promoter analysis of many ABA-regulated genes in plants. We are interested in understanding the mechanism of binding specificity between ABREs and a class of bZIP transcription factors known as ABRE binding factors (ABFs). In this work, we have modeled the homodimeric structure of the bZIP domain of ABRE binding factor 1 from Arabidopsis thaliana (AtABF1) and studied its interaction with ACGT core motif-containing ABRE sequences. We have also examined the variation in the stability of the protein-DNA complex upon mutating ABRE sequences using the protein design algorithm FoldX. The high throughput free energy calculations successfully predicted the ability of ABF1 to bind to alternative core motifs like GCGT or AAGT and also rationalized the role of the flanking sequences in determining the specificity of the protein-DNA interaction.

  19. Synthesis of most polyene natural product motifs using just 12 building blocks and one coupling reaction.

    PubMed

    Woerly, Eric M; Roy, Jahnabi; Burke, Martin D

    2014-06-01

    The inherent modularity of polypeptides, oligonucleotides and oligosaccharides has been harnessed to achieve generalized synthesis platforms. Importantly, like these other targets, most small-molecule natural products are biosynthesized via iterative coupling of bifunctional building blocks. This suggests that many small molecules also possess inherent modularity commensurate with systematic building block-based construction. Supporting this hypothesis, here we report that the polyene motifs found in >75% of all known polyene natural products can be synthesized using just 12 building blocks and one coupling reaction. Using the same general retrosynthetic algorithm and reaction conditions, this platform enabled both the synthesis of a wide range of polyene frameworks that covered all of this natural-product chemical space and the first total syntheses of the polyene natural products asnipyrone B, physarigin A and neurosporaxanthin β-D-glucopyranoside. Collectively, these results suggest the potential for a more generalized approach to making small molecules in the laboratory.

  20. Synthesis of most polyene natural product motifs using just 12 building blocks and one coupling reaction

    NASA Astrophysics Data System (ADS)

    Woerly, Eric M.; Roy, Jahnabi; Burke, Martin D.

    2014-06-01

    The inherent modularity of polypeptides, oligonucleotides and oligosaccharides has been harnessed to achieve generalized synthesis platforms. Importantly, like these other targets, most small-molecule natural products are biosynthesized via iterative coupling of bifunctional building blocks. This suggests that many small molecules also possess inherent modularity commensurate with systematic building block-based construction. Supporting this hypothesis, here we report that the polyene motifs found in >75% of all known polyene natural products can be synthesized using just 12 building blocks and one coupling reaction. Using the same general retrosynthetic algorithm and reaction conditions, this platform enabled both the synthesis of a wide range of polyene frameworks that covered all of this natural-product chemical space and the first total syntheses of the polyene natural products asnipyrone B, physarigin A and neurosporaxanthin β-D-glucopyranoside. Collectively, these results suggest the potential for a more generalized approach to making small molecules in the laboratory.

  1. Targeted Fluoro Positioning for the Discovery of a Potent and Highly Selective Matrix Metalloproteinase Inhibitor.

    PubMed

    Fischer, Thomas; Riedl, Rainer

    2017-04-01

    Invited for this month's cover picture is the group of Professor Rainer Riedl from the Institute of Chemistry and Biotechnology at the Zurich University of Applied Sciences (ZHAW), Switzerland. The cover picture depicts the structure-based design of a drug-like small molecule inhibitor of matrix metalloproteinase-13 (MMP-13) with a combined dual binding motif. The targeted introduction of a single fluoro atom was of vital importance for the optimization of the inhibitor. For more details, read the full text of the Communication at 10.1002/open.201600158.

  2. Discovery of an α-Amino C–H Arylation Reaction Using the Strategy of Accelerated Serendipity

    PubMed Central

    McNally, Andrew; Prier, Christopher K.; MacMillan, David W. C.

    2012-01-01

    Serendipity has long been a welcome yet elusive phenomenon in the advancement of chemistry. We sought to exploit serendipity as a means of rapidly identifying unanticipated chemical transformations. By using a high-throughput, automated workflow and evaluating a large number of random reactions, we have discovered a photoredox-catalyzed C–H arylation reaction for the construction of benzylic amines, an important structural motif within pharmaceutical compounds that is not readily accessed via simple substrates. The mechanism directly couples tertiary amines with cyanoaromatics by using mild and operationally trivial conditions. PMID:22116882

  3. Bitter-tasting and kokumi-enhancing molecules in thermally processed avocado (Persea americana Mill.).

    PubMed

    Degenhardt, Andreas Georg; Hofmann, Thomas

    2010-12-22

    Sequential application of solvent extraction and RP-HPLC in combination with taste dilution analyses (TDA) and comparative TDA, followed by LC-MS and 1D/2D NMR experiments, led to the discovery of 10 C(17)-C(21) oxylipins with 1,2,4-trihydroxy-, 1-acetoxy-2,4-dihydroxy-, and 1-acetoxy-2-hydroxy-4-oxo motifs, respectively, besides 1-O-stearoyl-glycerol and 1-O-linoleoyl-glycerol as bitter-tasting compounds in thermally processed avocado (Persea americana Mill.). On the basis of quantitative data, dose-over-threshold (DoT) factors, and taste re-engineering experiments, these phytochemicals, among which 1-acetoxy-2-hydroxy-4-oxo-octadeca-12-ene was found with the highest taste impact, were confirmed to be the key contributors to the bitter off-taste developed upon thermal processing of avocado. For the first time, those C(17)-C(21) oxylipins exhibiting a 1-acetoxy-2,4-dihydroxy- and a 1-acetoxy-2-hydroxy-4-oxo motif, respectively, were discovered to induce a mouthfulness (kokumi)-enhancing activity in sub-bitter threshold concentrations.

  4. Insights on genome size evolution from a miniature inverted repeat transposon driving a satellite DNA.

    PubMed

    Scalvenzi, Thibault; Pollet, Nicolas

    2014-12-01

    The genome size in eukaryotes does not correlate well with the number of genes they contain. We can observe this so-called C-value paradox in amphibian species. By analyzing an amphibian genome we asked how repetitive DNA can impact genome size and architecture. We describe here our discovery of a Tc1/mariner miniature inverted-repeat transposon family present in Xenopus frogs. These transposons named miDNA4 are unique since they contain a satellite DNA motif. We found that miDNA4 measured 331 bp, contained 25 bp long inverted terminal repeat sequences and a sequence motif of 119 bp present as a unique copy or as an array of 2-47 copies. We characterized the structure, dynamics, impact and evolution of the miDNA4 family and its satellite DNA in Xenopus frog genomes. This led us to propose a model for the evolution of these two repeated sequences and how they can synergize to increase genome size. Copyright © 2014 Elsevier Inc. All rights reserved.

  5. Peptide Array X-Linking (PAX): A New Peptide-Protein Identification Approach

    PubMed Central

    Okada, Hirokazu; Uezu, Akiyoshi; Soderblom, Erik J.; Moseley, M. Arthur; Gertler, Frank B.; Soderling, Scott H.

    2012-01-01

    Many protein interaction domains bind short peptides based on canonical sequence consensus motifs. Here we report the development of a peptide array-based proteomics tool to identify proteins directly interacting with ligand peptides from cell lysates. Array-formatted bait peptides containing an amino acid-derived cross-linker are photo-induced to crosslink with interacting proteins from lysates of interest. Indirect associations are removed by high stringency washes under denaturing conditions. Covalently trapped proteins are subsequently identified by LC-MS/MS and screened by cluster analysis and domain scanning. We apply this methodology to peptides with different proline-containing consensus sequences and show successful identifications from brain lysates of known and novel proteins containing polyproline motif-binding domains such as EH, EVH1, SH3, WW domains. These results suggest the capacity of arrayed peptide ligands to capture and subsequently identify proteins by mass spectrometry is relatively broad and robust. Additionally, the approach is rapid and applicable to cell or tissue fractions from any source, making the approach a flexible tool for initial protein-protein interaction discovery. PMID:22606326

  6. Matching 4.7-Å XRD spacing in amelogenin nanoribbons and enamel matrix.

    PubMed

    Sanii, B; Martinez-Avila, O; Simpliciano, C; Zuckermann, R N; Habelitz, S

    2014-09-01

    The recent discovery of conditions that induce nanoribbon structures of amelogenin protein in vitro raises questions about their role in enamel formation. Nanoribbons of recombinant human full-length amelogenin (rH174) are about 17 nm wide and self-align into parallel bundles; thus, they could act as templates for crystallization of nanofibrous apatite comprising dental enamel. Here we analyzed the secondary structures of nanoribbon amelogenin by x-ray diffraction (XRD) and Fourier transform infrared spectroscopy (FTIR) and tested if the structural motif matches previous data on the organic matrix of enamel. XRD analysis showed that a peak corresponding to 4.7 Å is present in nanoribbons of amelogenin. In addition, FTIR analysis showed that amelogenin in the form of nanoribbons was comprised of β-sheets by up to 75%, while amelogenin nanospheres had predominantly random-coil structure. The observation of a 4.7-Å XRD spacing confirms the presence of β-sheets and illustrates structural parallels between the in vitro assemblies and structural motifs in developing enamel. © International & American Associations for Dental Research.

  7. Ms2lda.org: web-based topic modelling for substructure discovery in mass spectrometry.

    PubMed

    Wandy, Joe; Zhu, Yunfeng; van der Hooft, Justin J J; Daly, Rónán; Barrett, Michael P; Rogers, Simon

    2017-09-14

    We recently published MS2LDA, a method for the decomposition of sets of molecular fragment data derived from large metabolomics experiments. To make the method more widely available to the community, here we present ms2lda.org, a web application that allows users to upload their data, run MS2LDA analyses and explore the results through interactive visualisations. Ms2lda.org takes tandem mass spectrometry data in many standard formats and allows the user to infer the sets of fragment and neutral loss features that co-occur together (Mass2Motifs). As an alternative workflow, the user can also decompose a dataset onto predefined Mass2Motifs. This is accomplished through the web interface or programmatically from our web service. The website can be found at http://ms2lda.org , while the source code is available at https://github.com/sdrogers/ms2ldaviz under the MIT license. Supplementary data are available at Bioinformatics online. © The Author(s) 2017. Published by Oxford University Press.

  8. Mining nitrate concentration patterns from high-frequency in situ monitoring: a step towards more detailed understanding of hydrological processes?

    NASA Astrophysics Data System (ADS)

    Aubert, Alice; Houska, Tobias; Plesca, Ina; Kraft, Philipp; Breuer, Lutz

    2015-04-01

    Recently developed sensing techniques allow the collection of a considerable amount of high-frequency data, not only for hydrologic parameters (water levels, rainfall, etc.) but also for water chemistry. With devices such as in situ spectrophotometers, nitrate concentration can be monitored down to sub-hourly intervals. This opens the way to new questions: what about daily or sub-daily instream nitrate concentration variations? What do these newly observed variations tell us about hydrological processes? In the Vollnkirchener Bach catchment, a headwater creek flows through a human-impacted landscape dominated by agricultural and forest use and including a small settlement. Since March 2013, a Pro-PS device has been installed at the gauging station (monitored since 2011). Nitrate concentration is measured every 15 minutes, discharge and water temperature every 5 minutes. Data mining, more precisely motif discovery, is performed on these time series to identify high-resolution patterns. Spectral analysis highlighted that, in data measured at sub-hourly sampling frequency, variations up to a few hours are more likely to be dominated by measurement noise than by real-world fluctuations. Therefore, we focus on daily motifs and flood patterns (given the fact that hydrological conditions are changing during flood events, we assume that nitrate concentration changes are depicting real processes). Various flood motifs were extracted: (1) nitrate can either be diluted or (2) concentrated, or (3) both (dilution followed by a bumpy recession curve indicating nitrate enrichment at the end of the flood). In addition to these classical nutrient-discharge behaviors, a variety of other interesting motifs were highlighted. (4) A daily nitrate cycle is clearly observed, but only during a specific period of the year. (5) Lag-to-peak time between parameters differentiates flood patterns: sometimes nitrate peaks first, sometimes discharge peaks first. (6) Furthermore, we are able to pinpoint the contributions of a combined sewer overflow, as it creates a different motif from diffuse nitrate inflows from adjacent agricultural fields. We look into the other hydrological parameters to explain this variety of patterns and their occurrence time.
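
    The abstract does not say which motif discovery algorithm was applied to the nitrate series. As a minimal sketch of the general idea only, the brute-force search below finds the two most similar non-overlapping windows in a time series after z-normalisation. The function names, the Euclidean distance and the window length m are illustrative assumptions, not the authors' method.

```python
import math


def znorm(window):
    """Z-normalise a window; a constant window maps to all zeros."""
    n = len(window)
    mean = sum(window) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / n)
    if std == 0:
        return [0.0] * n
    return [(x - mean) / std for x in window]


def best_motif_pair(series, m):
    """Return (distance, i, j) for the closest pair of non-overlapping
    length-m windows starting at indices i < j (hypothetical helper)."""
    windows = [znorm(series[i:i + m]) for i in range(len(series) - m + 1)]
    best = (math.inf, -1, -1)
    for i in range(len(windows)):
        # j starts at i + m so the two windows never overlap
        for j in range(i + m, len(windows)):
            d = math.dist(windows[i], windows[j])
            if d < best[0]:
                best = (d, i, j)
    return best


# Toy series with the shape [0, 1, 4, 1, 0] planted at indices 0 and 8
series = [0, 1, 4, 1, 0, 9, 7, 8, 0, 1, 4, 1, 0, 3]
print(best_motif_pair(series, 5))
```

    This exhaustive comparison is quadratic in the number of windows, so it is only practical for short series; sub-hourly records spanning years would need a faster method.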

  9. Causal discovery in the geosciences-Using synthetic data to learn how to interpret results

    NASA Astrophysics Data System (ADS)

    Ebert-Uphoff, Imme; Deng, Yi

    2017-02-01

    Causal discovery algorithms based on probabilistic graphical models have recently emerged in geoscience applications for the identification and visualization of dynamical processes. The key idea is to learn the structure of a graphical model from observed spatio-temporal data, thus finding pathways of interactions in the observed physical system. Studying those pathways allows geoscientists to learn subtle details about the underlying dynamical mechanisms governing our planet. Initial studies using this approach on real-world atmospheric data have shown great potential for scientific discovery. However, in these initial studies no ground truth was available, so that the resulting graphs have been evaluated only by whether a domain expert thinks they seemed physically plausible. The lack of ground truth is a typical problem when using causal discovery in the geosciences. Furthermore, while most of the connections found by this method match domain knowledge, we encountered one type of connection for which no explanation was found. To address both of these issues we developed a simulation framework that generates synthetic data of typical atmospheric processes (advection and diffusion). Applying the causal discovery algorithm to the synthetic data allowed us (1) to develop a better understanding of how these physical processes appear in the resulting connectivity graphs, and thus how to better interpret such connectivity graphs when obtained from real-world data; (2) to solve the mystery of the previously unexplained connections.

  10. Logical NAND and NOR Operations Using Algorithmic Self-assembly of DNA Molecules

    NASA Astrophysics Data System (ADS)

    Wang, Yanfeng; Cui, Guangzhao; Zhang, Xuncai; Zheng, Yan

    DNA self-assembly is the most advanced and versatile system that has been experimentally demonstrated for programmable construction of patterned systems on the molecular scale. It has been demonstrated that simple binary arithmetic and logical operations can be computed by the process of self-assembly of DNA tiles. Here we report a one-dimensional algorithmic self-assembly of DNA triple-crossover molecules that can be used to execute five steps of logical NAND and NOR operations on a string of binary bits. To achieve this, abstract tiles were translated into DNA tiles based on triple-crossover motifs. Serving as input for the computation, long single-stranded DNA molecules were used to nucleate growth of tiles into algorithmic crystals. Our method shows that engineered DNA self-assembly can be treated as a bottom-up design technique, and can be capable of designing DNA computer organization and architecture.
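
    For readers unfamiliar with the two operations, the sketch below shows the logic that the tiles are reported to compute, written in ordinary software. The abstract does not state how input bits are paired, so the element-wise pairing of two equal-length bit strings and the function names are assumptions made for illustration.

```python
def nand_bits(a, b):
    """Element-wise NAND of two equal-length bit strings (assumed pairing)."""
    if len(a) != len(b):
        raise ValueError("bit strings must have equal length")
    # NAND is 0 only when both inputs are 1
    return "".join("0" if x == "1" and y == "1" else "1" for x, y in zip(a, b))


def nor_bits(a, b):
    """Element-wise NOR of two equal-length bit strings (assumed pairing)."""
    if len(a) != len(b):
        raise ValueError("bit strings must have equal length")
    # NOR is 1 only when both inputs are 0
    return "".join("1" if x == "0" and y == "0" else "0" for x, y in zip(a, b))


print(nand_bits("0011", "0101"))  # 1110
print(nor_bits("0011", "0101"))   # 1000
```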

  11. Virtual Observatories, Data Mining, and Astroinformatics

    NASA Astrophysics Data System (ADS)

    Borne, Kirk

    The historical, current, and future trends in knowledge discovery from data in astronomy are presented here. The story begins with a brief history of data gathering and data organization. A description of the development of new information science technologies for astronomical discovery is then presented. Among these are e-Science and the virtual observatory, with its data discovery, access, display, and integration protocols; astroinformatics and data mining for exploratory data analysis, information extraction, and knowledge discovery from distributed data collections; new sky surveys' databases, including rich multivariate observational parameter sets for large numbers of objects; and the emerging discipline of data-oriented astronomical research, called astroinformatics. Astroinformatics is described as the fourth paradigm of astronomical research, following the three traditional research methodologies: observation, theory, and computation/modeling. Astroinformatics research areas include machine learning, data mining, visualization, statistics, semantic science, and scientific data management. Each of these areas is now an active research discipline, with significant science-enabling applications in astronomy. Research challenges and sample research scenarios are presented in these areas, in addition to sample algorithms for data-oriented research. These information science technologies enable scientific knowledge discovery from the increasingly large and complex data collections in astronomy. The education and training of the modern astronomy student must consequently include skill development in these areas, whose practitioners have traditionally been limited to applied mathematicians, computer scientists, and statisticians. Modern astronomical researchers must cross these traditional discipline boundaries, thereby borrowing the best-of-breed methodologies from multiple disciplines. In the era of large sky surveys and numerous large telescopes, the potential for astronomical discovery is equally large, and so the data-oriented research methods, algorithms, and techniques that are presented here will enable the greatest discovery potential from the ever-growing data and information resources in astronomy.

  12. PRISM offers a comprehensive genomic approach to transcription factor function prediction

    PubMed Central

    Wenger, Aaron M.; Clarke, Shoa L.; Guturu, Harendra; Chen, Jenny; Schaar, Bruce T.; McLean, Cory Y.; Bejerano, Gill

    2013-01-01

    The human genome encodes 1500–2000 different transcription factors (TFs). ChIP-seq is revealing the global binding profiles of a fraction of TFs in a fraction of their biological contexts. These data show that the majority of TFs bind directly next to a large number of context-relevant target genes, that most binding is distal, and that binding is context specific. Because of the effort and cost involved, ChIP-seq is seldom used in search of novel TF function. Such exploration is instead done using expression perturbation and genetic screens. Here we propose a comprehensive computational framework for transcription factor function prediction. We curate 332 high-quality nonredundant TF binding motifs that represent all major DNA binding domains, and improve cross-species conserved binding site prediction to obtain 3.3 million conserved, mostly distal, binding site predictions. We combine these with 2.4 million facts about all human and mouse gene functions, in a novel statistical framework, in search of enrichments of particular motifs next to groups of target genes of particular functions. Rigorous parameter tuning and a harsh null are used to minimize false positives. Our novel PRISM (predicting regulatory information from single motifs) approach obtains 2543 TF function predictions in a large variety of contexts, at a false discovery rate of 16%. The predictions are highly enriched for validated TF roles, and 45 of 67 (67%) tested binding site regions in five different contexts act as enhancers in functionally matched cells. PMID:23382538

  13. BioWord: A sequence manipulation suite for Microsoft Word

    PubMed Central

    2012-01-01

    Background: The ability to manipulate, edit and process DNA and protein sequences has rapidly become a necessary skill for practicing biologists across a wide swath of disciplines. In spite of this, most everyday sequence manipulation tools are distributed across several programs and web servers, sometimes requiring installation and typically involving frequent switching between applications. To address this problem, here we have developed BioWord, a macro-enabled self-installing template for Microsoft Word documents that integrates an extensive suite of DNA and protein sequence manipulation tools. Results: BioWord is distributed as a single macro-enabled template that self-installs with a single click. After installation, BioWord will open as a tab in the Office ribbon. Biologists can then easily manipulate DNA and protein sequences using a familiar interface and minimize the need to switch between applications. Beyond simple sequence manipulation, BioWord integrates functionality ranging from dyad search and consensus logos to motif discovery and pair-wise alignment. Written in Visual Basic for Applications (VBA) as an open source, object-oriented project, BioWord allows users with varying programming experience to expand and customize the program to better meet their own needs. Conclusions: BioWord integrates a powerful set of tools for biological sequence manipulation within a handy, user-friendly tab in a widely used word processing software package. The use of a simple scripting language and an object-oriented scheme facilitates customization by users and provides a very accessible educational platform for introducing students to basic bioinformatics algorithms. PMID:22676326

  14. Advanced text and video analytics for proactive decision making

    NASA Astrophysics Data System (ADS)

    Bowman, Elizabeth K.; Turek, Matt; Tunison, Paul; Porter, Reed; Thomas, Steve; Gintautas, Vadas; Shargo, Peter; Lin, Jessica; Li, Qingzhe; Gao, Yifeng; Li, Xiaosheng; Mittu, Ranjeev; Rosé, Carolyn Penstein; Maki, Keith; Bogart, Chris; Choudhari, Samrihdi Shree

    2017-05-01

    Today's warfighters operate in a highly dynamic and uncertain world, and face many competing demands. Asymmetric warfare and the new focus on small, agile forces has altered the framework by which time critical information is digested and acted upon by decision makers. Finding and integrating decision-relevant information is increasingly difficult in data-dense environments. In this new information environment, agile data algorithms, machine learning software, and threat alert mechanisms must be developed to automatically create alerts and drive quick response. Yet these advanced technologies must be balanced with awareness of the underlying context to accurately interpret machine-processed indicators and warnings and recommendations. One promising approach to this challenge brings together information retrieval strategies from text, video, and imagery. In this paper, we describe a technology demonstration that represents two years of tri-service research seeking to meld text and video for enhanced content awareness. The demonstration used multisource data to find an intelligence solution to a problem using a common dataset. Three technology highlights from this effort include 1) Incorporation of external sources of context into imagery normalcy modeling and anomaly detection capabilities, 2) Automated discovery and monitoring of targeted users from social media text, regardless of language, and 3) The concurrent use of text and imagery to characterize behaviour using the concept of kinematic and text motifs to detect novel and anomalous patterns. Our demonstration provided a technology baseline for exploiting heterogeneous data sources to deliver timely and accurate synopses of data that contribute to a dynamic and comprehensive worldview.

  15. BioWord: a sequence manipulation suite for Microsoft Word.

    PubMed

    Anzaldi, Laura J; Muñoz-Fernández, Daniel; Erill, Ivan

    2012-06-07

    The ability to manipulate, edit and process DNA and protein sequences has rapidly become a necessary skill for practicing biologists across a wide swath of disciplines. In spite of this, most everyday sequence manipulation tools are distributed across several programs and web servers, sometimes requiring installation and typically involving frequent switching between applications. To address this problem, here we have developed BioWord, a macro-enabled self-installing template for Microsoft Word documents that integrates an extensive suite of DNA and protein sequence manipulation tools. BioWord is distributed as a single macro-enabled template that self-installs with a single click. After installation, BioWord will open as a tab in the Office ribbon. Biologists can then easily manipulate DNA and protein sequences using a familiar interface and minimize the need to switch between applications. Beyond simple sequence manipulation, BioWord integrates functionality ranging from dyad search and consensus logos to motif discovery and pair-wise alignment. Written in Visual Basic for Applications (VBA) as an open source, object-oriented project, BioWord allows users with varying programming experience to expand and customize the program to better meet their own needs. BioWord integrates a powerful set of tools for biological sequence manipulation within a handy, user-friendly tab in a widely used word processing software package. The use of a simple scripting language and an object-oriented scheme facilitates customization by users and provides a very accessible educational platform for introducing students to basic bioinformatics algorithms.

  16. Forecasting petroleum discoveries in sparsely drilled areas: Nigeria and the North Sea

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Attanasi, E.D.; Root, D.H.

    1988-10-01

    Decline function methods for projecting future discoveries generally capture the crowding effects of wildcat wells on the discovery rate. However, these methods do not accommodate easily situations where exploration areas and horizons are expanding. In this paper, a method is presented that uses a mapping algorithm for separating these often countervailing influences. The method is applied to Nigeria and the North Sea. For an amount of future drilling equivalent to past drilling (825 wildcat wells), future discoveries (in resources found) for Nigeria are expected to decline by 68% per well but still amount to 8.5 billion barrels of oil equivalent (BOE). Similarly, for the total North Sea for an equivalent amount and mix among areas of past drilling (1322 wildcat wells), future discoveries are expected to amount to 17.9 billion BOE, whereas the average discovery rate per well is expected to decline by 71%.

  17. Forecasting petroleum discoveries in sparsely drilled areas: Nigeria and the North Sea

    USGS Publications Warehouse

    Attanasi, E.D.; Root, D.H.

    1988-01-01

    Decline function methods for projecting future discoveries generally capture the crowding effects of wildcat wells on the discovery rate. However, these methods do not accommodate easily situations where exploration areas and horizons are expanding. In this paper, a method is presented that uses a mapping algorithm for separating these often countervailing influences. The method is applied to Nigeria and the North Sea. For an amount of future drilling equivalent to past drilling (825 wildcat wells), future discoveries (in resources found) for Nigeria are expected to decline by 68% per well but still amount to 8.5 billion barrels of oil equivalent (BOE). Similarly, for the total North Sea for an equivalent amount and mix among areas of past drilling (1322 wildcat wells), future discoveries are expected to amount to 17.9 billion BOE, whereas the average discovery rate per well is expected to decline by 71%. © 1988 International Association for Mathematical Geology.

  18. A modified CoRoT detrend algorithm and the discovery of a new planetary companion

    NASA Astrophysics Data System (ADS)

    Boufleur, Rodrigo C.; Emilio, Marcelo; Janot-Pacheco, Eduardo; Andrade, Laerte; Ferraz-Mello, Sylvio; do Nascimento, José-Dias, Jr.; de La Reza, Ramiro

    2018-01-01

    We present MCDA, a modification of the COnvection ROtation and planetary Transits (CoRoT) detrend algorithm (CDA) suitable to detrend chromatic light curves. By means of robust statistics and better handling of short-term variability, the implementation decreases the systematic light-curve variations and improves the detection of exoplanets when compared with the original algorithm. All CoRoT chromatic light curves (a total of 65 655) were analysed with our algorithm. Dozens of new transit candidates and all previously known CoRoT exoplanets were rediscovered in those light curves using a box-fitting algorithm. For three of the new cases, spectroscopic measurements of the candidates' host stars were retrieved from the ESO Science Archive Facility and used to calculate stellar parameters and, in the best cases, radial velocities. In addition to our improved detrend technique, we announce the discovery of a planet that orbits a 0.79_{-0.09}^{+0.08} R⊙ star with a period of 6.718 37 ± 0.000 01 d and has 0.57_{-0.05}^{+0.06} RJ and 0.15 ± 0.10 MJ. We also present the analysis of two cases in which parameters found suggest the existence of possible planetary companions.

  19. Entropic Profiler – detection of conservation in genomes using information theory

    PubMed Central

    Fernandes, Francisco; Freitas, Ana T; Almeida, Jonas S; Vinga, Susana

    2009-01-01

    Background: In the last decades, with the successive availability of whole genome sequences, many research efforts have been made to mathematically model DNA. Entropic Profiles (EP) were proposed recently as a new measure of continuous entropy of genome sequences. EP represent local information plots related to DNA randomness and are based on information theory and statistical concepts. They express the weighed relative abundance of motifs for each position in genomes. Their study is very relevant because under- or over-represented segments are often associated with significant biological meaning. Findings: The Entropic Profiler application presented here is a new tool designed to detect and extract under- and over-represented DNA segments in genomes by using EP. It computes EP very efficiently by resorting to improved algorithms and data structures, which include modified suffix trees. Available through a web interface and as downloadable source code, it allows users to study positions and to search for motifs inside the whole sequence or within a specified range. DNA sequences can be entered from different sources, including FASTA files, pre-loaded examples or previously saved work. Besides the EP value plots, p-values and z-scores for each motif are also computed, along with the Chaos Game Representation of the sequence. Conclusion: EP are directly related to the statistical significance of motifs and can be considered as a new method to extract and classify significant regions in genomes and estimate local scales in DNA. The present implementation establishes an efficient and useful tool for whole genome analysis. PMID:19416538

  20. Research of three level match method about semantic web service based on ontology

    NASA Astrophysics Data System (ADS)

    Xiao, Jie; Cai, Fang

    2011-10-01

    An important step in Web service applications is the discovery of useful services. Traditional technologies such as UDDI and WSDL rely on keywords for service discovery, with the disadvantages of user intervention, lack of semantic description and low accuracy. To cope with these problems, OWL-S is introduced and extended with QoS attributes to describe the attributes and functions of Web services. A three-level service matching algorithm based on ontology and QoS is proposed in this paper. Our algorithm can match Web services by utilizing the service profile and QoS parameters together with the input and output of the service. Simulation results show that it greatly enhances the speed of service matching while high accuracy is also guaranteed.

  1. Breaking free from chemical spreadsheets.

    PubMed

    Segall, Matthew; Champness, Ed; Leeding, Chris; Chisholm, James; Hunt, Peter; Elliott, Alex; Garcia-Martinez, Hector; Foster, Nick; Dowling, Samuel

    2015-09-01

    Drug discovery scientists often consider compounds and data in terms of groups, such as chemical series, and relationships, representing similarity or structural transformations, to aid compound optimisation. This is often supported by chemoinformatics algorithms, for example clustering and matched molecular pair analysis. However, chemistry software packages commonly present these data as spreadsheets or form views that make it hard to find relevant patterns or compare related compounds conveniently. Here, we review common data visualisation and analysis methods used to extract information from chemistry data. We introduce a new framework that enables scientists to work flexibly with drug discovery data to reflect their thought processes and interact with the output of algorithms to identify key structure-activity relationships and guide further optimisation intuitively. Copyright © 2015 Elsevier Ltd. All rights reserved.

  2. Objective performance assessment of five computed tomography iterative reconstruction algorithms.

    PubMed

    Omotayo, Azeez; Elbakri, Idris

    2016-11-22

    Iterative algorithms are gaining clinical acceptance in CT. We performed objective phantom-based image quality evaluation of five commercial iterative reconstruction algorithms available on four different multi-detector CT (MDCT) scanners at different dose levels as well as the conventional filtered back-projection (FBP) reconstruction. Using the Catphan500 phantom, we evaluated image noise, contrast-to-noise ratio (CNR), modulation transfer function (MTF) and noise-power spectrum (NPS). The algorithms were evaluated over a CTDIvol range of 0.75-18.7 mGy on four major MDCT scanners: GE DiscoveryCT750HD (algorithms: ASIR™ and VEO™); Siemens Somatom Definition AS+ (algorithm: SAFIRE™); Toshiba Aquilion64 (algorithm: AIDR3D™); and Philips Ingenuity iCT256 (algorithm: iDose4™). Images were reconstructed using FBP and the respective iterative algorithms on the four scanners. Use of iterative algorithms decreased image noise and increased CNR, relative to FBP. In the dose range of 1.3-1.5 mGy, noise reduction using iterative algorithms was in the range of 11%-51% on GE DiscoveryCT750HD, 10%-52% on Siemens Somatom Definition AS+, 49%-62% on Toshiba Aquilion64, and 13%-44% on Philips Ingenuity iCT256. The corresponding CNR increase was in the range 11%-105% on GE, 11%-106% on Siemens, 85%-145% on Toshiba and 13%-77% on Philips respectively. Most algorithms did not affect the MTF, except for VEO™ which produced an increase in the limiting resolution of up to 30%. A shift in the peak of the NPS curve towards lower frequencies and a decrease in NPS amplitude were obtained with all iterative algorithms. VEO™ required long reconstruction times, while all other algorithms produced reconstructions in real time. Compared to FBP, iterative algorithms reduced image noise and increased CNR. The iterative algorithms available on different scanners achieved different levels of noise reduction and CNR increase while spatial resolution improvements were obtained only with VEO™. 
This study is useful in that it provides performance assessment of the iterative algorithms available from several mainstream CT manufacturers.
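
    The abstract reports noise reduction and CNR increase as percentages but does not give the formulas used. The sketch below shows one common way such figures are computed from region-of-interest (ROI) pixel values: CNR as the absolute difference of ROI and background means divided by the background standard deviation, and noise reduction as the relative drop in standard deviation against FBP. These definitions, the function names and the sample values are assumptions for illustration, not the study's protocol or data.

```python
import statistics


def cnr(roi, background):
    """Contrast-to-noise ratio under an assumed definition:
    |mean(roi) - mean(background)| / population std of background."""
    noise = statistics.pstdev(background)
    return abs(statistics.fmean(roi) - statistics.fmean(background)) / noise


def percent_noise_reduction(noise_fbp, noise_iterative):
    """Relative noise reduction of an iterative reconstruction versus FBP, in %."""
    return 100.0 * (noise_fbp - noise_iterative) / noise_fbp


# Made-up pixel values in Hounsfield units, for illustration only
roi = [110, 112, 108, 110]
background = [100, 102, 98, 100]
print(round(cnr(roi, background), 3))
print(percent_noise_reduction(10.0, 6.0))
```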

  3. Unimolecular Reaction Pathways of a γ-Ketohydroperoxide from Combined Application of Automated Reaction Discovery Methods

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Grambow, Colin A.; Jamal, Adeel; Li, Yi-Pei

    Ketohydroperoxides are important in liquid-phase autoxidation and in gas-phase partial oxidation and pre-ignition chemistry, but because of their low concentration, instability, and various analytical chemistry limitations, it has been challenging to experimentally determine their reactivity, and only a few pathways are known. In the present work, 75 elementary-step unimolecular reactions of the simplest γ-ketohydroperoxide, 3-hydroperoxypropanal, were discovered by a combination of density functional theory with several automated transition-state search algorithms: the Berny algorithm coupled with the freezing string method, single- and double-ended growing string methods, the heuristic KinBot algorithm, and the single-component artificial force induced reaction method (SC-AFIR). The present joint approach significantly outperforms previous manual and automated transition-state searches – 68 of the reactions of γ-ketohydroperoxide discovered here were previously unknown and completely unexpected. All of the methods found the lowest-energy transition state, which corresponds to the first step of the Korcek mechanism, but each algorithm except for SC-AFIR detected several reactions not found by any of the other methods. We show that the low-barrier chemical reactions involve promising new chemistry that may be relevant in atmospheric and combustion systems. Our study highlights the complexity of chemical space exploration and the advantage of combined application of several approaches. Altogether, the present work demonstrates both the power and the weaknesses of existing fully automated approaches for reaction discovery which suggest possible directions for further method development and assessment in order to enable reliable discovery of all important reactions of any specified reactant(s).

  4. Unimolecular Reaction Pathways of a γ-Ketohydroperoxide from Combined Application of Automated Reaction Discovery Methods

    DOE PAGES

    Grambow, Colin A.; Jamal, Adeel; Li, Yi-Pei; ...

    2017-12-22

    Ketohydroperoxides are important in liquid-phase autoxidation and in gas-phase partial oxidation and pre-ignition chemistry, but because of their low concentration, instability, and various analytical chemistry limitations, it has been challenging to experimentally determine their reactivity, and only a few pathways are known. In the present work, 75 elementary-step unimolecular reactions of the simplest γ-ketohydroperoxide, 3-hydroperoxypropanal, were discovered by a combination of density functional theory with several automated transition-state search algorithms: the Berny algorithm coupled with the freezing string method, single- and double-ended growing string methods, the heuristic KinBot algorithm, and the single-component artificial force induced reaction method (SC-AFIR). The present joint approach significantly outperforms previous manual and automated transition-state searches – 68 of the reactions of γ-ketohydroperoxide discovered here were previously unknown and completely unexpected. All of the methods found the lowest-energy transition state, which corresponds to the first step of the Korcek mechanism, but each algorithm except for SC-AFIR detected several reactions not found by any of the other methods. We show that the low-barrier chemical reactions involve promising new chemistry that may be relevant in atmospheric and combustion systems. Our study highlights the complexity of chemical space exploration and the advantage of combined application of several approaches. Altogether, the present work demonstrates both the power and the weaknesses of existing fully automated approaches for reaction discovery which suggest possible directions for further method development and assessment in order to enable reliable discovery of all important reactions of any specified reactant(s).

  5. ARG-walker: inference of individual specific strengths of meiotic recombination hotspots by population genomics analysis.

    PubMed

    Chen, Hao; Yang, Peng; Guo, Jing; Kwoh, Chee Keong; Przytycka, Teresa M; Zheng, Jie

    2015-01-01

    Meiotic recombination hotspots play important roles in various aspects of genomics, but the underlying mechanisms for regulating the locations and strengths of recombination hotspots are not yet fully revealed. Most existing algorithms for estimating recombination rates from sequence polymorphism data can only output average recombination rates of a population, although there is evidence for the heterogeneity in recombination rates among individuals. For genome-wide association studies (GWAS) of recombination hotspots, an efficient algorithm that estimates the individualized strengths of recombination hotspots is highly desirable. In this work, we propose a novel graph mining algorithm named ARG-walker, based on random walks on ancestral recombination graphs (ARG), to estimate individual-specific recombination hotspot strengths. Extensive simulations demonstrate that ARG-walker is able to distinguish the hot allele of a recombination hotspot from the cold allele. Integrated with output of ARG-walker, we performed GWAS on the phased haplotype data of the 22 autosome chromosomes of the HapMap Asian population samples of Chinese and Japanese (JPT+CHB). Significant cis-regulatory signals have been detected, which is corroborated by the enrichment of the well-known 13-mer motif CCNCCNTNNCCNC of PRDM9 protein. Moreover, two new DNA motifs have been identified in the flanking regions of the significantly associated SNPs (single nucleotide polymorphisms), which are likely to be new cis-regulatory elements of meiotic recombination hotspots of the human genome. Our results on both simulated and real data suggest that ARG-walker is a promising new method for estimating the individual recombination variations. In the future, it could be used to uncover the mechanisms of recombination regulation and human diseases related with recombination hotspots.
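
    The 13-mer CCNCCNTNNCCNC quoted above is written in IUPAC notation, where N stands for any nucleotide. As a minimal sketch of how such a degenerate motif can be located in a sequence (this is not part of ARG-walker; the function names and the toy sequence are hypothetical, and only the forward strand is scanned), one can translate the motif into a regular expression:

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def motif_to_regex(motif):
    """Compile an IUPAC motif into a regex; the lookahead allows overlapping hits."""
    body = "".join(IUPAC[base] for base in motif.upper())
    return re.compile("(?=(" + body + "))")


def find_motif(sequence, motif):
    """Return the 0-based start of every (possibly overlapping) forward-strand match."""
    pattern = motif_to_regex(motif)
    return [m.start() for m in pattern.finditer(sequence.upper())]


PRDM9_MOTIF = "CCNCCNTNNCCNC"  # the 13-mer quoted in the abstract

# Hypothetical toy sequence for illustration only, not real genomic data.
toy = "AAACCTCCCTAGCCACGGG"
print(find_motif(toy, PRDM9_MOTIF))  # → [3]
```

    A real enrichment analysis would also scan the reverse complement and compare hit counts against a background model; the sketch only shows the matching step.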

  6. Improved modeling of side-chain--base interactions and plasticity in protein--DNA interface design.

    PubMed

    Thyme, Summer B; Baker, David; Bradley, Philip

    2012-06-08

    Combinatorial sequence optimization for protein design requires libraries of discrete side-chain conformations. The discreteness of these libraries is problematic, particularly for long, polar side chains, since favorable interactions can be missed. Previously, an approach to loop remodeling where protein backbone movement is directed by side-chain rotamers predicted to form interactions previously observed in native complexes (termed "motifs") was described. Here, we show how such motif libraries can be incorporated into combinatorial sequence optimization protocols and improve native complex recapitulation. Guided by the motif rotamer searches, we made improvements to the underlying energy function, increasing recapitulation of native interactions. To further test the methods, we carried out a comprehensive experimental scan of amino acid preferences in the I-AniI protein-DNA interface and found that many positions tolerated multiple amino acids. This sequence plasticity is not observed in the computational results because of the fixed-backbone approximation of the model. We improved modeling of this diversity by introducing DNA flexibility and reducing the convergence of the simulated annealing algorithm that drives the design process. In addition to serving as a benchmark, this extensive experimental data set provides insight into the types of interactions essential to maintain the function of this potential gene therapy reagent. Published by Elsevier Ltd.

  7. Profiling of 3696 Nuclear Receptor-Coregulator Interactions: A Resource for Biological and Clinical Discovery.

    PubMed

    Broekema, Marjoleine F; Hollman, Danielle A A; Koppen, Arjen; van den Ham, Henk-Jan; Melchers, Diana; Pijnenburg, Dirk; Ruijtenbeek, Rob; van Mil, Saskia W C; Houtman, René; Kalkhoven, Eric

    2018-06-01

    Nuclear receptors (NRs) are ligand-inducible transcription factors that play critical roles in metazoan development, reproduction, and physiology and therefore are implicated in a broad range of pathologies. The transcriptional activity of NRs critically depends on their interaction(s) with transcriptional coregulator proteins, including coactivators and corepressors. Short leucine-rich peptide motifs in these proteins (LxxLL in coactivators and LxxxIxxxL in corepressors) are essential and sufficient for NR binding. With 350 different coregulator proteins identified to date and with many coregulators containing multiple interaction motifs, an enormous combinatorial potential is present for selective NR-mediated gene regulation. However, NR-coregulator interactions have often been determined experimentally on a one-to-one basis across diverse experimental conditions. In addition, NR-coregulator interactions are difficult to predict because the molecular determinants that govern specificity are not well established. Therefore, many biologically and clinically relevant NR-coregulator interactions may remain to be discovered. Here, we present a comprehensive overview of 3696 NR-coregulator interactions by systematically characterizing the binding of 24 nuclear receptors with 154 coregulator peptides. We identified unique ligand-dependent NR-coregulator interaction profiles for each NR, confirming many well-established NR-coregulator interactions. Hierarchical clustering based on the NR-coregulator interaction profiles largely recapitulates the classification of NR subfamilies based on the primary amino acid sequences of the ligand-binding domains, indicating that amino acid sequence is an important, although not the only, molecular determinant in directing and fine-tuning NR-coregulator interactions. This NR-coregulator peptide interactome provides an open data resource for future biological and clinical discovery as well as NR-based drug design.
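
    The two peptide motifs named above, LxxLL and LxxxIxxxL, use x for any residue, so they map directly onto regular expressions. The following is a minimal sketch of that mapping (the function name and both toy peptides are hypothetical and are not taken from the study's peptide set):

```python
import re

# "x" in the motif notation means any residue, written as "." in a regex.
# The lookahead lets overlapping occurrences be reported.
COACTIVATOR = re.compile(r"(?=(L..LL))")      # LxxLL
COREPRESSOR = re.compile(r"(?=(L...I...L))")  # LxxxIxxxL


def find_nr_motifs(peptide):
    """Return 0-based start positions of each motif type in a peptide string."""
    seq = peptide.upper()
    return {
        "LxxLL": [m.start() for m in COACTIVATOR.finditer(seq)],
        "LxxxIxxxL": [m.start() for m in COREPRESSOR.finditer(seq)],
    }


# Hypothetical toy peptides for illustration only.
print(find_nr_motifs("AAKLQELLGGS"))    # → {'LxxLL': [3], 'LxxxIxxxL': []}
print(find_nr_motifs("GSLAAAIGGGLKK"))  # → {'LxxLL': [], 'LxxxIxxxL': [2]}
```

    A sequence match of this kind only flags candidate motifs; as the abstract notes, the determinants of binding specificity go beyond the primary sequence.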

  8. Discovery of cyclotides in the Fabaceae plant family provides new insights into the cyclization, evolution, and distribution of circular proteins.

    PubMed

    Poth, Aaron G; Colgrave, Michelle L; Philip, Reynold; Kerenga, Bomai; Daly, Norelle L; Anderson, Marilyn A; Craik, David J

    2011-04-15

    Cyclotides are plant proteins whose defining structural features are a head-to-tail cyclized backbone and three interlocking disulfide bonds, which in combination are known as a cyclic cystine knot. This unique structural motif confers cyclotides with exceptional resistance to proteolysis. Their endogenous function is thought to be as plant defense agents, associated with their insecticidal and larval growth-inhibitory properties. However, in addition, an array of pharmaceutically relevant biological activities has been ascribed to cyclotides, including anti-HIV, anthelmintic, uterotonic, and antimicrobial effects. So far, >150 cyclotides have been elucidated from members of the Rubiaceae, Violaceae, and Cucurbitaceae plant families, but their wider distribution among other plant families remains unclear. Clitoria ternatea (Butterfly pea) is a member of plant family Fabaceae and through its usage in traditional medicine to aid childbirth bears similarity to Oldenlandia affinis, from which many cyclotides have been isolated. Using a combination of nanospray and matrix-assisted laser desorption ionization-time-of-flight (MALDI-TOF) analyses, we examined seed extracts of C. ternatea and discovered cyclotides in the Fabaceae, the third-largest family of flowering plants. We characterized 12 novel cyclotides, thus expanding knowledge of cyclotide distribution and evolution within the plant kingdom. The discovery of cyclotides containing novel sequence motifs near the in planta cyclization site has provided new insights into cyclotide biosynthesis. In particular, MS analyses of the novel cyclotides from C. ternatea suggest that Asn to Asp variants at the cyclization site are more common than previously recognized. Moreover, this study provides impetus for the examination of other economically and agriculturally significant species within Fabaceae, now the largest plant family from which cyclotides have been described.

  9. Global optimization of small bimetallic Pd-Co binary nanoalloy clusters: a genetic algorithm approach at the DFT level.

    PubMed

    Aslan, Mikail; Davis, Jack B A; Johnston, Roy L

    2016-03-07

    The global optimisation of small bimetallic PdCo binary nanoalloys is systematically investigated using the Birmingham Cluster Genetic Algorithm (BCGA). The effect of size and composition on the structures, stability, magnetic and electronic properties including the binding energies, second finite difference energies and mixing energies of Pd-Co binary nanoalloys is discussed. A detailed analysis of Pd-Co structural motifs and segregation effects is also presented. The maximal mixing energy corresponds to Pd atom compositions for which the number of mixed Pd-Co bonds is maximised. Global minimum clusters are distinguished from transition states by vibrational frequency analysis. HOMO-LUMO gap, electric dipole moment and vibrational frequency analyses are made to enable correlation with future experiments.

  10. Automated discovery of local search heuristics for satisfiability testing.

    PubMed

    Fukunaga, Alex S

    2008-01-01

    The development of successful metaheuristic algorithms such as local search for a difficult problem such as satisfiability testing (SAT) is a challenging task. We investigate an evolutionary approach to automating the discovery of new local search heuristics for SAT. We show that several well-known SAT local search algorithms such as Walksat and Novelty are composite heuristics that are derived from novel combinations of a set of building blocks. Based on this observation, we developed CLASS, a genetic programming system that uses a simple composition operator to automatically discover SAT local search heuristics. New heuristics discovered by CLASS are shown to be competitive with the best Walksat variants, including Novelty+. Evolutionary algorithms have previously been applied to directly evolve a solution for a particular SAT instance. We show that the heuristics discovered by CLASS are also competitive with these previous, direct evolutionary approaches for SAT. We also analyze the local search behavior of the learned heuristics using the depth, mobility, and coverage metrics proposed by Schuurmans and Southey.
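
    For readers unfamiliar with the local search algorithms mentioned above, the following is a minimal sketch of a textbook WalkSAT-style procedure. It is not the CLASS system and not any of the heuristics it discovered; the function name, parameter names, and default values are assumptions chosen for illustration. Clauses are lists of nonzero integers in the DIMACS convention (a negative number is a negated variable).

```python
import random


def walksat(clauses, n_vars, max_flips=10_000, noise=0.5, seed=0):
    """Textbook WalkSAT sketch: return a satisfying assignment as a list of bools, or None."""
    rng = random.Random(seed)
    # assign[v] is the value of variable v; index 0 is unused.
    assign = [False] + [rng.random() < 0.5 for _ in range(n_vars)]

    def lit_true(lit):
        return assign[abs(lit)] == (lit > 0)

    def satisfied(clause):
        return any(lit_true(lit) for lit in clause)

    def break_count(var):
        # Number of currently satisfied clauses that flipping var would falsify.
        before = [satisfied(c) for c in clauses]
        assign[var] = not assign[var]
        broken = sum(1 for c, ok in zip(clauses, before) if ok and not satisfied(c))
        assign[var] = not assign[var]  # undo the trial flip
        return broken

    for _ in range(max_flips):
        unsat = [c for c in clauses if not satisfied(c)]
        if not unsat:
            return assign[1:]
        clause = rng.choice(unsat)  # focus on one violated clause
        variables = [abs(lit) for lit in clause]
        breaks = {v: break_count(v) for v in variables}
        best = min(breaks.values())
        if best > 0 and rng.random() < noise:
            var = rng.choice(variables)  # random-walk step
        else:
            # Greedy step (always taken when a flip breaks nothing).
            var = rng.choice([v for v in variables if breaks[v] == best])
        assign[var] = not assign[var]

    # The last flip may have produced a solution.
    return assign[1:] if all(satisfied(c) for c in clauses) else None


# (x1 or x2) and (not x1 or x3) and (not x2 or not x3) and (x1 or not x3)
formula = [[1, 2], [-1, 3], [-2, -3], [1, -3]]
print(walksat(formula, n_vars=3))
```

    The variable-selection rule inside the loop is the part that variants such as Novelty change, which is why the abstract can describe these algorithms as compositions of shared building blocks.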

  11. DenguePredict: An Integrated Drug Repositioning Approach towards Drug Discovery for Dengue.

    PubMed

    Wang, QuanQiu; Xu, Rong

    2015-01-01

    Dengue is a viral disease of expanding global incidence without cures. Here we present a drug repositioning system (DenguePredict) leveraging upon a unique drug treatment database and vast amounts of disease- and drug-related data. We first constructed a large-scale genetic disease network with enriched dengue genetics data curated from biomedical literature. We applied a network-based ranking algorithm to find dengue-related diseases from the disease network. We then developed a novel algorithm to prioritize FDA-approved drugs from dengue-related diseases to treat dengue. When tested in a de-novo validation setting, DenguePredict found the only two drugs tested in clinical trials for treating dengue and ranked them highly: chloroquine ranked at top 0.96% and ivermectin at top 22.75%. We showed that drugs targeting immune systems and arachidonic acid metabolism-related apoptotic pathways might represent innovative drugs to treat dengue. In summary, DenguePredict, by combining comprehensive disease- and drug-related data and novel algorithms, may greatly facilitate drug discovery for dengue.

  12. Visualizing Dynamic Bitcoin Transaction Patterns.

    PubMed

    McGinn, Dan; Birch, David; Akroyd, David; Molina-Solana, Miguel; Guo, Yike; Knottenbelt, William J

    2016-06-01

    This work presents a systemic top-down visualization of Bitcoin transaction activity to explore dynamically generated patterns of algorithmic behavior. Bitcoin dominates the cryptocurrency markets and presents researchers with a rich source of real-time transactional data. The pseudonymous yet public nature of the data presents opportunities for the discovery of human and algorithmic behavioral patterns of interest to many parties such as financial regulators, protocol designers, and security analysts. However, retaining visual fidelity to the underlying data to retain a fuller understanding of activity within the network remains challenging, particularly in real time. We expose an effective force-directed graph visualization employed in our large-scale data observation facility to accelerate this data exploration and derive useful insight among domain experts and the general public alike. The high-fidelity visualizations demonstrated in this article allowed for collaborative discovery of unexpected high frequency transaction patterns, including automated laundering operations, and the evolution of multiple distinct algorithmic denial of service attacks on the Bitcoin network.

  13. Visualizing Dynamic Bitcoin Transaction Patterns

    PubMed Central

    McGinn, Dan; Birch, David; Akroyd, David; Molina-Solana, Miguel; Guo, Yike; Knottenbelt, William J.

    2016-01-01

    This work presents a systemic top-down visualization of Bitcoin transaction activity to explore dynamically generated patterns of algorithmic behavior. Bitcoin dominates the cryptocurrency markets and presents researchers with a rich source of real-time transactional data. The pseudonymous yet public nature of the data presents opportunities for the discovery of human and algorithmic behavioral patterns of interest to many parties such as financial regulators, protocol designers, and security analysts. However, retaining visual fidelity to the underlying data to retain a fuller understanding of activity within the network remains challenging, particularly in real time. We expose an effective force-directed graph visualization employed in our large-scale data observation facility to accelerate this data exploration and derive useful insight among domain experts and the general public alike. The high-fidelity visualizations demonstrated in this article allowed for collaborative discovery of unexpected high frequency transaction patterns, including automated laundering operations, and the evolution of multiple distinct algorithmic denial of service attacks on the Bitcoin network. PMID:27441715

  14. Identification of Patients with Family History of Pancreatic Cancer - Investigation of an NLP System Portability

    PubMed Central

    Mehrabi, Saeed; Krishnan, Anand; Roch, Alexandra M; Schmidt, Heidi; Li, DingCheng; Kesterson, Joe; Beesley, Chris; Dexter, Paul; Schmidt, Max; Palakal, Mathew; Liu, Hongfang

    2018-01-01

    In this study we have developed a rule-based natural language processing (NLP) system to identify patients with family history of pancreatic cancer. The algorithm was developed in an Unstructured Information Management Architecture (UIMA) framework and consisted of section segmentation, relation discovery, and negation detection. The system was evaluated on data from two institutions. The family history identification precision was consistent across the institutions, shifting from 88.9% on the Indiana University (IU) dataset to 87.8% on the Mayo Clinic dataset. Customizing the algorithm on the Mayo Clinic data increased its precision to 88.1%. The family member relation discovery achieved precision, recall, and F-measure of 75.3%, 91.6% and 82.6% respectively. Negation detection resulted in precision of 99.1%. The results show that rule-based NLP approaches for specific information extraction tasks are portable across institutions; however, customization of the algorithm on the new dataset improves its performance. PMID:26262122
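
    The F-measure reported for family member relation discovery can be checked against the stated precision and recall. The sketch below assumes the balanced F-measure (F1, the harmonic mean of precision and recall), which the abstract does not state explicitly; the function name is hypothetical.

```python
def f_measure(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing was retrieved correctly
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported in the abstract (75.3% and 91.6%).
f1 = f_measure(0.753, 0.916)
print(f"{f1:.4f}")  # about 0.8265
```

    The result agrees with the reported 82.6% to within the rounding of the published precision and recall values.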

  15. An Iterative Time Windowed Signature Algorithm for Time Dependent Transcription Module Discovery

    PubMed Central

    Meng, Jia; Gao, Shou-Jiang; Huang, Yufei

    2010-01-01

    An algorithm for the discovery of time varying modules using genome-wide expression data is presented here. When applied to large-scale time series data, our method is designed to discover not only the transcription modules but also their timing information, which is rarely annotated by the existing approaches. Rather than assuming commonly defined time constant transcription modules, a module is depicted as a set of genes that are co-regulated during a specific period of time, i.e., a time dependent transcription module (TDTM). A rigorous mathematical definition of TDTM is provided, which serves as an objective function for retrieving modules. Based on the definition, an effective signature algorithm is proposed that iteratively searches the transcription modules from the time series data. The proposed method was tested on the simulated systems and applied to the human time series microarray data during Kaposi's sarcoma-associated herpesvirus (KSHV) infection. The result has been verified by Expression Analysis Systematic Explorer. PMID:21552463

  16. Prediction of TF target sites based on atomistic models of protein-DNA complexes

    PubMed Central

    Angarica, Vladimir Espinosa; Pérez, Abel González; Vasconcelos, Ana T; Collado-Vides, Julio; Contreras-Moreira, Bruno

    2008-01-01

    Background: The specific recognition of genomic cis-regulatory elements by transcription factors (TFs) plays an essential role in the regulation of coordinated gene expression. Studying the mechanisms determining binding specificity in protein-DNA interactions is thus an important goal. Most current approaches for modeling TF specific recognition rely on the knowledge of large sets of cognate target sites and consider only the information contained in their primary sequence. Results: Here we describe a structure-based methodology for predicting sequence motifs starting from the coordinates of a TF-DNA complex. Our algorithm combines information regarding the direct and indirect readout of DNA into an atomistic statistical model, which is used to estimate the interaction potential. We first measure the ability of our method to correctly estimate the binding specificities of eight prokaryotic and eukaryotic TFs that belong to different structural superfamilies. Secondly, the method is applied to two homology models, finding that sampling of interface side-chain rotamers remarkably improves the results. Thirdly, the algorithm is compared with a reference structural method based on contact counts, obtaining comparable predictions for the experimental complexes and more accurate sequence motifs for the homology models. Conclusion: Our results demonstrate that atomic-detail structural information can be feasibly used to predict TF binding sites. The computational method presented here is universal and might be applied to other systems involving protein-DNA recognition. PMID:18922190

  17. Discovery of potent 1H-imidazo[4,5-b]pyridine-based c-Met kinase inhibitors via mechanism-directed structural optimization.

    PubMed

    An, Xiao-De; Liu, Hongyan; Xu, Zhong-Liang; Jin, Yi; Peng, Xia; Yao, Ying-Ming; Geng, Meiyu; Long, Ya-Qiu

    2015-02-01

    Starting from our previously identified novel c-Met kinase inhibitors bearing 1H-imidazo[4,5-h][1,6]naphthyridin-2(3H)-one scaffold, a global structural exploration was conducted to furnish an optimal binding motif for further development, directed by the enzyme inhibitory mechanism. First round SAR study picked two imidazonaphthyridinone frameworks with 1,8- and 3,5-disubstitution pattern as class I and class II c-Met kinase inhibitors, respectively. Further structural optimization on type II inhibitors by truncation of the imidazonaphthyridinone core and incorporation of an N-phenyl cyclopropane-1,1-dicarboxamide pharmacophore led to the discovery of novel imidazopyridine-based c-Met kinase inhibitors, displaying nanomolar enzyme inhibitory activity and improved Met kinase selectivity. More significantly, the new chemotype c-Met kinase inhibitors effectively inhibited Met phosphorylation and its downstream signaling as well as the proliferation of Met-dependent EBC-1 human lung cancer cells at submicromolar concentrations. Copyright © 2014 Elsevier Ltd. All rights reserved.

  18. A novel spiroindoline targets cell cycle and migration via modulation of microtubule cytoskeleton.

    PubMed

    Kumar, Naveen; Hati, Santanu; Munshi, Parthapratim; Sen, Subhabrata; Sehrawat, Seema; Singh, Shailja

    2017-05-01

    Natural product-inspired libraries of molecules with diverse architectures have evolved as one of the most useful tools for discovering lead molecules for drug discovery. In comparison to conventional combinatorial libraries, these molecules have been inferred to perform better in phenotypic screening against complicated targets. Diversity-oriented synthesis (DOS) is a forward directional strategy to access such multifaceted library of molecules. From a successful DOS campaign of a natural product-inspired library, recently a small molecule with spiroindoline motif was identified as a potent anti-breast cancer compound. Herein we report the subcellular studies performed for this molecule on breast cancer cells. Our investigation revealed that it repositions microtubule cytoskeleton and displaces AKAP9 located at the microtubule organization centre. DNA ladder assay and cell cycle experiments further established the molecule as an apoptotic agent. This work further substantiated the amalgamation of DOS-phenotypic screening-sub-cellular studies as a consolidated blueprint for the discovery of potential pharmaceutical drug candidates.

  19. Seven-spot ladybird optimization: a novel and efficient metaheuristic algorithm for numerical optimization.

    PubMed

    Wang, Peng; Zhu, Zhouquan; Huang, Shuai

    2013-01-01

    This paper presents a novel biologically inspired metaheuristic algorithm called seven-spot ladybird optimization (SLO). The SLO is inspired by recent discoveries on the foraging behavior of a seven-spot ladybird. In this paper, the performance of the SLO is compared with that of the genetic algorithm, particle swarm optimization, and artificial bee colony algorithms by using five numerical benchmark functions with multimodality. The results show that SLO has the ability to find the best solution with a comparatively small population size and is suitable for solving optimization problems with lower dimensions.

  20. Seven-Spot Ladybird Optimization: A Novel and Efficient Metaheuristic Algorithm for Numerical Optimization

    PubMed Central

    Zhu, Zhouquan

    2013-01-01

    This paper presents a novel biologically inspired metaheuristic algorithm called seven-spot ladybird optimization (SLO). The SLO is inspired by recent discoveries on the foraging behavior of a seven-spot ladybird. In this paper, the performance of the SLO is compared with that of the genetic algorithm, particle swarm optimization, and artificial bee colony algorithms by using five numerical benchmark functions with multimodality. The results show that SLO has the ability to find the best solution with a comparatively small population size and is suitable for solving optimization problems with lower dimensions. PMID:24385879

  1. Kleinberg Complex Networks

    DTIC Science & Technology

    2014-10-21

    linear combinations of paths. This project featured research on two classes of routing problems, namely traveling salesman problems and multicommodity...flows. One highlight of this research was our discovery of a polynomial-time algorithm for the metric traveling salesman s-t path problem whose...metric TSP would resolve one of the most venerable open problems in the theory of approximation algorithms. Our research on traveling salesman

  2. A novel strategy for classifying the output from an in silico vaccine discovery pipeline for eukaryotic pathogens using machine learning algorithms.

    PubMed

    Goodswen, Stephen J; Kennedy, Paul J; Ellis, John T

    2013-11-02

    An in silico vaccine discovery pipeline for eukaryotic pathogens typically consists of several computational tools to predict protein characteristics. The aim of the in silico approach to discovering subunit vaccines is to use predicted characteristics to identify proteins which are worthy of laboratory investigation. A major challenge is that these predictions are inherent with hidden inaccuracies and contradictions. This study focuses on how to reduce the number of false candidates using machine learning algorithms rather than relying on expensive laboratory validation. Proteins from Toxoplasma gondii, Plasmodium sp., and Caenorhabditis elegans were used as training and test datasets. The results show that machine learning algorithms can effectively distinguish expected true from expected false vaccine candidates (with an average sensitivity and specificity of 0.97 and 0.98 respectively), for proteins observed to induce immune responses experimentally. Vaccine candidates from an in silico approach can only be truly validated in a laboratory. Given any in silico output and appropriate training data, the number of false candidates allocated for validation can be dramatically reduced using a pool of machine learning algorithms. This will ultimately save time and money in the laboratory.

  3. A novel strategy for classifying the output from an in silico vaccine discovery pipeline for eukaryotic pathogens using machine learning algorithms

    PubMed Central

    2013-01-01

    Background: An in silico vaccine discovery pipeline for eukaryotic pathogens typically consists of several computational tools to predict protein characteristics. The aim of the in silico approach to discovering subunit vaccines is to use predicted characteristics to identify proteins which are worthy of laboratory investigation. A major challenge is that these predictions are inherent with hidden inaccuracies and contradictions. This study focuses on how to reduce the number of false candidates using machine learning algorithms rather than relying on expensive laboratory validation. Proteins from Toxoplasma gondii, Plasmodium sp., and Caenorhabditis elegans were used as training and test datasets. Results: The results show that machine learning algorithms can effectively distinguish expected true from expected false vaccine candidates (with an average sensitivity and specificity of 0.97 and 0.98 respectively), for proteins observed to induce immune responses experimentally. Conclusions: Vaccine candidates from an in silico approach can only be truly validated in a laboratory. Given any in silico output and appropriate training data, the number of false candidates allocated for validation can be dramatically reduced using a pool of machine learning algorithms. This will ultimately save time and money in the laboratory. PMID:24180526

  4. A novel in silico approach to drug discovery via computational intelligence.

    PubMed

    Hecht, David; Fogel, Gary B

    2009-04-01

    A computational intelligence drug discovery platform is introduced as an innovative technology designed to accelerate high-throughput drug screening for generalized protein-targeted drug discovery. This technology results in collections of novel small molecule compounds that bind to protein targets as well as details on predicted binding modes and molecular interactions. The approach was tested on dihydrofolate reductase (DHFR) for novel antimalarial drug discovery; however, the methods developed can be applied broadly in early stage drug discovery and development. For this purpose, an initial fragment library was defined, and an automated fragment assembly algorithm was generated. These were combined with a computational intelligence screening tool for prescreening of compounds relative to DHFR inhibition. The entire method was assayed relative to spaces of known DHFR inhibitors and with chemical feasibility in mind, leading to experimental validation in future studies.

  5. Knowledge-based analysis of microarrays for the discovery of transcriptional regulation relationships

    PubMed Central

    2010-01-01

    Background: The large amount of high-throughput genomic data has facilitated the discovery of the regulatory relationships between transcription factors and their target genes. While early methods for discovery of transcriptional regulation relationships from microarray data often focused on the high-throughput experimental data alone, more recent approaches have explored the integration of external knowledge bases of gene interactions. Results: In this work, we develop an algorithm that provides improved performance in the prediction of transcriptional regulatory relationships by supplementing the analysis of microarray data with a new method of integrating information from an existing knowledge base. Using a well-known dataset of yeast microarrays and the Yeast Proteome Database, a comprehensive collection of known information of yeast genes, we show that knowledge-based predictions demonstrate better sensitivity and specificity in inferring new transcriptional interactions than predictions from microarray data alone. We also show that comprehensive, direct and high-quality knowledge bases provide better prediction performance. Comparison of our results with ChIP-chip data and growth fitness data suggests that our predicted genome-wide regulatory pairs in yeast are reasonable candidates for follow-up biological verification. Conclusion: High quality, comprehensive, and direct knowledge bases, when combined with appropriate bioinformatic algorithms, can significantly improve the discovery of gene regulatory relationships from high throughput gene expression data. PMID:20122245

  6. Knowledge-based analysis of microarrays for the discovery of transcriptional regulation relationships.

    PubMed

    Seok, Junhee; Kaushal, Amit; Davis, Ronald W; Xiao, Wenzhong

    2010-01-18

    The large amount of high-throughput genomic data has facilitated the discovery of the regulatory relationships between transcription factors and their target genes. While early methods for discovery of transcriptional regulation relationships from microarray data often focused on the high-throughput experimental data alone, more recent approaches have explored the integration of external knowledge bases of gene interactions. In this work, we develop an algorithm that provides improved performance in the prediction of transcriptional regulatory relationships by supplementing the analysis of microarray data with a new method of integrating information from an existing knowledge base. Using a well-known dataset of yeast microarrays and the Yeast Proteome Database, a comprehensive collection of known information of yeast genes, we show that knowledge-based predictions demonstrate better sensitivity and specificity in inferring new transcriptional interactions than predictions from microarray data alone. We also show that comprehensive, direct and high-quality knowledge bases provide better prediction performance. Comparison of our results with ChIP-chip data and growth fitness data suggests that our predicted genome-wide regulatory pairs in yeast are reasonable candidates for follow-up biological verification. High quality, comprehensive, and direct knowledge bases, when combined with appropriate bioinformatic algorithms, can significantly improve the discovery of gene regulatory relationships from high throughput gene expression data.

  7. When drug discovery meets web search: Learning to Rank for ligand-based virtual screening.

    PubMed

    Zhang, Wei; Ji, Lijuan; Chen, Yanan; Tang, Kailin; Wang, Haiping; Zhu, Ruixin; Jia, Wei; Cao, Zhiwei; Liu, Qi

    2015-01-01

    The rapid increase in the emergence of novel chemical substances presents a substantial demand for more sophisticated computational methodologies for drug discovery. In this study, the idea of Learning to Rank from web search was applied to drug virtual screening, where it offers two unique capabilities: 1) identifying compounds for novel targets when there is not enough training data available for these targets, and 2) integrating heterogeneous data when compound affinities are measured on different platforms. A standard pipeline was designed to carry out Learning to Rank in virtual screening. Six Learning to Rank algorithms were investigated based on two public datasets collected from Binding Database and the newly-published Community Structure-Activity Resource benchmark dataset. The results have demonstrated that Learning to Rank is an efficient computational strategy for drug virtual screening, particularly due to its novel use in cross-target virtual screening and heterogeneous data integration. To the best of our knowledge, we have introduced here the first application of Learning to Rank in virtual screening. The experiment workflow and algorithm assessment designed in this study will provide a standard protocol for other similar studies. All the datasets as well as the implementations of Learning to Rank algorithms are available at http://www.tongji.edu.cn/~qiliu/lor_vs.html. Graphical Abstract: The analogy between web search and ligand-based drug discovery.

  8. Smell Detection Agent Based Optimization Algorithm

    NASA Astrophysics Data System (ADS)

    Vinod Chandra, S. S.

    2016-09-01

    In this paper, a novel nature-inspired optimization algorithm has been employed and the trained behaviour of dogs in detecting smell trails is adapted into computational agents for problem solving. The algorithm involves creation of a surface with smell trails and subsequent iteration of the agents in resolving a path. This algorithm can be applied in different computational constraints that incorporate path-based problems. Implementation of the algorithm can be treated as a shortest path problem for a variety of datasets. The simulated agents have been used to evolve the shortest path between two nodes in a graph. This algorithm is useful to solve NP-hard problems that are related to path discovery. This algorithm is also useful to solve many practical optimization problems. The extensive derivation of the algorithm can be enabled to solve shortest path problems.

  9. P-Finder: Reconstruction of Signaling Networks from Protein-Protein Interactions and GO Annotations.

    PubMed

    Young-Rae Cho; Yanan Xin; Speegle, Greg

    2015-01-01

    Because most complex genetic diseases are caused by defects of cell signaling, illuminating a signaling cascade is essential for understanding their mechanisms. We present three novel computational algorithms to reconstruct signaling networks between a starting protein and an ending protein using genome-wide protein-protein interaction (PPI) networks and gene ontology (GO) annotation data. A signaling network is represented as a directed acyclic graph in a merged form of multiple linear pathways. An advanced semantic similarity metric is applied for weighting PPIs as the preprocessing of all three methods. The first algorithm repeatedly extends the list of nodes based on path frequency towards an ending protein. The second algorithm repeatedly appends edges based on the occurrence of network motifs which indicate the link patterns more frequently appearing in a PPI network than in a random graph. The last algorithm uses the information propagation technique which iteratively updates edge orientations based on the path strength and merges the selected directed edges. Our experimental results demonstrate that the proposed algorithms achieve higher accuracy than previous methods when they are tested on well-studied pathways of S. cerevisiae. Furthermore, we introduce an interactive web application tool, called P-Finder, to visualize reconstructed signaling networks.

  10. Advances in Significance Testing for Cluster Detection

    NASA Astrophysics Data System (ADS)

    Coleman, Deidra Andrea

    Over the past two decades, much attention has been given to data driven project goals such as the Human Genome Project and the development of syndromic surveillance systems. A major component of these types of projects is analyzing the abundance of data. Detecting clusters within the data can be beneficial as it can lead to the identification of specified sequences of DNA nucleotides that are related to important biological functions or the locations of epidemics such as disease outbreaks or bioterrorism attacks. Cluster detection techniques require efficient and accurate hypothesis testing procedures. In this dissertation, we improve upon the hypothesis testing procedures for cluster detection by enhancing distributional theory and providing an alternative method for spatial cluster detection using syndromic surveillance data. In Chapter 2, we provide an efficient method to compute the exact distribution of the number and coverage of h-clumps of a collection of words. This method involves defining a Markov chain using a minimal deterministic automaton to reduce the number of states needed for computation. We allow words of the collection to contain other words of the collection making the method more general. We use our method to compute the distributions of the number and coverage of h-clumps in the Chi motif of H. influenza. In Chapter 3, we provide an efficient algorithm to compute the exact distribution of multiple window discrete scan statistics for higher-order, multi-state Markovian sequences. This algorithm involves defining a Markov chain to efficiently keep track of probabilities needed to compute p-values of the statistic. We use our algorithm to identify cases where the available approximation does not perform well. We also use our algorithm to detect unusual clusters of made free throw shots by National Basketball Association players during the 2009-2010 regular season.
In Chapter 4, we give a procedure to detect outbreaks using syndromic surveillance data while controlling the Bayesian False Discovery Rate (BFDR). The procedure entails choosing an appropriate Bayesian model that captures the spatial dependency inherent in epidemiological data and considers all days of interest, selecting a test statistic based on a chosen measure that provides the magnitude of the maximal spatial cluster for each day, and identifying a cutoff value that controls the BFDR for rejecting the collective null hypothesis of no outbreak over a collection of days for a specified region. We use our procedure to analyze botulism-like syndrome data collected by the North Carolina Disease Event Tracking and Epidemiologic Collection Tool (NC DETECT).
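
    As a rough illustration of the discrete scan statistic mentioned in this abstract, the sketch below computes the largest number of successes found in any window of fixed length over a binary sequence (for example, made versus missed free throws). It covers only a single window length and only the statistic itself, not the exact distribution or p-values that the dissertation derives; the function name `scan_statistic` and the sample data are hypothetical.

```python
def scan_statistic(seq, w):
    """Return the maximum number of successes (1s) in any window of
    length w over a binary sequence. Illustrative helper, not taken
    from the dissertation."""
    if w <= 0 or w > len(seq):
        raise ValueError("window length must be between 1 and len(seq)")
    # Count successes in the first window, then slide one step at a time.
    window = sum(seq[:w])
    best = window
    for i in range(w, len(seq)):
        window += seq[i] - seq[i - w]  # add the new element, drop the oldest
        best = max(best, window)
    return best


# Hypothetical sequence of made (1) and missed (0) shots.
shots = [1, 0, 1, 1, 1, 0, 0, 1]
print(scan_statistic(shots, 3))
```

    A large value relative to what the null model predicts would suggest an unusual cluster; judging that requires the distribution of the statistic, which this sketch does not attempt.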

  11. Open-source chemogenomic data-driven algorithms for predicting drug-target interactions.

    PubMed

    Hao, Ming; Bryant, Stephen H; Wang, Yanli

    2018-02-06

    While novel technologies such as high-throughput screening have advanced together with significant investment by pharmaceutical companies during the past decades, the success rate for drug development has not yet improved, prompting researchers to look for new drug discovery strategies. Drug repositioning is a potential approach to solve this dilemma. However, experimental identification and validation of potential drug targets encoded by the human genome is both costly and time-consuming. Therefore, effective computational approaches have been proposed to facilitate drug repositioning, which have proved to be successful in drug discovery. Doubtlessly, the availability of openly accessible data from basic chemical biology research and the success of human genome sequencing are crucial to develop effective in silico drug repositioning methods allowing the identification of potential targets for existing drugs. In this work, we review several chemogenomic data-driven computational algorithms with source codes publicly accessible for predicting drug-target interactions (DTIs). We organize these algorithms by model properties and model evolutionary relationships. We re-implemented five representative algorithms in R programming language, and compared these algorithms by means of mean percentile ranking, a new recall-based evaluation metric in the DTI prediction research field. We anticipate that this review will be objective and helpful to researchers who would like to further improve existing algorithms or need to choose appropriate algorithms to infer potential DTIs in their projects. The source codes for DTI predictions are available at: https://github.com/minghao2016/chemogenomicAlg4DTIpred. Published by Oxford University Press 2018. This work is written by US Government employees and is in the public domain in the US.
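
    The abstract does not define mean percentile ranking, so the following is only a sketch of one plausible form, written in Python rather than the R used by the authors. For each drug, candidate targets are sorted by predicted score; each known target receives a percentile of 0.0 at the top of the list and 1.0 at the bottom, and the percentiles are averaged. The function name, the data layout, and the exact normalization are assumptions made for illustration.

```python
def mean_percentile_ranking(scores, positives):
    """Average percentile position of known targets in each ranked list.

    scores:    {drug: {target: predicted_score}}  (hypothetical layout)
    positives: {drug: set of known true targets}
    Lower values mean true targets are ranked nearer the top.
    """
    percentiles = []
    for drug, candidates in scores.items():
        # Rank candidate targets from highest to lowest predicted score.
        ranked = sorted(candidates, key=candidates.get, reverse=True)
        n = len(ranked)
        if n < 2:
            continue  # a percentile is undefined for a single candidate
        position = {target: i for i, target in enumerate(ranked)}
        for target in positives.get(drug, ()):
            if target in position:
                percentiles.append(position[target] / (n - 1))
    if not percentiles:
        raise ValueError("no known targets found among the candidates")
    return sum(percentiles) / len(percentiles)


# Hypothetical predictions for one drug and three candidate targets.
scores = {"drug_a": {"t1": 0.9, "t2": 0.5, "t3": 0.1}}
print(mean_percentile_ranking(scores, {"drug_a": {"t2"}}))
```

    Under this convention, a value near 0.5 is roughly what random ranking would give, so useful predictors should score well below it.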

  12. ASPeak: an abundance sensitive peak detection algorithm for RIP-Seq.

    PubMed

    Kucukural, Alper; Özadam, Hakan; Singh, Guramrit; Moore, Melissa J; Cenik, Can

    2013-10-01

    Unlike DNA, RNA abundances can vary over several orders of magnitude. Thus, identification of RNA-protein binding sites from high-throughput sequencing data presents unique challenges. Although peak identification in ChIP-Seq data has been extensively explored, there are few bioinformatics tools tailored for peak calling on analogous datasets for RNA-binding proteins. Here we describe ASPeak (abundance sensitive peak detection algorithm), an implementation of an algorithm that we previously applied to detect peaks in exon junction complex RNA immunoprecipitation in tandem experiments. Our peak detection algorithm yields stringent and robust target sets enabling sensitive motif finding and downstream functional analyses. ASPeak is implemented in Perl as a complete pipeline that takes bedGraph files as input. ASPeak implementation is freely available at https://sourceforge.net/projects/as-peak under the GNU General Public License. ASPeak can be run on a personal computer, yet is designed to be easily parallelizable. ASPeak can also run on high performance computing clusters providing efficient speedup. The documentation and user manual can be obtained from http://master.dl.sourceforge.net/project/as-peak/manual.pdf.

  13. Electron-Poor Polar Intermetallics: Complex Structures, Novel Clusters, and Intriguing Bonding with Pronounced Electron Delocalization

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lin, Qisheng; Miller, Gordon J.

    Intermetallic compounds represent an extensive pool of candidates for energy related applications stemming from magnetic, electric, optic, caloric, and catalytic properties. The discovery of novel intermetallic compounds can enhance understanding of the chemical principles that govern structural stability and chemical bonding as well as finding new applications. Valence electron-poor polar intermetallics with valence electron concentrations (VECs) between 2.0 and 3.0 e⁻/atom show a plethora of unprecedented and fascinating structural motifs and bonding features. Furthermore, establishing simple structure-bonding-property relationships is especially challenging for this compound class because commonly accepted valence electron counting rules are inappropriate.

  14. Electron-Poor Polar Intermetallics: Complex Structures, Novel Clusters, and Intriguing Bonding with Pronounced Electron Delocalization

    DOE PAGES

    Lin, Qisheng; Miller, Gordon J.

    2017-12-18

    Intermetallic compounds represent an extensive pool of candidates for energy related applications stemming from magnetic, electric, optic, caloric, and catalytic properties. The discovery of novel intermetallic compounds can enhance understanding of the chemical principles that govern structural stability and chemical bonding as well as finding new applications. Valence electron-poor polar intermetallics with valence electron concentrations (VECs) between 2.0 and 3.0 e⁻/atom show a plethora of unprecedented and fascinating structural motifs and bonding features. Furthermore, establishing simple structure-bonding-property relationships is especially challenging for this compound class because commonly accepted valence electron counting rules are inappropriate.

  15. Progress in Biomedical Knowledge Discovery: A 25-year Retrospective

    PubMed Central

    Sacchi, L.

    2016-01-01

    Summary Objectives We sought to explore, via a systematic review of the literature, the state of the art of knowledge discovery in biomedical databases as it existed in 1992, and then now, 25 years later, mainly focused on supervised learning. Methods We performed a rigorous systematic search of PubMed and latent Dirichlet allocation to identify themes in the literature and trends in the science of knowledge discovery in and between time periods and compare these trends. We restricted the result set using a bracket of five years previous, such that the 1992 result set was restricted to articles published between 1987 and 1992, and the 2015 set between 2011 and 2015. This was to reflect the current literature available at the time to researchers and others at the target dates of 1992 and 2015. The search term was framed as: Knowledge Discovery OR Data Mining OR Pattern Discovery OR Pattern Recognition, Automated. Results A total 538 and 18,172 documents were retrieved for 1992 and 2015, respectively. The number and type of data sources increased dramatically over the observation period, primarily due to the advent of electronic clinical systems. The period 1992-2015 saw the emergence of new areas of research in knowledge discovery, and the refinement and application of machine learning approaches that were nascent or unknown in 1992. Conclusions Over the 25 years of the observation period, we identified numerous developments that impacted the science of knowledge discovery, including the availability of new forms of data, new machine learning algorithms, and new application domains. Through a bibliometric analysis we examine the striking changes in the availability of highly heterogeneous data resources, the evolution of new algorithmic approaches to knowledge discovery, and we consider from legal, social, and political perspectives possible explanations of the growth of the field. 
Finally, we reflect on the achievements of the past 25 years to consider what the next 25 years will bring with regard to the availability of even more complex data and to the methods that could be, and are being now developed for the discovery of new knowledge in biomedical data. PMID:27488403

  16. Progress in Biomedical Knowledge Discovery: A 25-year Retrospective.

    PubMed

    Sacchi, L; Holmes, J H

    2016-08-02

    We sought to explore, via a systematic review of the literature, the state of the art of knowledge discovery in biomedical databases as it existed in 1992, and then now, 25 years later, mainly focused on supervised learning. We performed a rigorous systematic search of PubMed and latent Dirichlet allocation to identify themes in the literature and trends in the science of knowledge discovery in and between time periods and compare these trends. We restricted the result set using a bracket of five years previous, such that the 1992 result set was restricted to articles published between 1987 and 1992, and the 2015 set between 2011 and 2015. This was to reflect the current literature available at the time to researchers and others at the target dates of 1992 and 2015. The search term was framed as: Knowledge Discovery OR Data Mining OR Pattern Discovery OR Pattern Recognition, Automated. A total 538 and 18,172 documents were retrieved for 1992 and 2015, respectively. The number and type of data sources increased dramatically over the observation period, primarily due to the advent of electronic clinical systems. The period 1992-2015 saw the emergence of new areas of research in knowledge discovery, and the refinement and application of machine learning approaches that were nascent or unknown in 1992. Over the 25 years of the observation period, we identified numerous developments that impacted the science of knowledge discovery, including the availability of new forms of data, new machine learning algorithms, and new application domains. Through a bibliometric analysis we examine the striking changes in the availability of highly heterogeneous data resources, the evolution of new algorithmic approaches to knowledge discovery, and we consider from legal, social, and political perspectives possible explanations of the growth of the field. 
Finally, we reflect on the achievements of the past 25 years to consider what the next 25 years will bring with regard to the availability of even more complex data and to the methods that could be, and are being now developed for the discovery of new knowledge in biomedical data.

  17. Automatic graph-cut based segmentation of bones from knee magnetic resonance images for osteoarthritis research.

    PubMed

    Ababneh, Sufyan Y; Prescott, Jeff W; Gurcan, Metin N

    2011-08-01

    In this paper, a new, fully automated, content-based system is proposed for knee bone segmentation from magnetic resonance images (MRI). The purpose of the bone segmentation is to support the discovery and characterization of imaging biomarkers for the incidence and progression of osteoarthritis, a debilitating joint disease, which affects a large portion of the aging population. The segmentation algorithm includes a novel content-based, two-pass disjoint block discovery mechanism, which is designed to support automation, segmentation initialization, and post-processing. The block discovery is achieved by classifying the image content to bone and background blocks according to their similarity to the categories in the training data collected from typical bone structures. The classified blocks are then used to design an efficient graph-cut based segmentation algorithm. This algorithm requires constructing a graph using image pixel data followed by applying a maximum-flow algorithm which generates a minimum graph-cut that corresponds to an initial image segmentation. Content-based refinements and morphological operations are then applied to obtain the final segmentation. The proposed segmentation technique does not require any user interaction and can distinguish between bone and highly similar adjacent structures, such as fat tissues with high accuracy. The performance of the proposed system is evaluated by testing it on 376 MR images from the Osteoarthritis Initiative (OAI) database. This database included a selection of single images containing the femur and tibia from 200 subjects with varying levels of osteoarthritis severity. Additionally, a full three-dimensional segmentation of the bones from ten subjects with 14 slices each, and synthetic images with background having intensity and spatial characteristics similar to those of bone are used to assess the robustness and consistency of the developed algorithm. 
The results show an automatic bone detection rate of 0.99 and an average segmentation accuracy of 0.95 using the Dice similarity index. Copyright © 2011 Elsevier B.V. All rights reserved.
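
    For readers unfamiliar with the Dice similarity index used to report segmentation accuracy above, here is a minimal sketch of how it is computed for two binary masks: twice the overlap divided by the total number of foreground pixels in both masks. The function name `dice_index`, the flat-list mask representation, and the sample masks are illustrative assumptions and are not taken from the paper.

```python
def dice_index(predicted, reference):
    """Dice similarity index between two binary masks given as
    equal-length flat sequences of 0/1 values."""
    if len(predicted) != len(reference):
        raise ValueError("masks must have the same number of pixels")
    # Pixels labelled foreground in both masks.
    overlap = sum(1 for p, r in zip(predicted, reference) if p and r)
    total = sum(1 for p in predicted if p) + sum(1 for r in reference if r)
    if total == 0:
        return 1.0  # both masks empty: treated here as perfect agreement
    return 2.0 * overlap / total


# Hypothetical 2x2 masks flattened row by row.
predicted = [1, 1, 0, 0]
reference = [1, 0, 0, 0]
print(dice_index(predicted, reference))
```

    The index ranges from 0 (no overlap) to 1 (identical masks), so the reported average of 0.95 indicates close agreement with the reference segmentations.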

  18. Computer-Aided Discovery Tools for Volcano Deformation Studies with InSAR and GPS

    NASA Astrophysics Data System (ADS)

    Pankratius, V.; Pilewskie, J.; Rude, C. M.; Li, J. D.; Gowanlock, M.; Bechor, N.; Herring, T.; Wauthier, C.

    2016-12-01

    We present a Computer-Aided Discovery approach that facilitates the cloud-scalable fusion of different data sources, such as GPS time series and Interferometric Synthetic Aperture Radar (InSAR), for the purpose of identifying the expansion centers and deformation styles of volcanoes. The tools currently developed at MIT allow the definition of alternatives for data processing pipelines that use various analysis algorithms. The Computer-Aided Discovery system automatically generates algorithmic and parameter variants to help researchers explore multidimensional data processing search spaces efficiently. We present first application examples of this technique using GPS data on volcanoes on the Aleutian Islands and work in progress on combined GPS and InSAR data in Hawaii. In the model search context, we also illustrate work in progress combining time series Principal Component Analysis with InSAR augmentation to constrain the space of possible model explanations on current empirical data sets and achieve a better identification of deformation patterns. This work is supported by NASA AIST-NNX15AG84G and NSF ACI-1442997 (PI: V. Pankratius).

  19. Materials Discovery | Materials Science | NREL

    Science.gov Websites

    measurement methods and specialized analysis algorithms. Projects Basic Research The basic research projects applications using high-throughput combinatorial research methods.

  20. Higher-Order Motion Inputs For Visual Figure Tracking: Control Algorithms and Neural Circuits

    DTIC Science & Technology

    2015-05-30

    [Table of contents excerpt] Accomplishments / New Findings ... Posters ... Consultative and advisory functions ... New Discoveries, Inventions, or Patent Disclosures

  1. iPTF Discoveries of Recent Type Ia Supernovae

    NASA Astrophysics Data System (ADS)

    Papadogiannakis, S.; Taddia, F.; Petrushevska, T.; Ferretti, R.; Fremling, C.; Karamehmetoglu, E.; Nyholm, A.; Roy, R.; Hangard, L.; Vreeswijk, P.; Horesh, A.; Manulis, I.; Rubin, A.; Yaron, O.; Leloudas, G.; Khazov, D.; Soumagnac, M.; Knezevic, S.; Johansson, J.; Nir, G.; Cao, Y.; Blagorodnova, N.; Kulkarni, S.

    2016-05-01

    The intermediate Palomar Transient Factory (ATel #4807) reports the discovery and classification of the following Type Ia SNe. Our automated candidate vetting to distinguish a real astrophysical source (1.0) from bogus artefacts (0.0) is powered by three generations of machine learning algorithms: RB2 (Brink et al. 2013MNRAS.435.1047B), RB4 (Rebbapragada et al. 2015AAS...22543402R) and RB5 (Wozniak et al. 2013AAS...22143105W).

  2. iPTF Discoveries of Recent Core-Collapse Supernovae

    NASA Astrophysics Data System (ADS)

    Taddia, F.; Ferretti, R.; Papadogiannakis, S.; Petrushevska, T.; Fremling, C.; Karamehmetoglu, E.; Nyholm, A.; Roy, R.; Hangard, L.; Horesh, A.; Khazov, D.; Knezevic, S.; Johansson, J.; Leloudas, G.; Manulis, I.; Rubin, A.; Soumagnac, M.; Vreeswijk, P.; Yaron, O.; Bar, I.; Cao, Y.; Kulkarni, S.; Blagorodnova, N.

    2016-05-01

    The intermediate Palomar Transient Factory (ATel #4807) reports the discovery and classification of the following core-collapse SNe. Our automated candidate vetting to distinguish a real astrophysical source (1.0) from bogus artifacts (0.0) is powered by three generations of machine learning algorithms: RB2 (Brink et al. 2013MNRAS.435.1047B), RB4 (Rebbapragada et al. 2015AAS...22543402R) and RB5 (Wozniak et al. 2013AAS...22143105W).

  3. iPTF Discoveries of Recent Core-Collapse Supernovae

    NASA Astrophysics Data System (ADS)

    Taddia, F.; Ferretti, R.; Fremling, C.; Karamehmetoglu, E.; Nyholm, A.; Papadogiannakis, S.; Petrushevska, T.; Roy, R.; Hangard, L.; De Cia, A.; Vreeswijk, P.; Horesh, A.; Manulis, I.; Sagiv, I.; Rubin, A.; Yaron, O.; Leloudas, G.; Khazov, D.; Soumagnac, M.; Bilgi, P.

    2015-04-01

    The intermediate Palomar Transient Factory (ATel #4807) reports the discovery and classification of the following Core-Collapse SNe. Our automated candidate vetting to distinguish a real astrophysical source (1.0) from bogus artifacts (0.0) is powered by three generations of machine learning algorithms: RB2 (Brink et al. 2013MNRAS.435.1047B), RB4 (Rebbapragada et al. 2015AAS...22543402R) and RB5 (Wozniak et al. 2013AAS...22143105W).

  4. iPTF Discoveries of Recent Type Ia Supernova

    NASA Astrophysics Data System (ADS)

    Petrushevska, T.; Ferretti, R.; Fremling, C.; Hangard, L.; Karamehmetoglu, E.; Nyholm, A.; Papadogiannakis, S.; Roy, R.; Horesh, A.; Khazov, D.; Knezevic, S.; Johansson, J.; Leloudas, G.; Manulis, I.; Rubin, A.; Soumagnac, M.; Vreeswijk, P.; Yaron, O.; Bilgi, P.; Cao, Y.; Duggan, G.; Lunnan, R.; Andreoni, I.

    2015-10-01

    The intermediate Palomar Transient Factory (ATel #4807) reports the discovery and classification of the following Type Ia SNe. Our automated candidate vetting to distinguish a real astrophysical source (1.0) from bogus artifacts (0.0) is powered by three generations of machine learning algorithms: RB2 (Brink et al. 2013MNRAS.435.1047B), RB4 (Rebbapragada et al. 2015AAS...22543402R) and RB5 (Wozniak et al. 2013AAS...22143105W).

  5. iPTF Discoveries of Recent SNe Ia

    NASA Astrophysics Data System (ADS)

    Ferretti, R.; Fremling, C.; Johansson, J.; Karamehmetoglu, E.; Migotto, K.; Nyholm, A.; Papadogiannakis, S.; Taddia, F.; Petrushevska, T.; Roy, R.; Ben-Ami, S.; De Cia, A.; Dzigan, Y.; Horesh, A.; Khazov, D.; Manulis, I.; Rubin, A.; Sagiv, I.; Vreeswijk, P.; Yaron, O.; Bilgi, P.; Cao, Y.; Duggan, G.

    2015-02-01

    The intermediate Palomar Transient Factory (ATel #4807) reports the discovery and classification of the following Type Ia SNe. Our automated candidate vetting to distinguish a real astrophysical source (1.0) from bogus artifacts (0.0) is powered by three generations of machine learning algorithms: RB2 (Brink et al. 2013MNRAS.435.1047B), RB4 (Rebbapragada et al. 2015AAS...22543402R) and RB5 (Wozniak et al. 2013AAS...22143105W).

  6. iPTF Discoveries of Recent Type Ia Supernovae

    NASA Astrophysics Data System (ADS)

    Papadogiannakis, S.; Taddia, F.; Ferretti, R.; Fremling, C.; Karamehmetoglu, E.; Petrushevska, T.; Nyholm, A.; Roy, R.; Hangard, L.; Vreeswijk, P.; Horesh, A.; Manulis, I.; Rubin, A.; Yaron, O.; Leloudas, G.; Khazov, D.; Soumagnac, M.; Knezevic, S.; Johansson, J.; Lunnan, R.; Blagorodnova, N.; Cao, Y.; Cenk, S. B.

    2016-01-01

    The intermediate Palomar Transient Factory (ATel #4807) reports the discovery and classification of the following Type Ia SNe. Our automated candidate vetting to distinguish a real astrophysical source (1.0) from bogus artifacts (0.0) is powered by three generations of machine learning algorithms: RB2 (Brink et al. 2013MNRAS.435.1047B), RB4 (Rebbapragada et al. 2015AAS...22543402R) and RB5 (Wozniak et al. 2013AAS...22143105W).

  7. iPTF Discoveries of Recent Type Ia Supernovae

    NASA Astrophysics Data System (ADS)

    Ferretti, R.; Fremling, C.; Hangard, L.; Karamehmetoglu, E.; Nyholm, A.; Papadogiannakis, S.; Petrushevska, T.; Roy, R.; Taddia, F.; Horesh, A.; Khazov, D.; Knezevic, S.; Leloudas, G.; Manulis, I.; Rubin, A.; Soumagnac, M.; Vreeswijk, P.; Yaron, O.; Cao, Y.; Duggan, G.; Lunnan, R.; Blagorodnova, N.

    2015-11-01

    The intermediate Palomar Transient Factory (ATel #4807) reports the discovery and classification of the following Type Ia SNe. Our automated candidate vetting to distinguish a real astrophysical source (1.0) from bogus artifacts (0.0) is powered by three generations of machine learning algorithms: RB2 (Brink et al. 2013MNRAS.435.1047B), RB4 (Rebbapragada et al. 2015AAS...22543402R) and RB5 (Wozniak et al. 2013AAS...22143105W).

  8. iPTF Discoveries of Recent Core-Collapse Supernovae

    NASA Astrophysics Data System (ADS)

    Taddia, F.; Ferretti, R.; Fremling, C.; Karamehmetoglu, E.; Nyholm, A.; Papadogiannakis, S.; Petrushevska, T.; Roy, R.; Hangard, L.; Vreeswijk, P.; Horesh, A.; Manulis, I.; Rubin, A.; Yaron, O.; Leloudas, G.; Khazov, D.; Soumagnac, M.; Knezevic, S.; Johansson, J.; Duggan, G.; Lunnan, R.; Cao, Y.

    2015-09-01

    The intermediate Palomar Transient Factory (ATel #4807) reports the discovery and classification of the following Core-Collapse SNe. Our automated candidate vetting to distinguish a real astrophysical source (1.0) from bogus artifacts (0.0) is powered by three generations of machine learning algorithms: RB2 (Brink et al. 2013MNRAS.435.1047B), RB4 (Rebbapragada et al. 2015AAS...22543402R) and RB5 (Wozniak et al. 2013AAS...22143105W).

  9. iPTF Discovery of Recent Type Ia Supernovae

    NASA Astrophysics Data System (ADS)

    Hangard, L.; Ferretti, R.; Fremling, C.; Karamehmetoglu, E.; Nyholm, A.; Papadogiannakis, S.; Petrushevska, T.; Roy, R.; Bar, I.; Horesh, A.; Johansson, J.; Khazov, D.; Knezevic, S.; Leloudas, G.; Manulis, I.; Rubin, A.; Soumagnac, M.; Vreeswijk, P.; Yaron, O.; Cao, Y.; Kulkarni, S.; Lunnan, R.; Ravi, V.; Vedantham, H. K.; Yan, L.

    2016-04-01

    The intermediate Palomar Transient Factory (ATel #4807) reports the discovery and classification of the following Type Ia SNe. Our automated candidate vetting to distinguish a real astrophysical source (1.0) from bogus artifacts (0.0) is powered by three generations of machine learning algorithms: RB2 (Brink et al. 2013MNRAS.435.1047B), RB4 (Rebbapragada et al. 2015AAS...22543402R) and RB5 (Wozniak et al. 2013AAS...22143105W).

  10. iPTF Discoveries of Recent Core-Collapse Supernovae

    NASA Astrophysics Data System (ADS)

    Taddia, F.; Ferretti, R.; Fremling, C.; Karamehmetoglu, E.; Nyholm, A.; Papadogiannakis, S.; Petrushevska, T.; Roy, R.; Hangard, L.; Vreeswijk, P.; Horesh, A.; Manulis, I.; Rubin, A.; Yaron, O.; Leloudas, G.; Khazov, D.; Soumagnac, M.; Knezevic, S.; Johansson, J.; Lunnan, R.; Cao, Y.; Miller, A.

    2015-11-01

    The intermediate Palomar Transient Factory (ATel #4807) reports the discovery and classification of the following Core-Collapse SNe. Our automated candidate vetting to distinguish a real astrophysical source (1.0) from bogus artifacts (0.0) is powered by three generations of machine learning algorithms: RB2 (Brink et al. 2013MNRAS.435.1047B), RB4 (Rebbapragada et al. 2015AAS...22543402R) and RB5 (Wozniak et al. 2013AAS...22143105W).

  11. iPTF Discoveries of Recent Type Ia Supernovae

    NASA Astrophysics Data System (ADS)

    Petrushevska, T.; Ferretti, R.; Fremling, C.; Hangard, L.; Karamehmetoglu, E.; Nyholm, A.; Papadogiannakis, S.; Roy, R.; Horesh, A.; Khazov, D.; Knezevic, S.; Johansson, J.; Leloudas, G.; Manulis, I.; Rubin, A.; Soumagnac, M.; Vreeswijk, P.; Yaron, O.; Bilgi, P.; Cao, Y.; Duggan, G.; Lunnan, R.

    2016-02-01

    The intermediate Palomar Transient Factory (ATel #4807) reports the discovery and classification of the following Type Ia SNe. Our automated candidate vetting to distinguish a real astrophysical source (1.0) from bogus artifacts (0.0) is powered by three generations of machine learning algorithms: RB2 (Brink et al. 2013MNRAS.435.1047B), RB4 (Rebbapragada et al. 2015AAS...22543402R) and RB5 (Wozniak et al. 2013AAS...22143105W).

  12. iPTF Discovery of Recent Type Ia Supernovae

    NASA Astrophysics Data System (ADS)

    Hangard, L.; Taddia, F.; Ferretti, R.; Papadogiannakis, S.; Petrushevska, T.; Fremling, C.; Karamehmetoglu, E.; Nyholm, A.; Roy, R.; Horesh, A.; Khazov, D.; Knezevic, S.; Johansson, J.; Leloudas, G.; Manulis, I.; Rubin, A.; Soumagnac, M.; Vreeswijk, P.; Yaron, O.; Bar, I.; Lunnan, R.; Cenk, S. B.

    2016-02-01

    The intermediate Palomar Transient Factory (ATel #4807) reports the discovery and classification of the following Type Ia SNe. Our automated candidate vetting to distinguish a real astrophysical source (1.0) from bogus artifacts (0.0) is powered by three generations of machine learning algorithms: RB2 (Brink et al. 2013MNRAS.435.1047B), RB4 (Rebbapragada et al. 2015AAS...22543402R) and RB5 (Wozniak et al. 2013AAS...22143105W).

  13. iPTF Discoveries of Recent Type Ia Supernovae

    NASA Astrophysics Data System (ADS)

    Papadogiannakis, S.; Fremling, C.; Hangard, L.; Karamehmetoglu, E.; Nyholm, A.; Ferretti, R.; Petrushevska, T.; Roy, R.; Taddia, F.; Bar, I.; Horesh, A.; Johansson, J.; Knezevic, S.; Leloudas, G.; Manulis, I.; Nir, G.; Rubin, A.; Soumagnac, M.; Vreeswijk, P.; Yaron, O.; Arcavi, I.; Howell, D. A.; McCully, C.; Hosseinzadeh, G.; Valenti, S.; Blagorodnova, N.; Cao, Y.; Duggan, G.; Ravi, V.; Lunnan, R.

    2016-03-01

    The intermediate Palomar Transient Factory (ATel #4807) reports the discovery and classification of the following Type Ia SNe. Our automated candidate vetting to distinguish a real astrophysical source (1.0) from bogus artifacts (0.0) is powered by three generations of machine learning algorithms: RB2 (Brink et al. 2013MNRAS.435.1047B), RB4 (Rebbapragada et al. 2015AAS...22543402R) and RB5 (Wozniak et al. 2013AAS...22143105W).

  14. iPTF discoveries of recent type Ia supernovae

    NASA Astrophysics Data System (ADS)

    Papadogiannakis, S.; Ferretti, R.; Fremling, C.; Hangard, L.; Karamehmetoglu, E.; Nyholm, A.; Petrushevska, T.; Roy, R.; De Cia, A.; Vreeswijk, P.; Horesh, A.; Manulis, I.; Sagiv, I.; Rubin, A.; Yaron, O.; Leloudas, G.; Khazov, D.; Soumagnac, M.; Knezevic, S.; Cenko, S. B.; Capone, J.; Bartakk, M.

    2015-09-01

    The intermediate Palomar Transient Factory (ATel #4807) reports the discovery and classification of the following Type Ia SNe. Our automated candidate vetting to distinguish a real astrophysical source (1.0) from bogus artifacts (0.0) is powered by three generations of machine learning algorithms: RB2 (Brink et al. 2013MNRAS.435.1047B), RB4 (Rebbapragada et al. 2015AAS...22543402R) and RB5 (Wozniak et al. 2013AAS...22143105W).

  15. iPTF Discovery of Recent Type Ia Supernova

    NASA Astrophysics Data System (ADS)

    Hangard, L.; Petrushevska, T.; Papadogiannakis, S.; Ferretti, R.; Fremling, C.; Karamehmetoglu, E.; Nyholm, A.; Roy, R.; Horesh, A.; Khazov, D.; Knezevic, S.; Johansson, J.; Leloudas, G.; Manulis, I.; Rubin, A.; Soumagnac, M.; Vreeswijk, P.; Yaron, O.; Kasliwal, M.

    2015-10-01

    The intermediate Palomar Transient Factory (ATel #4807) reports the discovery and classification of the following Type Ia SNe. Our automated candidate vetting to distinguish a real astrophysical source (1.0) from bogus artifacts (0.0) is powered by three generations of machine learning algorithms: RB2 (Brink et al. 2013MNRAS.435.1047B), RB4 (Rebbapragada et al. 2015AAS...22543402R) and RB5 (Wozniak et al. 2013AAS...22143105W).

  16. iPTF Discoveries of Recent Type Ia Supernovae

    NASA Astrophysics Data System (ADS)

    Petrushevska, T.; Ferretti, R.; Fremling, C.; Hangard, L.; Karamehmetoglu, E.; Nyholm, A.; Papadogiannakis, S.; Roy, R.; Horesh, A.; Khazov, D.; Knezevic, S.; Johansson, J.; Leloudas, G.; Manulis, I.; Rubin, A.; Soumagnac, M.; Vreeswijk, P.; Yaron, O.; Bilgi, P.; Cao, Y.; Duggan, G.; Lunnan, R.; Neill, J. D.; Walters, R.

    2016-04-01

    The intermediate Palomar Transient Factory (ATel #4807) reports the discovery and classification of the following Type Ia SNe. Our automated candidate vetting to distinguish a real astrophysical source (1.0) from bogus artifacts (0.0) is powered by three generations of machine learning algorithms: RB2 (Brink et al. 2013MNRAS.435.1047B), RB4 (Rebbapragada et al. 2015AAS...22543402R) and RB5 (Wozniak et al. 2013AAS...22143105W).

  17. iPTF Discoveries of Recent Type Ia Supernovae

    NASA Astrophysics Data System (ADS)

    Papadogiannakis, S.; Taddia, F.; Petrushevska, T.; Fremling, C.; Hangard, L.; Johansson, J.; Karamehmetoglu, E.; Migotto, K.; Nyholm, A.; Roy, R.; Ben-Ami, S.; De Cia, A.; Dzigan, Y.; Horesh, A.; Khazov, D.; Soumagnac, M.; Manulis, I.; Rubin, A.; Sagiv, I.; Vreeswijk, P.; Yaron, O.; Bond, H.; Bilgi, P.; Cao, Y.; Duggan, G.

    2015-03-01

    The intermediate Palomar Transient Factory (ATel #4807) reports the discovery and classification of the following Type Ia SNe. Our automated candidate vetting to distinguish a real astrophysical source (1.0) from bogus artifacts (0.0) is powered by three generations of machine learning algorithms: RB2 (Brink et al. 2013MNRAS.435.1047B), RB4 (Rebbapragada et al. 2015AAS...22543402R) and RB5 (Wozniak et al. 2013AAS...22143105W).

  18. iPTF Discovery of Recent Type Ia Supernovae

    NASA Astrophysics Data System (ADS)

    Hangard, L.; Ferretti, R.; Papadogiannakis, S.; Petrushevska, T.; Fremling, C.; Karamehmetoglu, E.; Nyholm, A.; Roy, R.; Horesh, A.; Khazov, D.; Knezevic, S.; Johansson, J.; Leloudas, G.; Manulis, I.; Rubin, A.; Soumagnac, M.; Vreeswijk, P.; Yaron, O.; Cook, D.

    2015-12-01

    The intermediate Palomar Transient Factory (ATel #4807) reports the discovery and classification of the following Type Ia SNe. Our automated candidate vetting to distinguish a real astrophysical source (1.0) from bogus artifacts (0.0) is powered by three generations of machine learning algorithms: RB2 (Brink et al. 2013MNRAS.435.1047B), RB4 (Rebbapragada et al. 2015AAS...22543402R) and RB5 (Wozniak et al. 2013AAS...22143105W).

  19. iPTF Discoveries of Recent Type Ia Supernova

    NASA Astrophysics Data System (ADS)

    Petrushevska, T.; Ferretti, R.; Fremling, C.; Hangard, L.; Karamehmetoglu, E.; Nyholm, A.; Papadogiannakis, S.; Roy, R.; Horesh, A.; Khazov, D.; Knezevic, S.; Johansson, J.; Leloudas, G.; Manulis, I.; Rubin, A.; Soumagnac, M.; Vreeswijk, P.; Yaron, O.; Bilgi, P.; Cao, Y.; Duggan, G.; Lunnan, R.; Jencson, J.

    2015-11-01

    The intermediate Palomar Transient Factory (ATel #4807) reports the discovery and classification of the following Type Ia SNe. Our automated candidate vetting to distinguish a real astrophysical source (1.0) from bogus artifacts (0.0) is powered by three generations of machine learning algorithms: RB2 (Brink et al. 2013MNRAS.435.1047B), RB4 (Rebbapragada et al. 2015AAS...22543402R) and RB5 (Wozniak et al. 2013AAS...22143105W).

  20. Discovery of Critical Residues for Viral Entry and Inhibition through Structural Insight of HIV-1 Fusion Inhibitor CP621–652

    PubMed Central

    Chong, Huihui; Yao, Xue; Qiu, Zonglin; Qin, Bo; Han, Ruiyun; Waltersperger, Sandro; Wang, Meitian; Cui, Sheng; He, Yuxian

    2012-01-01

    The core structure of HIV-1 gp41 is a stable six-helix bundle (6-HB) folded by its trimeric N- and C-terminal heptad repeats (NHR and CHR). We previously identified that the 621QIWNNMT627 motif located at the upstream region of gp41 CHR plays critical roles for the stabilization of the 6-HB core and that peptide CP621–652 containing this motif is a potent HIV-1 fusion inhibitor; however, the molecular determinants underlying the stability and anti-HIV activity remained elusive. In this study, we determined the high-resolution crystal structure of CP621–652 complexed by T21. We find that the 621QIWNNMT627 motif does not maintain the α-helical conformation. Instead, residues Met626 and Thr627 form a unique hook-like structure (denoted as M-T hook), in which Thr627 redirects the peptide chain to position Met626 above the left side of the hydrophobic pocket on the NHR trimer. The side chain of Met626 caps the hydrophobic pocket, stabilizing the interaction between the pocket and the pocket-binding domain. Our mutagenesis studies demonstrate that mutations of the M-T hook residues could completely abolish HIV-1 Env-mediated cell fusion and virus entry, and significantly destabilize the interaction of NHR and CHR peptides and reduce the anti-HIV activity of CP621–652. Our results identify an unusual structural feature that stabilizes the six-helix bundle, providing novel insights into the mechanisms of HIV-1 fusion and inhibition. PMID:22511760

  1. Functional identification of a Lippia dulcis bornyl diphosphate synthase that contains a duplicated, inhibitory arginine-rich motif.

    PubMed

    Hurd, Matthew C; Kwon, Moonhyuk; Ro, Dae-Kyun

    2017-08-26

    Lippia dulcis (Aztec sweet herb) contains the potent natural sweetener hernandulcin, a sesquiterpene ketone found in the leaves and flowers. Utilizing the leaves for agricultural application is challenging due to the presence of the bitter-tasting and toxic monoterpene, camphor. To unlock the commercial potential of L. dulcis leaves, the first step of camphor biosynthesis by a bornyl diphosphate synthase needs to be elucidated. Two putative monoterpene synthases (LdTPS3 and LdTPS9) were isolated from L. dulcis leaf cDNA. To elucidate their catalytic functions, E. coli-produced recombinant enzymes with truncations of their chloroplast transit peptides were assayed with geranyl diphosphate (GPP). In vitro enzyme assays showed that LdTPS3 encodes bornyl diphosphate synthase (thus named LdBPPS) while LdTPS9 encodes linalool synthase. Interestingly, the N-terminus of LdBPPS possesses two arginine-rich (RRX8W) motifs, and enzyme assays showed that the presence of both RRX8W motifs completely inhibits the catalytic activity of LdBPPS. Only after the removal of the putative chloroplast transit peptide and the first RRX8W motif could LdBPPS react with GPP to produce bornyl diphosphate. LdBPPS is distantly related to the known bornyl diphosphate synthase from sage in a phylogenetic analysis, indicating a convergent evolution of camphor biosynthesis in sage and L. dulcis. The discovery of LdBPPS opens up the possibility of engineering L. dulcis to remove the undesirable product, camphor. Copyright © 2017 Elsevier Inc. All rights reserved.
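
    As a rough illustration of how a duplicated arginine-rich motif of this kind could be located in a sequence, the sketch below scans for two arginines, eight arbitrary residues, and a tryptophan. That reading of the motif notation is an assumption, and the function name and the toy sequence are hypothetical; the sequence is not LdBPPS.

```python
import re

# Assumed reading of the motif: R, R, any eight residues, then W.
# The lookahead lets overlapping occurrences be reported as well.
RRX8W = re.compile(r"(?=(RR[A-Z]{8}W))")

def find_rrx8w(seq):
    """Return the 0-based start position of every RR-X8-W occurrence."""
    return [m.start() for m in RRX8W.finditer(seq.upper())]

# Toy N-terminal fragment (made up for illustration) carrying two copies.
toy_nterm = "MRRSANYEPNSWGGRRPTGDHHSNWDD"
starts = find_rrx8w(toy_nterm)
print(starts)
```

    A sequence in which the function reports two start positions would correspond to the duplicated-motif situation the abstract describes; whether both copies are inhibitory can only be settled by the enzyme assays.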

  2. Autoinhibition and signaling by the switch II motif in the G-protein chaperone of a radical B12 enzyme.

    PubMed

    Lofgren, Michael; Koutmos, Markos; Banerjee, Ruma

    2013-10-25

    MeaB is an accessory GTPase protein involved in the assembly, protection, and reactivation of 5'-deoxyadenosyl cobalamin-dependent methylmalonyl-CoA mutase (MCM). Mutations in the human ortholog of MeaB result in methylmalonic aciduria, an inborn error of metabolism. G-proteins typically utilize conserved switch I and II motifs for signaling to effector proteins via conformational changes elicited by nucleotide binding and hydrolysis. Our recent discovery that MeaB utilizes an unusual switch III region for bidirectional signaling with MCM raised questions about the roles of the switch I and II motifs in MeaB. In this study, we addressed the functions of conserved switch II residues by performing alanine-scanning mutagenesis. Our results demonstrate that the GTPase activity of MeaB is autoinhibited by switch II and that this loop is important for coupling nucleotide-sensitive conformational changes in switch III to elicit the multiple chaperone functions of MeaB. Furthermore, we report the structure of MeaB·GDP crystallized in the presence of AlFx(-) to form the putative transition state analog, GDP·AlF4(-). The resulting crystal structure and its comparison with related G-proteins support the conclusion that the catalytic site of MeaB is incomplete in the absence of the GTPase-activating protein MCM and therefore unable to stabilize the transition state analog. Favoring an inactive conformation in the absence of the client MCM protein might represent a strategy for suppressing the intrinsic GTPase activity of MeaB in which the switch II loop plays an important role.

  3. RNA-protein binding motifs mining with a new hybrid deep learning based cross-domain knowledge integration approach.

    PubMed

    Pan, Xiaoyong; Shen, Hong-Bin

    2017-02-28

    RNAs play key roles in cells through the interactions with proteins known as the RNA-binding proteins (RBP) and their binding motifs enable crucial understanding of the post-transcriptional regulation of RNAs. How the RBPs correctly recognize the target RNAs and why they bind specific positions is still far from clear. Machine learning-based algorithms are widely acknowledged to be capable of speeding up this process. Although many automatic tools have been developed to predict the RNA-protein binding sites from the rapidly growing multi-resource data, e.g. sequence, structure, their domain specific features and formats have posed significant computational challenges. One of the current difficulties is that the cross-source shared common knowledge is at a higher abstraction level beyond the observed data, resulting in a low efficiency of direct integration of observed data across domains. The other difficulty is how to interpret the prediction results. Existing approaches tend to terminate after outputting the potential discrete binding sites on the sequences, but how to assemble them into meaningful binding motifs is a topic worthy of further investigation. In view of these challenges, we propose a deep learning-based framework (iDeep) that uses a novel hybrid convolutional neural network and deep belief network to predict the RBP interaction sites and motifs on RNAs. This new protocol is characterized by transforming the original observed data into a high-level abstraction feature space using multiple layers of learning blocks, where the shared representations across different domains are integrated. To validate our iDeep method, we performed experiments on 31 large-scale CLIP-seq datasets, and our results show that by integrating multiple sources of data, the average AUC can be improved by 8% compared to the best single-source-based predictor; and through cross-domain knowledge integration at an abstraction level, it outperforms the state-of-the-art predictors by 6%. Besides the overall enhanced prediction performance, the convolutional neural network module embedded in iDeep is also able to automatically capture the interpretable binding motifs for RBPs. Large-scale experiments demonstrate that these mined binding motifs agree well with the experimentally verified results, suggesting iDeep is a promising approach in real-world applications. The iDeep framework not only achieves better performance than the state-of-the-art predictors, but also easily captures interpretable binding motifs. iDeep is available at http://www.csbio.sjtu.edu.cn/bioinf/iDeep.

  4. Computational mining for hypothetical patterns of amino acid side chains in protein data bank (PDB)

    NASA Astrophysics Data System (ADS)

    Ghani, Nur Syatila Ab; Firdaus-Raih, Mohd

    2018-04-01

    The three-dimensional structure of a protein can provide insights regarding its function. Functional relationships between proteins can be inferred from fold and sequence similarities. In certain cases, sequence or fold comparison fails to establish homology between proteins with similar mechanisms. Since structure is more conserved than sequence, a constellation of functional residues can be similarly arranged among proteins of similar mechanism. Local structural similarity searches are able to detect such constellations of amino acids among distinct proteins, which can be useful to annotate proteins of unknown function. Detection of such patterns of amino acids on a large scale can increase the repertoire of important 3D motifs, since the currently known 3D motifs cannot keep pace with the ever-increasing number of uncharacterized proteins to be annotated. Here, a computational platform for the automated detection of 3D motifs is described. A fuzzy-pattern searching algorithm derived from IMagine an Amino Acid 3D Arrangement search EnGINE (IMAAAGINE) was implemented to develop an automated method for searching for hypothetical patterns of amino acid side chains in the Protein Data Bank (PDB), without the need for prior knowledge of the related sequence or structure of the pattern of interest. We present an example of the searches, which is the detection of a hypothetical pattern derived from the known C2H2 structural motif of zinc fingers. The conservation of particular patterns of amino acid side chains in unrelated proteins is highlighted. This approach can act as a complementary method for available structure- and sequence-based platforms and may contribute to improving functional association between proteins.

  5. Deciphering common recognition principles of nucleoside mono/di and tri-phosphates binding in diverse proteins via structural matching of their binding sites.

    PubMed

    Bhagavat, Raghu; Srinivasan, Narayanaswamy; Chandra, Nagasuma

    2017-09-01

    Nucleoside triphosphate (NTP) ligands are of high biological importance and are essential for all life forms. A pre-requisite for them to participate in diverse biochemical processes is their recognition by diverse proteins. It is thus of great interest to understand the basis for such recognition in different proteins. Towards this, we have used a structural bioinformatics approach and analyzed structures of 4677 NTP complexes available in the Protein Data Bank (PDB). Binding sites were extracted and compared exhaustively using PocketMatch, a sensitive in-house site comparison algorithm, which resulted in grouping the entire dataset into 27 site-types. Each of these site-types represents a structural motif comprising two or more residue conservations, derived using another in-house tool for superposing binding sites, PocketAlign. The 27 site-types could be grouped further into 9 super-types by considering partial similarities in the sites, which indicated that the individual site-types comprise different combinations of one or more site features. A scan across PDB using the 27 structural motifs determined the motifs to be specific to NTP binding sites, and a computational alanine mutagenesis indicated that residues identified to be highly conserved in the motifs also contribute most to binding. Alternate orientations of the ligand in several site-types were observed and rationalized, indicating the possibility of some residues serving as anchors for NTP recognition. The presence of multiple site-types and the grouping of multiple folds into each site-type is strongly suggestive of convergent evolution. Knowledge of determinants obtained from this study will be useful for detecting function in unknown proteins. Proteins 2017; 85:1699-1712. © 2017 Wiley Periodicals, Inc.

  6. Discovery-2: an interactive resource for the rational selection and comparison of putative drug target proteins in malaria

    PubMed Central

    2013-01-01

    Background: Drug resistance to anti-malarial compounds remains a serious problem, with resistance to newer pharmaceuticals developing at an alarming rate. The development of new anti-malarials remains a priority, and the rational selection of putative targets is a key element of this process. Discovery-2 is an update of the original Discovery in silico resource for the rational selection of putative drug target proteins, enabling researchers to obtain information for a protein which may be useful for the selection of putative drug targets, and to perform advanced filtering of proteins encoded by the malaria genome based on a series of molecular properties. Methods: An updated in silico resource has been developed where researchers are able to mine information on malaria proteins and predicted ligands, as well as perform comparisons to the human and mosquito host characteristics. Protein properties used include: domains, motifs, EC numbers, GO terms, orthologs, protein-protein interactions, protein-ligand interactions. Newly added features include drugability measures from ChEMBL, automated literature relations and links to clinical trial information. Searching by chemical structure is also available. Results: The updated functionality of the Discovery-2 resource is presented, together with a detailed case study of the Plasmodium falciparum S-adenosyl-L-homocysteine hydrolase (PfSAHH) protein. A short example of a chemical search with pyrimethamine is also illustrated. Conclusion: The updated Discovery-2 resource allows researchers to obtain detailed properties of proteins from the malaria genome, which may be of interest in the target selection process, and to perform advanced filtering and selection of proteins based on a relevant range of molecular characteristics. PMID:23537208

  7. Alchemist multimodal workflows for diabetic retinopathy research, disease prevention and investigational drug discovery.

    PubMed

    Riposan, Adina; Taylor, Ian; Owens, David R; Rana, Omer; Conley, Edward C

    2007-01-01

    In this paper we present mechanisms for imaging and spectral data discovery, as applied to the early detection of pathologic mechanisms underlying diabetic retinopathy in research and clinical trial scenarios. We discuss the Alchemist framework, built using a generic peer-to-peer architecture, supporting distributed database queries and complex search algorithms based on workflow. The Alchemist is a domain-independent search mechanism that can be applied to search and data discovery scenarios in many areas. We illustrate Alchemist's ability to perform complex searches composed as a collection of peer-to-peer overlays, Grid-based services and workflows, e.g. applied to image and spectral data discovery, as applied to the early detection and prevention of retinal disease and investigational drug discovery. The Alchemist framework is built on top of decentralised technologies and uses industry standards such as Web services and SOAP for messaging.

  8. Synthesis of most polyene natural product motifs using just twelve building blocks and one coupling reaction

    PubMed Central

    Woerly, Eric M.; Roy, Jahnabi; Burke, Martin D.

    2014-01-01

    The inherent modularity of polypeptides, oligonucleotides, and oligosaccharides has been harnessed to achieve generalized building block-based synthesis platforms. Importantly, like these other targets, most small molecule natural products are biosynthesized via iterative coupling of bifunctional building blocks. This suggests that many small molecules also possess inherent modularity commensurate with systematic building block-based construction. Supporting this hypothesis, here we report that the polyene motifs found in >75% of all known polyene natural products can be synthesized using just 12 building blocks and one coupling reaction. Using the same general retrosynthetic algorithm and reaction conditions, this platform enabled the synthesis of a wide range of polyene frameworks covering all of this natural product chemical space, and first total syntheses of the polyene natural products asnipyrone B, physarigin A, and neurosporaxanthin β-D-glucopyranoside. Collectively, these results suggest the potential for a more generalized approach for making small molecules in the laboratory. PMID:24848233

  9. Unlocking the chemotherapeutic potential of beta-aminovinyl ketones and related compounds.

    PubMed

    Gaber, Hatem M; Bagley, Mark C

    2009-07-01

    The role of beta-aminovinyl ketones as synthetic intermediates has been well categorised, but recent developments have shown an interesting array of applications and new chemotherapeutic potential, both in the preparation of biologically active heterocycles and as pharmacophores in their own right. Medicinal chemists are accustomed to using the products of Knoevenagel-type condensations as auxiliaries for the synthesis of N-containing heteroaromatic compounds. One example of these chemical building blocks is the beta-aminovinyl ketones, valuable synthetic intermediates that have been used in the preparation of pyridines, pyrimidines, pyrazoles, and many other heterocyclic motifs. This review highlights their recent use in the synthesis of biologically active targets as part of drug discovery programmes and in natural product synthesis. However, it is becoming increasingly evident that the enaminone motif may serve as a therapeutic pharmacophore in its own right. This review highlights the range of biological responses that beta-aminovinyl ketones elicit, including as antitumour, antibacterial, and anticonvulsant agents. Thus, with a broad spectrum of biological properties and as versatile chemical intermediates, it is clear that beta-aminovinyl ketones offer great potential in the search for new chemotherapeutic agents.

  10. Production of ultrasonic vocalizations by Peromyscus mice in the wild

    PubMed Central

    Kalcounis-Rueppell, Matina C; Metheny, Jackie D; Vonhof, Maarten J

    2006-01-01

    Background: There has been considerable research on rodent ultrasound in the laboratory and these sounds have been well quantified and characterized. Despite the value of research on ultrasound produced by mice in the lab, it is unclear if, and when, these sounds are produced in the wild, and how they function in natural habitats. Results: We have made the first recordings of ultrasonic vocalizations produced by two free-living species of mice in the genus Peromyscus (P. californicus and P. boylii) on long term study grids in California. Over 6 nights, we recorded 65 unique ultrasonic vocalization phrases from Peromyscus. The ultrasonic vocalizations we recorded represent 7 different motifs. Within each motif, there was considerable variation in the acoustic characteristics suggesting individual and contextual variation in the production of ultrasound by these species. Conclusion: The discovery of the production of ultrasonic vocalizations by Peromyscus in the wild highlights an underappreciated component in the behavior of these model organisms. The ability to examine the production of ultrasonic vocalizations in the wild offers excellent opportunities to test hypotheses regarding the function of ultrasound produced by rodents in a natural context. PMID:16507093

  11. Mammalian Protein Arginine Methyltransferase 7 (PRMT7) Specifically Targets RXR Sites in Lysine- and Arginine-rich Regions

    PubMed Central

    Feng, You; Maity, Ranjan; Whitelegge, Julian P.; Hadjikyriacou, Andrea; Li, Ziwei; Zurita-Lopez, Cecilia; Al-Hadid, Qais; Clark, Amander T.; Bedford, Mark T.; Masson, Jean-Yves; Clarke, Steven G.

    2013-01-01

    The mammalian protein arginine methyltransferase 7 (PRMT7) has been implicated in roles of transcriptional regulation, DNA damage repair, RNA splicing, cell differentiation, and metastasis. However, the type of reaction that it catalyzes and its substrate specificity remain controversial. In this study, we purified a recombinant mouse PRMT7 expressed in insect cells that demonstrates a robust methyltransferase activity. Using a variety of substrates, we demonstrate that the enzyme only catalyzes the formation of ω-monomethylarginine residues, and we confirm its activity as the prototype type III protein arginine methyltransferase. This enzyme is active on all recombinant human core histones, but histone H2B is a highly preferred substrate. Analysis of the specific methylation sites within intact histone H2B and within H2B and H4 peptides revealed novel post-translational modification sites and a unique specificity of PRMT7 for methylating arginine residues in lysine- and arginine-rich regions. We demonstrate that a prominent substrate recognition motif consists of a pair of arginine residues separated by one residue (RXR motif). These findings will significantly accelerate substrate profile analysis, biological function study, and inhibitor discovery for PRMT7. PMID:24247247
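
    The RXR recognition motif described here (two arginines separated by any single residue) maps directly onto a short pattern search. The following is a minimal sketch, assuming one-letter amino acid codes; the function name and the toy peptide are hypothetical and are not taken from the study.

```python
import re

# R, any one residue, R. The zero-width lookahead reports overlapping
# sites too, which matters in arginine-rich stretches such as "RSRKR".
RXR = re.compile(r"(?=(R[A-Z]R))")

def find_rxr_sites(seq):
    """Return (0-based start, tripeptide) for every RXR occurrence."""
    return [(m.start(), m.group(1)) for m in RXR.finditer(seq.upper())]

# Toy peptide made up for illustration (not a histone sequence).
sites = find_rxr_sites("AKRSRKRGA")
print(sites)
```

    A plain non-overlapping search would report only the first of two sites that share an arginine, which is why the lookahead form is used in this sketch.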

  12. Mammalian protein arginine methyltransferase 7 (PRMT7) specifically targets RXR sites in lysine- and arginine-rich regions.

    PubMed

    Feng, You; Maity, Ranjan; Whitelegge, Julian P; Hadjikyriacou, Andrea; Li, Ziwei; Zurita-Lopez, Cecilia; Al-Hadid, Qais; Clark, Amander T; Bedford, Mark T; Masson, Jean-Yves; Clarke, Steven G

    2013-12-27

    The mammalian protein arginine methyltransferase 7 (PRMT7) has been implicated in roles of transcriptional regulation, DNA damage repair, RNA splicing, cell differentiation, and metastasis. However, the type of reaction that it catalyzes and its substrate specificity remain controversial. In this study, we purified a recombinant mouse PRMT7 expressed in insect cells that demonstrates a robust methyltransferase activity. Using a variety of substrates, we demonstrate that the enzyme only catalyzes the formation of ω-monomethylarginine residues, and we confirm its activity as the prototype type III protein arginine methyltransferase. This enzyme is active on all recombinant human core histones, but histone H2B is a highly preferred substrate. Analysis of the specific methylation sites within intact histone H2B and within H2B and H4 peptides revealed novel post-translational modification sites and a unique specificity of PRMT7 for methylating arginine residues in lysine- and arginine-rich regions. We demonstrate that a prominent substrate recognition motif consists of a pair of arginine residues separated by one residue (RXR motif). These findings will significantly accelerate substrate profile analysis, biological function study, and inhibitor discovery for PRMT7.

  13. Harnessing the potential of natural products in drug discovery from a cheminformatics vantage point.

    PubMed

    Rodrigues, Tiago

    2017-11-15

    Natural products (NPs) present a privileged source of inspiration for chemical probe and drug design. Despite the biological pre-validation of the underlying molecular architectures and their relevance in drug discovery, the poor accessibility to NPs, complexity of the synthetic routes and scarce knowledge of their macromolecular counterparts in phenotypic screens still hinder their broader exploration. Cheminformatics algorithms now provide a powerful means of circumventing the abovementioned challenges and unlocking the full potential of NPs in a drug discovery context. Herein, I discuss recent advances in the computer-assisted design of NP mimics and how artificial intelligence may accelerate future NP-inspired molecular medicine.

  14. Computational strategies for the automated design of RNA nanoscale structures from building blocks using NanoTiler.

    PubMed

    Bindewald, Eckart; Grunewald, Calvin; Boyle, Brett; O'Connor, Mary; Shapiro, Bruce A

    2008-10-01

    One approach to designing RNA nanoscale structures is to use known RNA structural motifs such as junctions, kissing loops or bulges and to construct a molecular model by connecting these building blocks with helical struts. We previously developed an algorithm for detecting internal loops, junctions and kissing loops in RNA structures. Here we present algorithms for automating or assisting many of the steps that are involved in creating RNA structures from building blocks: (1) assembling building blocks into nanostructures using either a combinatorial search or constraint satisfaction; (2) optimizing RNA 3D ring structures to improve ring closure; (3) sequence optimisation; (4) creating a unique non-degenerate RNA topology descriptor. This effectively creates a computational pipeline for generating molecular models of RNA nanostructures and more specifically RNA ring structures with optimized sequences from RNA building blocks. We show several examples of how the algorithms can be utilized to generate RNA tecto-shapes.

  15. Computational strategies for the automated design of RNA nanoscale structures from building blocks using NanoTiler

    PubMed Central

    Bindewald, Eckart; Grunewald, Calvin; Boyle, Brett; O’Connor, Mary; Shapiro, Bruce A.

    2013-01-01

    One approach to designing RNA nanoscale structures is to use known RNA structural motifs such as junctions, kissing loops or bulges and to construct a molecular model by connecting these building blocks with helical struts. We previously developed an algorithm for detecting internal loops, junctions and kissing loops in RNA structures. Here we present algorithms for automating or assisting many of the steps that are involved in creating RNA structures from building blocks: (1) assembling building blocks into nanostructures using either a combinatorial search or constraint satisfaction; (2) optimizing RNA 3D ring structures to improve ring closure; (3) sequence optimisation; (4) creating a unique non-degenerate RNA topology descriptor. This effectively creates a computational pipeline for generating molecular models of RNA nanostructures and more specifically RNA ring structures with optimized sequences from RNA building blocks. We show several examples of how the algorithms can be utilized to generate RNA tecto-shapes. PMID:18838281

  16. Uniform, optimal signal processing of mapped deep-sequencing data.

    PubMed

    Kumar, Vibhor; Muratani, Masafumi; Rayan, Nirmala Arul; Kraus, Petra; Lufkin, Thomas; Ng, Huck Hui; Prabhakar, Shyam

    2013-07-01

    Despite their apparent diversity, many problems in the analysis of high-throughput sequencing data are merely special cases of two general problems, signal detection and signal estimation. Here we adapt formally optimal solutions from signal processing theory to analyze signals of DNA sequence reads mapped to a genome. We describe DFilter, a detection algorithm that identifies regulatory features in ChIP-seq, DNase-seq and FAIRE-seq data more accurately than assay-specific algorithms. We also describe EFilter, an estimation algorithm that accurately predicts mRNA levels from as few as 1-2 histone profiles (R ∼0.9). Notably, the presence of regulatory motifs in promoters correlates more with histone modifications than with mRNA levels, suggesting that histone profiles are more predictive of cis-regulatory mechanisms. We show by applying DFilter and EFilter to embryonic forebrain ChIP-seq data that regulatory protein identification and functional annotation are feasible despite tissue heterogeneity. The mathematical formalism underlying our tools facilitates integrative analysis of data from virtually any sequencing-based functional profile.

  17. Approximate matching of structured motifs in DNA sequences.

    PubMed

    El-Mabrouk, Nadia; Raffinot, Mathieu; Duchesne, Jean-Eudes; Lajoie, Mathieu; Luc, Nicolas

    2005-04-01

    Several methods have been developed for identifying more or less complex RNA structures in a genome. All these methods are based on the search for conserved primary and secondary sub-structures. In this paper, we present a simple formal representation of a helix, which is a combination of sequence and folding constraints, as a constrained regular expression. This representation allows us to develop a well-founded algorithm that searches for all approximate matches of a helix in a genome. The algorithm is based on an alignment graph constructed from several copies of a pushdown automaton, arranged one on top of another. This is a first attempt to take advantage of the possibilities of pushdown automata in the context of approximate matching. The worst-case time complexity is O(krpn), where k is the error threshold, n the size of the genome, p the size of the secondary expression, and r its number of union symbols. We then extend the algorithm to search for pseudo-knots and secondary structures containing an arbitrary number of helices.

  18. False Discovery Control in Large-Scale Spatial Multiple Testing

    PubMed Central

    Sun, Wenguang; Reich, Brian J.; Cai, T. Tony; Guindani, Michele; Schwartzman, Armin

    2014-01-01

    This article develops a unified theoretical and computational framework for false discovery control in multiple testing of spatial signals. We consider both point-wise and cluster-wise spatial analyses, and derive oracle procedures which optimally control the false discovery rate, false discovery exceedance and false cluster rate, respectively. A data-driven finite approximation strategy is developed to mimic the oracle procedures on a continuous spatial domain. Our multiple testing procedures are asymptotically valid and can be effectively implemented using Bayesian computational algorithms for analysis of large spatial data sets. Numerical results show that the proposed procedures lead to more accurate error control and better power performance than conventional methods. We demonstrate our methods for analyzing the time trends in tropospheric ozone in the eastern US. PMID:25642138

  19. Molecular population dynamics of DNA structures in a bcl-2 promoter sequence is regulated by small molecules and the transcription factor hnRNP LL

    PubMed Central

    Cui, Yunxi; Koirala, Deepak; Kang, HyunJin; Dhakal, Soma; Yangyuoru, Philip; Hurley, Laurence H.; Mao, Hanbin

    2014-01-01

    Minute differences in the free energy change of unfolding among structures in an oligonucleotide sequence can lead to a complex population equilibrium, which is rather challenging for ensemble techniques to decipher. Herein, we introduce a new method, molecular population dynamics (MPD), to describe the intricate equilibrium among non-B deoxyribonucleic acid (DNA) structures. Using mechanical unfolding in laser tweezers, we identified six DNA species in a cytosine (C)-rich bcl-2 promoter sequence. Population patterns of these species with and without a small molecule (IMC-76 or IMC-48) or the transcription factor hnRNP LL are compared to reveal the MPD of different species. With a pattern recognition algorithm, we found that IMC-48 and hnRNP LL share 80% similarity in stabilizing i-motifs with 60 s incubation. In contrast, IMC-76 demonstrates the opposite behavior, preferring flexible DNA hairpins. With 120–180 s incubation, IMC-48 and hnRNP LL destabilize i-motifs, which has been previously proposed to activate bcl-2 transcription. These results provide strong support, from the population equilibrium perspective, that small molecules and hnRNP LL can modulate bcl-2 transcription through interaction with i-motifs. The excellent agreement with biochemical results firmly validates the MPD analyses, which, we expect, can be widely applicable to investigating complex equilibria of biomacromolecules. PMID:24609386

  20. Building a dictionary for genomes: Identification of presumptive regulatory sites by statistical analysis

    PubMed Central

    Bussemaker, Harmen J.; Li, Hao; Siggia, Eric D.

    2000-01-01

    The availability of complete genome sequences and mRNA expression data for all genes creates new opportunities and challenges for identifying DNA sequence motifs that control gene expression. An algorithm, “MobyDick,” is presented that decomposes a set of DNA sequences into the most probable dictionary of motifs or words. This method is applicable to any set of DNA sequences: for example, all upstream regions in a genome or all genes expressed under certain conditions. Identification of words is based on a probabilistic segmentation model in which the significance of longer words is deduced from the frequency of shorter ones of various lengths, eliminating the need for a separate set of reference data to define probabilities. We have built a dictionary with 1,200 words for the 6,000 upstream regulatory regions in the yeast genome; the 500 most significant words (some with as few as 10 copies in all of the upstream regions) match 114 of 443 experimentally determined sites (a significance level of 18 standard deviations). When analyzing all of the genes up-regulated during sporulation as a group, we find many motifs in addition to the few previously identified by analyzing the expression subclusters individually. Applying MobyDick to the genes derepressed when the general repressor Tup1 is deleted, we find known as well as putative binding sites for its regulatory partners. PMID:10944202
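
    Illustrative sketch (not taken from the paper's code): the likelihood computation behind a probabilistic segmentation model of this kind, assuming that a sequence is generated by concatenating words drawn independently from the dictionary, with no delimiters between them. The function name and the toy dictionary are hypothetical; fitting the word probabilities and testing the significance of longer words are not shown.

```python
def segmentation_likelihood(seq, word_probs):
    """Sum, over every way of cutting `seq` into dictionary words,
    of the product of the word probabilities (dynamic programming)."""
    # z[i] holds the total probability of all segmentations of seq[:i].
    z = [0.0] * (len(seq) + 1)
    z[0] = 1.0  # the empty prefix has exactly one (empty) segmentation
    for end in range(1, len(seq) + 1):
        for word, p in word_probs.items():
            start = end - len(word)
            # A word can close the prefix seq[:end] only if it matches there.
            if start >= 0 and seq[start:end] == word:
                z[end] += p * z[start]
    return z[-1]


# Toy dictionary (assumed values): two single letters and one longer word.
toy_dictionary = {"A": 0.4, "T": 0.4, "AT": 0.2}
likelihood = segmentation_likelihood("ATA", toy_dictionary)
```

    For "ATA" the two admissible segmentations are A|T|A and AT|A, so the likelihood is 0.4 * 0.4 * 0.4 + 0.2 * 0.4 = 0.144. The sum grows only linearly in sequence length for a fixed dictionary, which is what makes a dictionary of over a thousand words tractable in principle.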

  1. Pattern Discovery and Change Detection of Online Music Query Streams

    NASA Astrophysics Data System (ADS)

    Li, Hua-Fu

    In this paper, an efficient stream mining algorithm, called FTP-stream (Frequent Temporal Pattern mining of streams), is proposed to find the frequent temporal patterns over melody sequence streams. In the framework of our proposed algorithm, an effective bit-sequence representation is used to reduce the time and memory needed to slide the windows. The FTP-stream algorithm can calculate the support threshold in only a single pass based on the concept of bit-sequence representation. It takes advantage of the "left" and "and" operations of the representation. Experiments show that the proposed algorithm scans the music query stream only once, and runs significantly faster and consumes less memory than existing algorithms, such as SWFI-stream and Moment.
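
    Illustrative sketch (an assumption, not the FTP-stream implementation): one common way a bit-sequence representation supports sliding windows is to keep one integer per item, shift it left when the window slides, and obtain the support of a pattern by AND-ing the items' sequences and counting set bits. The class name `BitWindow` is hypothetical, and each window slot is simplified to an unordered set of symbols rather than a temporal pattern.

```python
from collections import defaultdict


class BitWindow:
    """Sliding window over transactions, one integer bit-sequence per item."""

    def __init__(self, width):
        self.width = width
        self.mask = (1 << width) - 1   # keeps only the newest `width` bits
        self.bits = defaultdict(int)   # item -> bit-sequence

    def slide(self, transaction):
        """Left-shift every sequence (the oldest bit falls off), then mark new items."""
        for item in list(self.bits):
            self.bits[item] = (self.bits[item] << 1) & self.mask
            # Forget items that no longer occur anywhere in the window.
            if self.bits[item] == 0 and item not in transaction:
                del self.bits[item]
        for item in transaction:
            self.bits[item] |= 1       # lowest bit = newest transaction

    def support(self, itemset):
        """Number of window transactions containing every item: AND, then popcount."""
        combined = self.mask
        for item in itemset:
            combined &= self.bits.get(item, 0)
        return bin(combined).count("1")


window = BitWindow(width=3)
for notes in [{"C", "E"}, {"C", "G"}, {"E", "G"}, {"C", "E", "G"}]:
    window.slide(notes)
```

    After the four updates the window holds the last three transactions, so `window.support({"C"})` is 2 and `window.support({"C", "E"})` is 1. Each slide touches every tracked item once and never rescans earlier transactions, which is consistent with the single-pass behaviour the abstract describes.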

  2. iPTF Discoveries of Recent Type Ia Supernovae

    NASA Astrophysics Data System (ADS)

    Petrushevska, T.; Ferretti, R.; Fremling, C.; Hangard, L.; Johansson, J.; Migotto, K.; Nyholm, A.; Papadogiannakis, S.; Ben-Ami, S.; De Cia, A.; Dzigan, Y.; Horesh, A.; Leloudas, G.; Manulis, I.; Rubin, A.; Sagiv, I.; Vreeswijk, P.; Yaron, O.; Cao, Y.; Perley, D.; Miller, A.; Waszczak, A.; Kasliwal, M. M.; Hosseinzadeh, G.; Cenko, S. B.; Quimby, R.

    2015-05-01

    The intermediate Palomar Transient Factory (ATel #4807) reports the discovery and classification of the following Type Ia SNe. Our automated candidate vetting to distinguish a real astrophysical source (1.0) from bogus artifacts (0.0) is powered by three generations of machine learning algorithms: RB2 (Brink et al. 2013MNRAS.435.1047B), RB4 (Rebbapragada et al. 2015AAS...22543402R) and RB5 (Wozniak et al. 2013AAS...22143105W). See ATel #7112 for additional details.

  3. Socio-Culturally Oriented Plan Discovery Environment (SCOPE)

    DTIC Science & Technology

    2005-05-01

    U.S. intelligence methods (Dr. George Friedman (2003) Saddam Hussein and the Dollar War. THE STRATFOR WEEKLY 18 December) 8 2.2. Evidence... 2003). In the EAGLE setting, we are using a modified version of the fuzzy segmentation algorithm developed by Udupa and his associates to...based (Fu, et al, 2003) and a cognitive model based (Eilbert, et al., 2002) algorithms, and a method for combining the results. (The method for

  4. Guided Discovery, Visualization, and Technology Applied to the New Curriculum for Secondary Mathematics.

    ERIC Educational Resources Information Center

    Smith, Karan B.

    1996-01-01

    Presents activities which highlight major concepts of linear programming. Demonstrates how technology allows students to solve linear programming problems using exploration prior to learning algorithmic methods. (DDR)

  5. Zebra Crossing Spotter: Automatic Population of Spatial Databases for Increased Safety of Blind Travelers

    PubMed Central

    Ahmetovic, Dragan; Manduchi, Roberto; Coughlan, James M.; Mascetti, Sergio

    2016-01-01

    In this paper we propose a computer vision-based technique that mines existing spatial image databases for discovery of zebra crosswalks in urban settings. Knowing the location of crosswalks is critical for a blind person planning a trip that includes street crossing. By augmenting existing spatial databases (such as Google Maps or OpenStreetMap) with this information, a blind traveler may make more informed routing decisions, resulting in greater safety during independent travel. Our algorithm first searches for zebra crosswalks in satellite images; all candidates thus found are validated against spatially registered Google Street View images. This cascaded approach enables fast and reliable discovery and localization of zebra crosswalks in large image datasets. While fully automatic, our algorithm could also be complemented by a final crowdsourcing validation stage for increased accuracy. PMID:26824080

  6. Automated In Vivo Platform for the Discovery of Functional Food Treatments of Hypercholesterolemia

    PubMed Central

    Littleton, Robert M.; Haworth, Kevin J.; Tang, Hong; Setchell, Kenneth D. R.; Nelson, Sandra; Hove, Jay R.

    2013-01-01

    The zebrafish is becoming an increasingly popular model system for both automated drug discovery and investigating hypercholesterolemia. Here we combine these aspects and for the first time develop an automated high-content confocal assay for treatments of hypercholesterolemia. We also create two algorithms for automated analysis of cardiodynamic data acquired by high-speed confocal microscopy. The first algorithm computes cardiac parameters solely from the frequency-domain representation of cardiodynamic data while the second uses both frequency- and time-domain data. The combined approach resulted in smaller differences relative to manual measurements. The methods are implemented to test the ability of a methanolic extract of the hawthorn plant (Crataegus laevigata) to treat hypercholesterolemia and its peripheral cardiovascular effects. Results demonstrate the utility of these methods and suggest the extract has both antihypercholesterolemic and positively inotropic properties. PMID:23349685
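
    Illustrative sketch (an assumption about what a frequency-domain cardiac parameter could look like, not the authors' algorithm): heart rate can be estimated as the strongest non-DC frequency of a periodic intensity trace. The function names, the synthetic trace, and the choice of heart rate as the parameter are all hypothetical; a plain discrete Fourier transform is used so the example needs only the standard library.

```python
import cmath
import math


def dominant_frequency(samples, sample_rate):
    """Return the frequency (Hz) of the strongest non-DC component of a real signal."""
    n = len(samples)
    mean = sum(samples) / n
    centered = [s - mean for s in samples]  # remove the constant offset
    best_bin, best_power = 0, -1.0
    for k in range(1, n // 2 + 1):
        # Plain discrete Fourier transform coefficient for frequency bin k.
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_bin, best_power = k, power
    return best_bin * sample_rate / n


def heart_rate_bpm(samples, sample_rate):
    """Convert the dominant frequency into beats per minute."""
    return 60.0 * dominant_frequency(samples, sample_rate)


# Synthetic trace: a 2.5 Hz oscillation sampled at 50 frames per second for 4 s.
frame_rate = 50.0
trace = [1.0 + 0.3 * math.sin(2 * math.pi * 2.5 * t / frame_rate)
         for t in range(200)]
bpm = heart_rate_bpm(trace, frame_rate)
```

    The frequency resolution of such an estimate is the frame rate divided by the number of frames (0.25 Hz here), which is one reason a method might also draw on time-domain data, as the second algorithm in the abstract does.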

  7. Automated in vivo platform for the discovery of functional food treatments of hypercholesterolemia.

    PubMed

    Littleton, Robert M; Haworth, Kevin J; Tang, Hong; Setchell, Kenneth D R; Nelson, Sandra; Hove, Jay R

    2013-01-01

    The zebrafish is becoming an increasingly popular model system for both automated drug discovery and investigating hypercholesterolemia. Here we combine these aspects and for the first time develop an automated high-content confocal assay for treatments of hypercholesterolemia. We also create two algorithms for automated analysis of cardiodynamic data acquired by high-speed confocal microscopy. The first algorithm computes cardiac parameters solely from the frequency-domain representation of cardiodynamic data while the second uses both frequency- and time-domain data. The combined approach resulted in smaller differences relative to manual measurements. The methods are implemented to test the ability of a methanolic extract of the hawthorn plant (Crataegus laevigata) to treat hypercholesterolemia and its peripheral cardiovascular effects. Results demonstrate the utility of these methods and suggest the extract has both antihypercholesterolemic and positively inotropic properties.

  8. The mass-action law based algorithm for cost-effective approach for cancer drug discovery and development.

    PubMed

    Chou, Ting-Chao

    2011-01-01

    The mass-action law based system analysis via mathematical induction and deduction leads to the generalized theory and algorithm that allows computerized simulation of dose-effect dynamics with small size experiments using a small number of data points in vitro, in animals, and in humans. The median-effect equation of the mass-action law deduced from over 300 mechanism-specific equations has been shown to be the unified theory that serves as the common-link for complicated biomedical systems. After using the median-effect principle as the common denominator, its applications are mechanism-independent, drug unit-independent, and dynamic order-independent; and can be used generally for single drug analysis or for multiple drug combinations in constant-ratio or non-constant ratios. Since the "median" is the common link and universal reference point in biological systems, these general enabling lead to computerized quantitative bio-informatics for econo-green bio-research in broad disciplines. Specific applications of the theory, especially relevant to drug discovery, drug combination, and clinical trials, have been cited or illustrated in terms of algorithms, experimental design and computerized simulation for data analysis. Lessons learned from cancer research during the past fifty years provide a valuable opportunity to reflect, and to improve the conventional divergent approach and to introduce a new convergent avenue, based on the mass-action law principle, for efficient cancer drug discovery and low-cost drug development.
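
    Illustrative sketch: the median-effect equation relates the fraction affected fa and the fraction unaffected fu = 1 - fa to the dose D through fa/fu = (D/Dm)^m, where Dm is the median-effect dose and m the slope of the dose-effect curve. Taking logarithms gives a straight line, log(fa/fu) = m log D - m log Dm, so both parameters can be estimated from a few data points by ordinary least squares. The function names and the synthetic doses below are hypothetical.

```python
import math


def fraction_affected(dose, dm, m):
    """Median-effect equation solved for fa: fa = 1 / (1 + (Dm / D)^m)."""
    return 1.0 / (1.0 + (dm / dose) ** m)


def fit_median_effect(doses, fa_values):
    """Least-squares fit of the median-effect plot; returns (Dm, m).

    x = log10(D), y = log10(fa / (1 - fa)); slope = m, intercept = -m * log10(Dm).
    Every fa value must lie strictly between 0 and 1.
    """
    xs = [math.log10(d) for d in doses]
    ys = [math.log10(f / (1.0 - f)) for f in fa_values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    dm = 10 ** (-intercept / slope)
    return dm, slope


# Synthetic, noise-free data generated with Dm = 3.0 and m = 1.5 (assumed values).
doses = [1.0, 2.0, 4.0, 8.0]
effects = [fraction_affected(d, dm=3.0, m=1.5) for d in doses]
dm_hat, m_hat = fit_median_effect(doses, effects)
```

    With noise-free input the fit recovers the generating parameters, and at D = Dm the equation gives fa = 0.5 by construction, which is the sense in which the median serves as the reference point.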

  9. The mass-action law based algorithm for cost-effective approach for cancer drug discovery and development

    PubMed Central

    Chou, Ting-Chao

    2011-01-01

    The mass-action law based system analysis via mathematical induction and deduction leads to the generalized theory and algorithm that allows computerized simulation of dose-effect dynamics with small size experiments using a small number of data points in vitro, in animals, and in humans. The median-effect equation of the mass-action law deduced from over 300 mechanism-specific equations has been shown to be the unified theory that serves as the common-link for complicated biomedical systems. After using the median-effect principle as the common denominator, its applications are mechanism-independent, drug unit-independent, and dynamic order-independent; and can be used generally for single drug analysis or for multiple drug combinations in constant-ratio or non-constant ratios. Since the “median” is the common link and universal reference point in biological systems, these general enabling lead to computerized quantitative bio-informatics for econo-green bio-research in broad disciplines. Specific applications of the theory, especially relevant to drug discovery, drug combination, and clinical trials, have been cited or illustrated in terms of algorithms, experimental design and computerized simulation for data analysis. Lessons learned from cancer research during the past fifty years provide a valuable opportunity to reflect, and to improve the conventional divergent approach and to introduce a new convergent avenue, based on the mass-action law principle, for efficient cancer drug discovery and low-cost drug development. PMID:22016837

  10. PCM-SABRE: a platform for benchmarking and comparing outcome prediction methods in precision cancer medicine.

    PubMed

    Eyal-Altman, Noah; Last, Mark; Rubin, Eitan

    2017-01-17

    Numerous publications attempt to predict cancer survival outcome from gene expression data using machine-learning methods. A direct comparison of these works is challenging for the following reasons: (1) inconsistent measures used to evaluate the performance of different models, and (2) incomplete specification of critical stages in the process of knowledge discovery. There is a need for a platform that would allow researchers to replicate previous works and to test the impact of changes in the knowledge discovery process on the accuracy of the induced models. We developed the PCM-SABRE platform, which supports the entire knowledge discovery process for cancer outcome analysis. PCM-SABRE was developed using KNIME. By using PCM-SABRE to reproduce the results of previously published works on breast cancer survival, we define a baseline for evaluating future attempts to predict cancer outcome with machine learning. We used PCM-SABRE to replicate previous works that describe predictive models of breast cancer recurrence, and tested the performance of all possible combinations of feature selection methods and data mining algorithms that were used in either of the works. We reconstructed the work of Chou et al., observing similar trends: superior performance of Probabilistic Neural Network (PNN) and logistic regression (LR) algorithms and inconclusive impact of feature pre-selection with the decision tree algorithm on subsequent analysis. PCM-SABRE is a software tool that provides an intuitive environment for rapid development of predictive models in cancer precision medicine.

  11. Discovery and prioritization of somatic mutations in diffuse large B-cell lymphoma (DLBCL) by whole-exome sequencing

    PubMed Central

    Lohr, Jens G.; Stojanov, Petar; Lawrence, Michael S.; Auclair, Daniel; Chapuy, Bjoern; Sougnez, Carrie; Cruz-Gordillo, Peter; Knoechel, Birgit; Asmann, Yan W.; Slager, Susan L.; Novak, Anne J.; Dogan, Ahmet; Ansell, Stephen M.; Zou, Lihua; Gould, Joshua; Saksena, Gordon; Stransky, Nicolas; Rangel-Escareño, Claudia; Fernandez-Lopez, Juan Carlos; Hidalgo-Miranda, Alfredo; Melendez-Zajgla, Jorge; Hernández-Lemus, Enrique; Schwarz-Cruz y Celis, Angela; Imaz-Rosshandler, Ivan; Ojesina, Akinyemi I.; Jung, Joonil; Pedamallu, Chandra S.; Lander, Eric S.; Habermann, Thomas M.; Cerhan, James R.; Shipp, Margaret A.; Getz, Gad; Golub, Todd R.

    2012-01-01

    To gain insight into the genomic basis of diffuse large B-cell lymphoma (DLBCL), we performed massively parallel whole-exome sequencing of 55 primary tumor samples from patients with DLBCL and matched normal tissue. We identified recurrent mutations in genes that are well known to be functionally relevant in DLBCL, including MYD88, CARD11, EZH2, and CREBBP. We also identified somatic mutations in genes for which a functional role in DLBCL has not been previously suspected. These genes include MEF2B, MLL2, BTG1, GNA13, ACTB, P2RY8, PCLO, and TNFRSF14. Further, we show that BCL2 mutations commonly occur in patients with BCL2/IgH rearrangements as a result of somatic hypermutation normally occurring at the IgH locus. The BCL2 point mutations are primarily synonymous, and likely caused by activation-induced cytidine deaminase–mediated somatic hypermutation, as shown by comprehensive analysis of enrichment of mutations in WRCY target motifs. Those nonsynonymous mutations that are observed tend to be found outside of the functionally important BH domains of the protein, suggesting that strong negative selection against BCL2 loss-of-function mutations is at play. Last, by using an algorithm designed to identify likely functionally relevant but infrequent mutations, we identify KRAS, BRAF, and NOTCH1 as likely drivers of DLBCL pathogenesis in some patients. Our data provide an unbiased view of the landscape of mutations in DLBCL, and this in turn may point toward new therapeutic strategies for the disease. PMID:22343534

  12. Firefly Algorithm for Structural Search.

    PubMed

    Avendaño-Franco, Guillermo; Romero, Aldo H

    2016-07-12

    The problem of computational structure prediction of materials is approached using the firefly (FF) algorithm. Starting from the chemical composition and optionally using prior knowledge of similar structures, the FF method is able to predict not only known stable structures but also a variety of novel competitive metastable structures. This article focuses on the strengths and limitations of the algorithm as a multimodal global searcher. The algorithm has been implemented in the software package PyChemia ( https://github.com/MaterialsDiscovery/PyChemia ), an open source Python library for materials analysis. We present applications of the method to van der Waals clusters and crystal structures. The FF method is shown to be competitive when compared to other population-based global searchers.

  13. Start-ups Bring AI to Pathology.

    PubMed

    2018-04-01

    New startups are developing pattern-recognition algorithms that could one day help pathologists more accurately spot tumors on digitized tissue images, thereby aiding in diagnosis, treatment, drug discovery, and more. ©2018 American Association for Cancer Research.

  14. Knowledge discovery with classification rules in a cardiovascular dataset.

    PubMed

    Podgorelec, Vili; Kokol, Peter; Stiglic, Milojka Molan; Hericko, Marjan; Rozman, Ivan

    2005-12-01

    In this paper we study an evolutionary machine learning approach to data mining and knowledge discovery based on the induction of classification rules. A method for automatic rule induction called AREX, using evolutionary induction of decision trees and automatic programming, is introduced. The proposed algorithm is applied to a cardiovascular dataset consisting of different groups of attributes which should possibly reveal the presence of some specific cardiovascular problems in young patients. A case study is presented that shows the use of AREX for the classification of patients and for discovering possible new medical knowledge from the dataset. The defined knowledge discovery loop comprises a medical expert's assessment of induced rules to drive the evolution of rule sets towards more appropriate solutions. The final result is the discovery of possible new medical knowledge in the field of pediatric cardiology.

  15. Search for 5'-leader regulatory RNA structures based on gene annotation aided by the RiboGap database.

    PubMed

    Naghdi, Mohammad Reza; Smail, Katia; Wang, Joy X; Wade, Fallou; Breaker, Ronald R; Perreault, Jonathan

    2017-03-15

    The discovery of noncoding RNAs (ncRNAs) and their importance for gene regulation led us to develop bioinformatics tools to pursue the discovery of novel ncRNAs. Finding ncRNAs de novo is challenging, first due to the difficulty of retrieving large numbers of sequences for given gene activities, and second due to exponential demands on calculation needed for comparative genomics on a large scale. Recently, several tools for the prediction of conserved RNA secondary structure were developed, but many of them are not designed to uncover new ncRNAs, or are too slow for conducting analyses on a large scale. Here we present various approaches using the database RiboGap as a primary tool for finding known ncRNAs and for uncovering simple sequence motifs with regulatory roles. This database also can be used to easily extract intergenic sequences of eubacteria and archaea to find conserved RNA structures upstream of given genes. We also show how to extend analysis further to choose the best candidate ncRNAs for experimental validation. Copyright © 2017 Elsevier Inc. All rights reserved.

  16. Sparse Substring Pattern Set Discovery Using Linear Programming Boosting

    NASA Astrophysics Data System (ADS)

    Kashihara, Kazuaki; Hatano, Kohei; Bannai, Hideo; Takeda, Masayuki

    In this paper, we consider finding a small set of substring patterns which classifies the given documents well. We formulate the problem as a 1-norm soft margin optimization problem where each dimension corresponds to a substring pattern. Then we solve this problem by using LPBoost and an optimal substring discovery algorithm. Since the problem is a linear program, the resulting solution is likely to be sparse, which is useful for feature selection. We evaluate the proposed method on real data such as movie reviews.
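
    For orientation, the standard 1-norm soft margin linear program that LPBoost addresses can be written as follows. The notation is an assumption, since the abstract gives none: m labelled documents (x_i, y_i) with y_i in {-1, +1}, a set P of candidate substring patterns, a base classifier h_p(x) in {-1, +1} indicating whether pattern p occurs in x, non-negative pattern weights alpha_p, slack variables xi_i, margin rho, and a trade-off constant D > 0.

```latex
\begin{aligned}
\max_{\rho,\;\alpha,\;\xi}\quad & \rho - D \sum_{i=1}^{m} \xi_i \\
\text{s.t.}\quad & y_i \sum_{p \in P} \alpha_p\, h_p(x_i) + \xi_i \ge \rho, \qquad i = 1, \dots, m, \\
& \sum_{p \in P} \alpha_p = 1, \qquad \alpha_p \ge 0, \qquad \xi_i \ge 0 .
\end{aligned}
```

    Because an optimal solution of a linear program can be found at a vertex of the feasible region, only a limited number of the weights alpha_p need to be non-zero, which is the sparsity the abstract refers to. The paper's exact formulation may differ in detail.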

  17. iPTF discovery and identification of bright transients

    NASA Astrophysics Data System (ADS)

    Adams, Scott; Karamehmetoglu, Emir; Roy, Rupak; Neill, James D.; Walters, Richard; Cook, Dave; Kupfer, Thomas; Cannella, Chris; Blagorodnova, Nadejda; Yan, Lin; Kasliwal, Mansi; Kulkarni, Shri

    2017-02-01

    The intermediate Palomar Transient Factory (ATel #4807) reports the discovery of the following bright transients. We report as ATel alerts all objects brighter than 19 mag. Our discoveries are reported in two filters: sdss-g and Mould-I, denoted as g and I. All magnitudes are obtained using difference image photometry based on the PTFIDE pipeline described in Masci et al. 2016. Our automated candidate vetting to distinguish a real astrophysical source (1.0) from bogus artifacts (0.0) is powered by three generations of machine learning algorithms: RB2 (Brink et al. 2013MNRAS.435.1047B), RB4 (Rebbapragada et al. 2015AAS...22543402R), and RB5 (Wozniak et al. 2013AAS...22143105W).

  18. Protein interactions in 3D: from interface evolution to drug discovery.

    PubMed

    Winter, Christof; Henschel, Andreas; Tuukkanen, Anne; Schroeder, Michael

    2012-09-01

    Over the past 10 years, much research has been dedicated to the understanding of protein interactions. Large-scale experiments to elucidate the global structure of protein interaction networks have been complemented by detailed studies of protein interaction interfaces. Understanding the evolution of interfaces allows one to identify convergently evolved interfaces which are evolutionarily unrelated but share a few key residues and hence have common binding partners. Understanding interaction interfaces and their evolution is an important basis for pharmaceutical applications in drug discovery. Here, we review the algorithms and databases on 3D protein interactions and discuss in detail applications in interface evolution, drug discovery, and interface prediction. Copyright © 2012 Elsevier Inc. All rights reserved.

  19. Fast SS-ILM: A Computationally Efficient Algorithm to Discover Socially Important Locations

    NASA Astrophysics Data System (ADS)

    Dokuz, A. S.; Celik, M.

    2017-11-01

    Socially important locations are places that social media users visit frequently during their social media lifetime. Discovering socially important locations provides valuable information about user behaviour on social networking sites. However, discovering such locations is challenging due to data volume and dimensionality, spatial and temporal calculations, location sparseness in social media datasets, and the inefficiency of current algorithms. Several studies in the literature discover important locations, but the proposed approaches are not computationally efficient. In this study, we propose the Fast SS-ILM algorithm, a modification of the SS-ILM algorithm, to mine socially important locations efficiently. Experimental results show that the proposed Fast SS-ILM algorithm decreases the execution time of the socially important location discovery process by up to 20%.

  20. Wideband dichroic-filter design for LED-phosphor beam-combining

    DOEpatents

    Falicoff, Waqidi

    2010-12-28

    A general method is disclosed of designing two-component dichroic short-pass filters operable for incidence angle distributions over the 0-30° range, and specific preferred embodiments are listed. The method is based on computer optimization algorithms for an N-layer design, specifically the N-dimensional conjugate-gradient minimization of a merit function based on difference from a target transmission spectrum, as well as subsequent cycles of needle synthesis for increasing N. A key feature of the method is the initial filter design, upon which the algorithm proceeds to iterate successive design candidates with smaller merit functions. This initial design, with high-index material H and low-index L, is (0.75 H, 0.5 L, 0.75 H)^m, denoting m (20-30) repetitions of a three-layer motif, giving rise to a filter with N = 2m + 1 layers.
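
    Illustrative sketch: the layer count of the initial design follows from the fact that the trailing 0.75 H of one motif and the leading 0.75 H of the next are the same material and merge into a single 1.5 H layer. The function name is hypothetical, and the thickness values are simply the relative units given in the abstract.

```python
def expand_stack(m, motif=((0.75, "H"), (0.5, "L"), (0.75, "H"))):
    """Repeat a layer motif m times, merging adjacent layers of the same material."""
    layers = []
    for _ in range(m):
        for thickness, material in motif:
            if layers and layers[-1][1] == material:
                # Same material as the previous layer: extend it instead of adding one.
                layers[-1] = (layers[-1][0] + thickness, material)
            else:
                layers.append((thickness, material))
    return layers


two_periods = expand_stack(2)
# [(0.75, 'H'), (0.5, 'L'), (1.5, 'H'), (0.5, 'L'), (0.75, 'H')]
layer_counts = {m: len(expand_stack(m)) for m in (20, 25, 30)}
```

    Each repetition contributes one L layer and the H layers number one more than the L layers, so m repetitions give m + (m + 1) = 2m + 1 layers, that is 41 to 61 layers for m between 20 and 30.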

  1. Topology-Scaling Identification of Layered Solids and Stable Exfoliated 2D Materials.

    PubMed

    Ashton, Michael; Paul, Joshua; Sinnott, Susan B; Hennig, Richard G

    2017-03-10

    The Materials Project crystal structure database has been searched for materials possessing layered motifs in their crystal structures using a topology-scaling algorithm. The algorithm identifies and measures the sizes of bonded atomic clusters in a structure's unit cell, and determines their scaling with cell size. The search yielded 826 stable layered materials that are considered as candidates for the formation of two-dimensional monolayers via exfoliation. Density-functional theory was used to calculate the exfoliation energy of each material and 680 monolayers emerge with exfoliation energies below those of already-existent two-dimensional materials. The crystal structures of these two-dimensional materials provide templates for future theoretical searches of stable two-dimensional materials. The optimized structures and other calculated data for all 826 monolayers are provided at our database (https://materialsweb.org).
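
    Illustrative sketch (a toy version under stated assumptions, not the authors' implementation): the idea of measuring how the largest bonded cluster grows with cell size can be shown on an orthorhombic cell with a single distance cutoff for bonding. The function names, the cutoff, and the toy cells are hypothetical; periodic boundaries are handled with the minimum-image convention, and clusters are found with a small union-find.

```python
import itertools


def largest_cluster(cell, frac_coords, cutoff, n):
    """Size of the largest bonded cluster in an n x n x n periodic supercell."""
    lengths = [n * edge for edge in cell]
    atoms = [
        [(f + shift) * edge for f, shift, edge in zip(frac, shifts, cell)]
        for shifts in itertools.product(range(n), repeat=3)
        for frac in frac_coords
    ]
    parent = list(range(len(atoms)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in itertools.combinations(range(len(atoms)), 2):
        dist_sq = 0.0
        for k in range(3):
            d = abs(atoms[i][k] - atoms[j][k])
            d = min(d, lengths[k] - d)     # minimum-image distance along axis k
            dist_sq += d * d
        if dist_sq <= cutoff ** 2:
            parent[find(i)] = find(j)      # bonded: merge the two clusters

    sizes = {}
    for i in range(len(atoms)):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values())


def dimensionality(cell, frac_coords, cutoff):
    """Classify by how the largest cluster scales from a 1x1x1 to a 2x2x2 cell."""
    ratio = (largest_cluster(cell, frac_coords, cutoff, 2)
             / largest_cluster(cell, frac_coords, cutoff, 1))
    labels = {1: "0D (molecular)", 2: "1D (chain)",
              4: "2D (layered)", 8: "3D (network)"}
    return labels.get(round(ratio), "mixed")


# Toy cell: bonded neighbours 1.0 apart in-plane, planes 3.0 apart.
layered = dimensionality((1.0, 1.0, 3.0), [(0.0, 0.0, 0.0)], cutoff=1.1)
```

    Doubling the cell in every direction multiplies the largest cluster by 2^d for a motif that extends in d dimensions, so a ratio of 4 marks a layered candidate. A production screen would need element-specific bond criteria and general lattice vectors, which this sketch leaves out.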

  2. Amine-free melanin-concentrating hormone receptor 1 antagonists: Novel 1-(1H-benzimidazol-6-yl)pyridin-2(1H)-one derivatives and design to avoid CYP3A4 time-dependent inhibition.

    PubMed

    Igawa, Hideyuki; Takahashi, Masashi; Shirasaki, Mikio; Kakegawa, Keiko; Kina, Asato; Ikoma, Minoru; Aida, Jumpei; Yasuma, Tsuneo; Okuda, Shoki; Kawata, Yayoi; Noguchi, Toshihiro; Yamamoto, Syunsuke; Fujioka, Yasushi; Kundu, Mrinalkanti; Khamrai, Uttam; Nakayama, Masaharu; Nagisa, Yasutaka; Kasai, Shizuo; Maekawa, Tsuyoshi

    2016-06-01

    Melanin-concentrating hormone (MCH) is an attractive target for antiobesity agents, and numerous drug discovery programs are dedicated to finding small-molecule MCH receptor 1 (MCHR1) antagonists. We recently reported novel pyridine-2(1H)-ones as aliphatic amine-free MCHR1 antagonists that structurally featured an imidazo[1,2-a]pyridine-based bicyclic motif. To investigate imidazopyridine variants with lower basicity and less potential to inhibit cytochrome P450 3A4 (CYP3A4), we designed pyridine-2(1H)-ones bearing various less basic bicyclic motifs. Among these, a lead compound 6a bearing a 1H-benzimidazole motif showed comparable binding affinity to MCHR1 to the corresponding imidazopyridine derivative 1. Optimization of 6a afforded a series of potent thiophene derivatives (6q-u); however, most of these were found to cause time-dependent inhibition (TDI) of CYP3A4. As bioactivation of thiophenes to form sulfoxide or epoxide species was considered to be a major cause of CYP3A4 TDI, we introduced electron withdrawing groups on the thiophene and found that a CF3 group on the ring or a Cl adjacent to the sulfur atom helped prevent CYP3A4 TDI. Consequently, 4-[(5-chlorothiophen-2-yl)methoxy]-1-(2-cyclopropyl-1-methyl-1H-benzimidazol-6-yl)pyridin-2(1H)-one (6s) was identified as a potent MCHR1 antagonist without the risk of CYP3A4 TDI, which exhibited a promising safety profile including low CYP3A4 inhibition and exerted significant antiobesity effects in diet-induced obese F344 rats. Copyright © 2016 Elsevier Ltd. All rights reserved.

  3. Calmodulins from Schistosoma mansoni: Biochemical analysis and interaction with IQ-motifs from voltage-gated calcium channels.

    PubMed

    Thomas, Charlotte M; Timson, David J

    2018-05-17

    The trematode Schistosoma mansoni is a causative agent of schistosomiasis, the second most common parasitic disease of humans after malaria. Calcium homeostasis and calcium-mediated signalling pathways are of particular interest in this species. The drug of choice for treating schistosomiasis, praziquantel, disrupts the regulation of calcium uptake and there is interest in exploiting calcium-mediated processes for future drug discovery. Calmodulin is a calcium sensing protein, present in most eukaryotes. It is a critical regulator of processes as diverse as muscle contraction, cell division and, partly through interaction with voltage-gated calcium channels, intra-cellular calcium concentrations. S. mansoni expresses two highly similar calmodulins - SmCaM1 and SmCaM2. Both proteins interact with calcium, manganese, cadmium (II), iron (II) and lead ions in native gel electrophoresis. These ions also cause conformational changes in the proteins resulting in the exposure of a more hydrophobic surface (as demonstrated by anilinonaphthalene-8-sulfonate fluorescence assays). The proteins are primarily dimeric in the absence of calcium ions, but monomeric in the presence of this ion. Both SmCaM1 and SmCaM2 interact with a peptide corresponding to an IQ-motif derived from the α-subunit of the voltage-gated calcium channel SmCav1B (residues 1923-1945). Both proteins bound with slightly higher affinity in the presence of calcium ions. However, there was no difference between the affinities of the two proteins for the peptide. This interaction could be antagonised by chlorpromazine and trifluoperazine, but not praziquantel or thiamylal. Interestingly no interaction could be detected with the other three IQ-motifs identified in S. mansoni voltage-gated ion calcium channels. Copyright © 2018 Elsevier Ltd. All rights reserved.

  4. Comparative analysis of plant genomes allows the definition of the "Phytolongins": a novel non-SNARE longin domain protein family

    PubMed Central

    2009-01-01

    Background Subcellular trafficking is a hallmark of eukaryotic cells. Because of their pivotal role in the process, a great deal of attention has been paid to the SNARE proteins. Most R-SNAREs, or "longins", however, also possess a highly conserved, N-terminal fold. This "longin domain" is known to play multiple roles in regulating SNARE activity and targeting via interaction with other trafficking proteins. However, the diversity and complement of longins in eukaryotes is poorly understood. Results Our comparative genome survey identified a novel family of longin-related proteins, dubbed the "Phytolongins" because they are specific to land plants. Phytolongins share with longins the N-terminal longin domain and the C-terminal transmembrane domain; however, in the central region, the SNARE motif is replaced by a novel region. Phylogenetic analysis pinpoints the Phytolongins as a derivative of the plant specific VAMP72 longin sub-family and allows elucidation of Phytolongin evolution. Conclusion "Longins" have been defined as R-SNAREs composed of both a longin domain and a SNARE motif. However, expressed gene isoforms and splice variants of longins are examples of non-SNARE motif containing longins. The discovery of Phytolongins, a family of non-SNARE longin domain proteins, together with recent evidence on the conservation of the longin-like fold in proteins involved in both vesicle fusion (e.g. the Trs20 tether) and vesicle formation (e.g. σ and μ adaptin) highlight the importance of the longin-like domain in protein trafficking and suggest that it was one of the primordial building blocks of the eukaryotic membrane-trafficking machinery. PMID:19889231

  5. Crystal Structure of Faradaurate-279: Au279(SPh-tBu)84 Plasmonic Nanocrystal Molecules.

    PubMed

    Sakthivel, Naga Arjun; Theivendran, Shevanuja; Ganeshraj, Vigneshraja; Oliver, Allen G; Dass, Amala

    2017-11-01

    We report the discovery of an unprecedentedly large, 2.2 nm diameter, thiolate protected gold nanocrystal characterized by single crystal X-ray crystallography (sc-XRD), Au279(SPh-tBu)84 named Faradaurate-279 (F-279) in honor of Michael Faraday's (1857) pioneering work on nanoparticles. F-279 nanocrystal has a core-shell structure containing a truncated octahedral core with bulk face-centered cubic-like arrangement, yet a nanomolecule with a precise number of metal atoms and thiolate ligands. The Au279S84 geometry was established from a low-temperature 120 K sc-XRD study at 0.90 Å resolution. The atom counts in core-shell structure of Au279 follows the mathematical formula for magic number shells: Au@Au12@Au42@Au92@Au54, which is further protected by a final shell of Au48. Au249 core is protected by three types of staple motifs, namely: 30 bridging, 18 monomeric, and 6 dimeric staple motifs. Despite the presence of such diverse staple motifs, Au279S84 structure has a chiral pseudo-D3 symmetry. The core-shell structure can be viewed as nested, concentric polyhedra, containing a total of five forms of Archimedean solids. A comparison between the Au279 and Au309 cuboctahedral superatom model in shell-wise growth is illustrated. F-279 can be synthesized and isolated in high purity in milligram quantities using size exclusion chromatography, as evidenced by mass spectrometry. Electrospray ionization-mass spectrometry independently verifies the X-ray diffraction study based heavy atoms formula, Au279S84, and establishes the molecular formula with the complete ligands, namely, Au279(SPh-tBu)84. It is also the smallest gold nanocrystal to exhibit metallic behavior, with a surface plasmon resonance band around 510 nm.

  6. Discovery: an interactive resource for the rational selection and comparison of putative drug target proteins in malaria

    PubMed Central

    Joubert, Fourie; Harrison, Claudia M; Koegelenberg, Riaan J; Odendaal, Christiaan J; de Beer, Tjaart AP

    2009-01-01

    Background Up to half a billion human clinical cases of malaria are reported each year, resulting in about 2.7 million deaths, most of which occur in sub-Saharan Africa. Due to the over-and misuse of anti-malarials, widespread resistance to all the known drugs is increasing at an alarming rate. Rational methods to select new drug target proteins and lead compounds are urgently needed. The Discovery system provides data mining functionality on extensive annotations of five malaria species together with the human and mosquito hosts, enabling the selection of new targets based on multiple protein and ligand properties. Methods A web-based system was developed where researchers are able to mine information on malaria proteins and predicted ligands, as well as perform comparisons to the human and mosquito host characteristics. Protein features used include: domains, motifs, EC numbers, GO terms, orthologs, protein-protein interactions, protein-ligand interactions and host-pathogen interactions among others. Searching by chemical structure is also available. Results An in silico system for the selection of putative drug targets and lead compounds is presented, together with an example study on the bifunctional DHFR-TS from Plasmodium falciparum. Conclusion The Discovery system allows for the identification of putative drug targets and lead compounds in Plasmodium species based on the filtering of protein and chemical properties. PMID:19642978

  7. Oak Ridge Graph Analytics for Medical Innovation (ORiGAMI)

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Roberts, Larry W.; Lee, Sangkeun

    2016-01-01

    In this era of data-driven decisions and discovery where Big Data is producing Bigger Data, data scientists at the Oak Ridge National Laboratory are leveraging unique leadership infrastructure (e.g., Urika XA and Urika GD appliances) to develop scalable algorithms for semantic, logical and statistical reasoning with Big Data (i.e., data stored in databases as well as unstructured data in documents). ORiGAMI is a next-generation knowledge-discovery framework that is: (a) knowledge nurturing (i.e., evolves seamlessly with newer knowledge and data), (b) smart and curious (i.e. using information-foraging and reasoning algorithms to digest content) and (c) synergistic (i.e., interfaces computers with what they do best to help subject-matter-experts do their best). ORiGAMI has been demonstrated using the National Library of Medicine's SEMANTIC MEDLINE (archive of medical knowledge since 1994).

  8. k-merSNP discovery: Software for alignment- and reference-free scalable SNP discovery, phylogenetics, and annotation for hundreds of microbial genomes

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    With the flood of whole genome finished and draft microbial sequences, we need faster, more scalable bioinformatics tools for sequence comparison. An algorithm is described to find single nucleotide polymorphisms (SNPs) in whole genome data. It scales to hundreds of bacterial or viral genomes, and can be used for finished and/or draft genomes available as unassembled contigs or raw, unassembled reads. The method is fast to compute, finding SNPs and building a SNP phylogeny in minutes to hours, depending on the size and diversity of the input sequences. The SNP-based trees that result are consistent with known taxonomy and trees determined in other studies. The approach we describe can handle many gigabases of sequence in a single run. The algorithm is based on k-mer analysis.
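
    A k-mer-based, alignment-free SNP search can be sketched as follows. This is a toy illustration and not the described software: it assumes an odd k, treats the central base of each k-mer as the candidate SNP, requires the flanking bases to occur in every genome, and ignores reverse complements. The function name `kmer_snps` and the example sequences are made up.

```python
from collections import defaultdict


def kmer_snps(genomes, k=5):
    """Return {(left_flank, right_flank): {genome_name: central_base}} for loci
    whose flanks occur in every genome but whose central base varies."""
    if k % 2 == 0:
        raise ValueError("k must be odd so that each k-mer has a central base")
    half = k // 2
    loci = defaultdict(dict)
    ambiguous = set()
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            key = (kmer[:half], kmer[half + 1:])
            base = kmer[half]
            if loci[key].get(name, base) != base:
                # Same flanks with two different central bases inside one genome.
                ambiguous.add(key)
            loci[key][name] = base
    return {
        key: alleles
        for key, alleles in loci.items()
        if key not in ambiguous
        and len(alleles) == len(genomes)
        and len(set(alleles.values())) > 1
    }


# Two made-up sequences differing at a single position.
genomes = {"g1": "ACGTACGGA", "g2": "ACGTTCGGA"}
print(kmer_snps(genomes))
```

    Because only dictionary lookups on k-mers are needed, no reference genome or multiple alignment is required, which is the property the abstract emphasises.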

  9. iCOSSY: An Online Tool for Context-Specific Subnetwork Discovery from Gene Expression Data

    PubMed Central

    Saha, Ashis; Jeon, Minji; Tan, Aik Choon; Kang, Jaewoo

    2015-01-01

    Pathway analyses help reveal underlying molecular mechanisms of complex biological phenotypes. Biologists tend to perform multiple pathway analyses on the same dataset, as there is no single answer. It is often inefficient for them to implement and/or install all the algorithms by themselves. Online tools can help the community in this regard. Here we present an online gene expression analytical tool called iCOSSY which implements a novel pathway-based COntext-specific Subnetwork discoverY (COSSY) algorithm. iCOSSY also includes a few modifications of COSSY to increase its reliability and interpretability. Users can upload their gene expression datasets, and discover important subnetworks of closely interacting molecules to differentiate between two phenotypes (context). They can also interactively visualize the resulting subnetworks. iCOSSY is a web server that finds subnetworks that are differentially expressed in two phenotypes. Users can visualize the subnetworks to understand the biology of the difference. PMID:26147457

  10. CompariMotif: quick and easy comparisons of sequence motifs.

    PubMed

    Edwards, Richard J; Davey, Norman E; Shields, Denis C

    2008-05-15

    CompariMotif is a novel tool for making motif-motif comparisons, identifying and describing similarities between regular expression motifs. CompariMotif can identify a number of different relationships between motifs, including exact matches, variants of degenerate motifs and complex overlapping motifs. Motif relationships are scored using shared information content, allowing the best matches to be easily identified in large comparisons. Many input and search options are available, enabling a list of motifs to be compared to itself (to identify recurring motifs) or to datasets of known motifs. CompariMotif can be run online at http://bioware.ucd.ie/ and is freely available for academic use as a set of open source Python modules under a GNU General Public License from http://bioinformatics.ucd.ie/shields/software/comparimotif/

  11. Enabling the Discovery of Recurring Anomalies in Aerospace System Problem Reports using High-Dimensional Clustering Techniques

    NASA Technical Reports Server (NTRS)

    Srivastava, Ashok, N.; Akella, Ram; Diev, Vesselin; Kumaresan, Sakthi Preethi; McIntosh, Dawn M.; Pontikakis, Emmanuel D.; Xu, Zuobing; Zhang, Yi

    2006-01-01

    This paper describes the results of a significant research and development effort conducted at NASA Ames Research Center to develop new text mining techniques to discover anomalies in free-text reports regarding system health and safety of two aerospace systems. We discuss two problems of significant importance in the aviation industry. The first problem is that of automatic anomaly discovery about an aerospace system through the analysis of tens of thousands of free-text problem reports that are written about the system. The second problem that we address is that of automatic discovery of recurring anomalies, i.e., anomalies that may be described in different ways by different authors, at varying times and under varying conditions, but that are truly about the same part of the system. The intent of recurring anomaly identification is to determine project or system weakness or high-risk issues. The discovery of recurring anomalies is a key goal in building safe, reliable, and cost-effective aerospace systems. We address the anomaly discovery problem on thousands of free-text reports using two strategies: (1) as an unsupervised learning problem where an algorithm takes free-text reports as input and automatically groups them into different bins, where each bin corresponds to a different unknown anomaly category; and (2) as a supervised learning problem where the algorithm classifies the free-text reports into one of a number of known anomaly categories. We then discuss the application of these methods to the problem of discovering recurring anomalies. In fact the special nature of recurring anomalies (very small cluster sizes) requires incorporating new methods and measures to enhance the original approach for anomaly detection.

  12. Systems biology by the rules: hybrid intelligent systems for pathway modeling and discovery.

    PubMed

    Bosl, William J

    2007-02-15

    Expert knowledge in journal articles is an important source of data for reconstructing biological pathways and creating new hypotheses. An important need for medical research is to integrate this data with high throughput sources to build useful models that span several scales. Researchers traditionally use mental models of pathways to integrate information and develop new hypotheses. Unfortunately, the amount of information is often overwhelming and these mental models are inadequate for predicting the dynamic response of complex pathways. Hierarchical computational models that allow exploration of semi-quantitative dynamics are useful systems biology tools for theoreticians, experimentalists and clinicians and may provide a means for cross-communication. A novel approach for biological pathway modeling based on hybrid intelligent systems or soft computing technologies is presented here. Intelligent hybrid systems, which refer to several related computing methods such as fuzzy logic, neural nets, genetic algorithms, and statistical analysis, have become ubiquitous in engineering applications for complex control system modeling and design. Biological pathways may be considered to be complex control systems, which medicine tries to manipulate to achieve desired results. Thus, hybrid intelligent systems may provide a useful tool for modeling biological system dynamics and computational exploration of new drug targets. A new modeling approach based on these methods is presented in the context of hedgehog regulation of the cell cycle in granule cells. Code and input files can be found at the Bionet website: www.chip.ord/~wbosl/Software/Bionet. This paper presents the algorithmic methods needed for modeling complicated biochemical dynamics using rule-based models to represent expert knowledge in the context of cell cycle regulation and tumor growth. 
A notable feature of this modeling approach is that it allows biologists to build complex models from their knowledge base without the need to translate that knowledge into mathematical form. Dynamics on several levels, from molecular pathways to tissue growth, are seamlessly integrated. A number of common network motifs are examined and used to build a model of hedgehog regulation of the cell cycle in cerebellar neurons, which is believed to play a key role in the etiology of medulloblastoma, a devastating childhood brain cancer.

  13. A physarum-inspired prize-collecting steiner tree approach to identify subnetworks for drug repositioning.

    PubMed

    Sun, Yahui; Hameed, Pathima Nusrath; Verspoor, Karin; Halgamuge, Saman

    2016-12-05

    Drug repositioning can reduce the time, costs and risks of drug development by identifying new therapeutic effects for known drugs. It is challenging to reposition drugs as pharmacological data is large and complex. Subnetwork identification has already been used to simplify the visualization and interpretation of biological data, but it has not been applied to drug repositioning so far. In this paper, we fill this gap by proposing a new Physarum-inspired Prize-Collecting Steiner Tree algorithm to identify subnetworks for drug repositioning. Drug Similarity Networks (DSN) are generated using the chemical, therapeutic, protein, and phenotype features of drugs. In DSNs, vertex prizes and edge costs represent the similarities and dissimilarities between drugs respectively, and terminals represent drugs in the cardiovascular class, as defined in the Anatomical Therapeutic Chemical classification system. A new Physarum-inspired Prize-Collecting Steiner Tree algorithm is proposed in this paper to identify subnetworks. We apply both the proposed algorithm and the widely-used GW algorithm to identify subnetworks in our 18 generated DSNs. In these DSNs, our proposed algorithm identifies subnetworks with an average Rand Index of 81.1%, while the GW algorithm can only identify subnetworks with an average Rand Index of 64.1%. We select 9 subnetworks with high Rand Index to find drug repositioning opportunities. 10 frequently occurring drugs in these subnetworks are identified as candidates to be repositioned for cardiovascular diseases. We find evidence to support previous discoveries that nitroglycerin, theophylline and acarbose may be able to be repositioned for cardiovascular diseases. Moreover, we identify seven previously unknown drug candidates that also may interact with the biological cardiovascular system. These discoveries show our proposed Prize-Collecting Steiner Tree approach as a promising strategy for drug repositioning.
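
    For readers unfamiliar with the Rand Index used as the evaluation measure above, a minimal sketch of the standard pairwise definition follows. How the authors mapped subnetworks and drug classes onto the two labelings is not described in the abstract, so the label lists here are placeholders and the function name is illustrative.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two labelings agree, i.e. pairs placed
    together in both labelings or apart in both labelings."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("at least two items are required")
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Placeholder labelings of four items.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # identical partitions, labels renamed
print(rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

    The index depends only on which items are grouped together, so renaming the groups leaves it unchanged.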

  14. Prediction of MHC class II binding affinity using SMM-align, a novel stabilization matrix alignment method

    PubMed Central

    Nielsen, Morten; Lundegaard, Claus; Lund, Ole

    2007-01-01

    Background Antigen presenting cells (APCs) sample the extra cellular space and present peptides from here to T helper cells, which can be activated if the peptides are of foreign origin. The peptides are presented on the surface of the cells in complex with major histocompatibility class II (MHC II) molecules. Identification of peptides that bind MHC II molecules is thus a key step in rational vaccine design and developing methods for accurate prediction of the peptide:MHC interactions play a central role in epitope discovery. The MHC class II binding groove is open at both ends making the correct alignment of a peptide in the binding groove a crucial part of identifying the core of an MHC class II binding motif. Here, we present a novel stabilization matrix alignment method, SMM-align, that allows for direct prediction of peptide:MHC binding affinities. The predictive performance of the method is validated on a large MHC class II benchmark data set covering 14 HLA-DR (human MHC) and three mouse H2-IA alleles. Results The predictive performance of the SMM-align method was demonstrated to be superior to that of the Gibbs sampler, TEPITOPE, SVRMHC, and MHCpred methods. Cross validation between peptide data set obtained from different sources demonstrated that direct incorporation of peptide length potentially results in over-fitting of the binding prediction method. Focusing on amino terminal peptide flanking residues (PFR), we demonstrate a consistent gain in predictive performance by favoring binding registers with a minimum PFR length of two amino acids. Visualizing the binding motif as obtained by the SMM-align and TEPITOPE methods highlights a series of fundamental discrepancies between the two predicted motifs. For the DRB1*1302 allele for instance, the TEPITOPE method favors basic amino acids at most anchor positions, whereas the SMM-align method identifies a preference for hydrophobic or neutral amino acids at the anchors. 
Conclusion The SMM-align method was shown to outperform other state of the art MHC class II prediction methods. The method predicts quantitative peptide:MHC binding affinity values, making it ideally suited for rational epitope discovery. The method has been trained and evaluated on what is, to our knowledge, the largest publicly available benchmark data set and covers the nine HLA-DR supertypes suggested as well as three mouse H2-IA alleles. Both the peptide benchmark data set and the SMM-align prediction method (NetMHCII) are made publicly available. PMID:17608956

  15. Prediction of MHC class II binding affinity using SMM-align, a novel stabilization matrix alignment method.

    PubMed

    Nielsen, Morten; Lundegaard, Claus; Lund, Ole

    2007-07-04

    Antigen presenting cells (APCs) sample the extra cellular space and present peptides from here to T helper cells, which can be activated if the peptides are of foreign origin. The peptides are presented on the surface of the cells in complex with major histocompatibility class II (MHC II) molecules. Identification of peptides that bind MHC II molecules is thus a key step in rational vaccine design and developing methods for accurate prediction of the peptide:MHC interactions play a central role in epitope discovery. The MHC class II binding groove is open at both ends making the correct alignment of a peptide in the binding groove a crucial part of identifying the core of an MHC class II binding motif. Here, we present a novel stabilization matrix alignment method, SMM-align, that allows for direct prediction of peptide:MHC binding affinities. The predictive performance of the method is validated on a large MHC class II benchmark data set covering 14 HLA-DR (human MHC) and three mouse H2-IA alleles. The predictive performance of the SMM-align method was demonstrated to be superior to that of the Gibbs sampler, TEPITOPE, SVRMHC, and MHCpred methods. Cross validation between peptide data set obtained from different sources demonstrated that direct incorporation of peptide length potentially results in over-fitting of the binding prediction method. Focusing on amino terminal peptide flanking residues (PFR), we demonstrate a consistent gain in predictive performance by favoring binding registers with a minimum PFR length of two amino acids. Visualizing the binding motif as obtained by the SMM-align and TEPITOPE methods highlights a series of fundamental discrepancies between the two predicted motifs. For the DRB1*1302 allele for instance, the TEPITOPE method favors basic amino acids at most anchor positions, whereas the SMM-align method identifies a preference for hydrophobic or neutral amino acids at the anchors. 
The SMM-align method was shown to outperform other state of the art MHC class II prediction methods. The method predicts quantitative peptide:MHC binding affinity values, making it ideally suited for rational epitope discovery. The method has been trained and evaluated on what is, to our knowledge, the largest publicly available benchmark data set and covers the nine HLA-DR supertypes suggested as well as three mouse H2-IA alleles. Both the peptide benchmark data set and the SMM-align prediction method (NetMHCII) are made publicly available.

  16. Efficient Algorithms for Bayesian Network Parameter Learning from Incomplete Data

    DTIC Science & Technology

    2015-07-01

    Efficient Algorithms for Bayesian Network Parameter Learning from Incomplete Data Guy Van den Broeck∗ and Karthika Mohan∗ and Arthur Choi and Adnan ...notwithstanding any other provision of law, no person shall be subject to a penalty for failing to comply with a collection of information if it does...Wasserman, L. (2011). All of Statistics. Springer Science & Business Media. Yaramakala, S., & Margaritis, D. (2005). Speculative markov blanket discovery for optimal feature selection. In Proceedings of ICDM.

  17. Molecular population dynamics of DNA structures in a bcl-2 promoter sequence is regulated by small molecules and the transcription factor hnRNP LL.

    PubMed

    Cui, Yunxi; Koirala, Deepak; Kang, HyunJin; Dhakal, Soma; Yangyuoru, Philip; Hurley, Laurence H; Mao, Hanbin

    2014-05-01

    Minute difference in free energy change of unfolding among structures in an oligonucleotide sequence can lead to a complex population equilibrium, which is rather challenging for ensemble techniques to decipher. Herein, we introduce a new method, molecular population dynamics (MPD), to describe the intricate equilibrium among non-B deoxyribonucleic acid (DNA) structures. Using mechanical unfolding in laser tweezers, we identified six DNA species in a cytosine (C)-rich bcl-2 promoter sequence. Population patterns of these species with and without a small molecule (IMC-76 or IMC-48) or the transcription factor hnRNP LL are compared to reveal the MPD of different species. With a pattern recognition algorithm, we found that IMC-48 and hnRNP LL share 80% similarity in stabilizing i-motifs with 60 s incubation. In contrast, IMC-76 demonstrates an opposite behavior, preferring flexible DNA hairpins. With 120-180 s incubation, IMC-48 and hnRNP LL destabilize i-motifs, which has been previously proposed to activate bcl-2 transcriptions. These results provide strong support, from the population equilibrium perspective, that small molecules and hnRNP LL can modulate bcl-2 transcription through interaction with i-motifs. The excellent agreement with biochemical results firmly validates the MPD analyses, which, we expect, can be widely applicable to investigate complex equilibrium of biomacromolecules. © 2014 The Author(s). Published by Oxford University Press [on behalf of Nucleic Acids Research].

  18. Better cancer biomarker discovery through better study design.

    PubMed

    Rundle, Andrew; Ahsan, Habibul; Vineis, Paolo

    2012-12-01

    High-throughput laboratory technologies coupled with sophisticated bioinformatics algorithms have tremendous potential for discovering novel biomarkers, or profiles of biomarkers, that could serve as predictors of disease risk, response to treatment or prognosis. We discuss methodological issues in wedding high-throughput approaches for biomarker discovery with the case-control study designs typically used in biomarker discovery studies, especially focusing on nested case-control designs. We review principles for nested case-control study design in relation to biomarker discovery studies and describe how the efficiency of biomarker discovery can be affected by study design choices. We develop a simulated prostate cancer cohort data set and a series of biomarker discovery case-control studies nested within the cohort to illustrate how study design choices can influence the biomarker discovery process. Common elements of nested case-control design, incidence density sampling and matching of controls to cases are not typically factored correctly into biomarker discovery analyses, inducing bias in the discovery process. We illustrate how incidence density sampling and matching of controls to cases reduce the apparent specificity of truly valid biomarkers 'discovered' in a nested case-control study. We also propose and demonstrate a new case-control matching protocol, which we call 'antimatching', that improves the efficiency of biomarker discovery studies. For a valid, but as yet undiscovered, biomarker(s), disjunctions between correctly designed epidemiologic studies and the practice of biomarker discovery reduce the likelihood that true biomarker(s) will be discovered and increase the false-positive discovery rate. © 2012 The Authors. European Journal of Clinical Investigation © 2012 Stichting European Society for Clinical Investigation Journal Foundation.

  19. Two-way learning with one-way supervision for gene expression data.

    PubMed

    Wong, Monica H T; Mutch, David M; McNicholas, Paul D

    2017-03-04

    A family of parsimonious Gaussian mixture models for the biclustering of gene expression data is introduced. Biclustering is accommodated by adopting a mixture of factor analyzers model with a binary, row-stochastic factor loadings matrix. This particular form of factor loadings matrix results in a block-diagonal covariance matrix, which is a useful property in gene expression analyses, specifically in biomarker discovery scenarios where blood can potentially act as a surrogate tissue for other less accessible tissues. Prior knowledge of the factor loadings matrix is useful in this application and is reflected in the one-way supervised nature of the algorithm. Additionally, the factor loadings matrix can be assumed to be constant across all components because of the relationship desired between the various types of tissue samples. Parameter estimates are obtained through a variant of the expectation-maximization algorithm and the best-fitting model is selected using the Bayesian information criterion. The family of models is demonstrated using simulated data and two real microarray data sets. The first real data set is from a rat study that investigated the influence of diabetes on gene expression in different tissues. The second real data set is from a human transcriptomics study that focused on blood and immune tissues. The microarray data sets illustrate the biclustering family's performance in biomarker discovery involving peripheral blood as surrogate biopsy material. The simulation studies indicate that the algorithm identifies the correct biclusters, most optimally when the number of observation clusters is known. Moreover, the biclustering algorithm identified biclusters comprised of biologically meaningful data related to insulin resistance and immune function in the rat and human real data sets, respectively. 
Initial results using real data show that this biclustering technique provides a novel approach for biomarker discovery by enabling blood to be used as a surrogate for hard-to-obtain tissues.
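
    As a rough illustration of the block-diagonal property mentioned above, the sketch below builds a covariance matrix of the usual factor-analysis form (loadings times their transpose, plus a diagonal noise term) from a binary, row-stochastic loadings matrix. The function name, the toy loadings and the noise values are hypothetical and are not taken from the paper, whose parameterization may differ.

```python
def covariance_from_loadings(loadings, noise):
    """Return loadings @ loadings^T + diag(noise) as a nested list.

    Assumption: each row of `loadings` is binary and sums to 1, so every
    variable (gene) is tied to exactly one factor.
    """
    p = len(loadings)
    q = len(loadings[0])
    cov = [
        [sum(loadings[i][k] * loadings[j][k] for k in range(q)) for j in range(p)]
        for i in range(p)
    ]
    for i in range(p):
        cov[i][i] += noise[i]
    return cov


# Hypothetical toy data: 5 variables, 2 factors, one 1 per row.
LOADINGS = [[1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]
NOISE = [0.5, 0.5, 0.5, 0.5, 0.5]
COV = covariance_from_loadings(LOADINGS, NOISE)

for row in COV:
    print(row)
```

    With variables ordered by factor, entries linking variables on different factors are zero, so the matrix splits into one block per factor.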

  20. DASS: efficient discovery and p-value calculation of substructures in unordered data.

    PubMed

    Hollunder, Jens; Friedel, Maik; Beyer, Andreas; Workman, Christopher T; Wilhelm, Thomas

    2007-01-01

    Pattern identification in biological sequence data is one of the main objectives of bioinformatics research. However, few methods are available for detecting patterns (substructures) in unordered datasets. Data mining algorithms mainly developed outside the realm of bioinformatics have been adapted for that purpose, but typically do not determine the statistical significance of the identified patterns. Moreover, these algorithms do not exploit the often modular structure of biological data. We present the algorithm DASS (Discovery of All Significant Substructures) that first identifies all substructures in unordered data (DASS(Sub)) in a manner that is especially efficient for modular data. In addition, DASS calculates the statistical significance of the identified substructures, for sets with at most one element of each type (DASS(P(set))), or for sets with multiple occurrence of elements (DASS(P(mset))). The power and versatility of DASS is demonstrated by four examples: combinations of protein domains in multi-domain proteins, combinations of proteins in protein complexes (protein subcomplexes), combinations of transcription factor target sites in promoter regions and evolutionarily conserved protein interaction subnetworks. The program code and additional data are available at http://www.fli-leibniz.de/tsb/DASS

  1. Human versus Robots in the Discovery and Crystallization of Gigantic Polyoxometalates.

    PubMed

    Duros, Vasilios; Grizou, Jonathan; Xuan, Weimin; Hosni, Zied; Long, De-Liang; Miras, Haralampos N; Cronin, Leroy

    2017-08-28

    The discovery of new gigantic molecules formed by self-assembly and crystal growth is challenging as it combines two contingent events: first the formation of a new molecule, and second its crystallization. Herein, we construct a workflow that can be followed manually or by a robot to probe the envelope of both events and employ it for a new polyoxometalate cluster, Na6[Mo120Ce6O366H12(H2O)78]⋅200H2O (1), which has a trigonal-ring type architecture (yield 4.3% based on Mo). Its synthesis and crystallization were probed using an active machine-learning algorithm developed by us to explore the crystallization space; the algorithm results were compared with those obtained by human experimenters. The algorithm-based search is able to cover ca. 9 times more crystallization space than a random search and ca. 6 times more than humans and increases the crystallization prediction accuracy to 82.4±0.7% over 77.1±0.9% from human experimenters. © 2017 The Authors. Published by Wiley-VCH Verlag GmbH & Co. KGaA.

  2. Discovery of error-tolerant biclusters from noisy gene expression data.

    PubMed

    Gupta, Rohit; Rao, Navneet; Kumar, Vipin

    2011-11-24

    An important analysis performed on microarray gene-expression data is to discover biclusters, which denote groups of genes that are coherently expressed for a subset of conditions. Various biclustering algorithms have been proposed to find different types of biclusters from these real-valued gene-expression data sets. However, these algorithms suffer from several limitations such as inability to explicitly handle errors/noise in the data; difficulty in discovering small biclusters due to their top-down approach; and inability of some of the approaches to find overlapping biclusters, which is crucial as many genes participate in multiple biological processes. Association pattern mining also produces biclusters as its result and can naturally address some of these limitations. However, traditional association mining only finds exact biclusters, which limits its applicability in real-life data sets where the biclusters may be fragmented due to random noise/errors. Moreover, as such methods only work with binary or boolean attributes, their application to gene-expression data requires transforming real-valued attributes to binary attributes, which often results in loss of information. Many past approaches have tried to address the issues of noise and of handling real-valued attributes independently, but there is no systematic approach that addresses both of these issues together. In this paper, we first propose a novel error-tolerant biclustering model, 'ET-bicluster', and then propose a bottom-up heuristic-based mining algorithm to sequentially discover error-tolerant biclusters directly from real-valued gene-expression data. The efficacy of our proposed approach is illustrated by comparing it with a recent approach, RAP, in the context of two biological problems: discovery of functional modules and discovery of biomarkers. 
For the first problem, two real-valued S. cerevisiae microarray gene-expression data sets are used to demonstrate that the biclusters obtained from the ET-bicluster approach not only recover a larger set of genes than those obtained from the RAP approach but also have higher functional coherence as evaluated using GO-based functional enrichment analysis. The statistical significance of the discovered error-tolerant biclusters, as estimated using two randomization tests, reveals that they are indeed biologically meaningful and statistically significant. For the second problem of biomarker discovery, we used four real-valued breast cancer microarray gene-expression data sets and evaluated the biomarkers obtained using MSigDB gene sets. The results obtained for both problems, functional module discovery and biomarker discovery, clearly demonstrate the usefulness of the proposed ET-bicluster approach and illustrate the importance of explicitly incorporating noise/errors in discovering coherent groups of genes from gene-expression data.

  3. A Novel Family in Medicago truncatula Consisting of More Than 300 Nodule-Specific Genes Coding for Small, Secreted Polypeptides with Conserved Cysteine Motifs

    PubMed Central

    Mergaert, Peter; Nikovics, Krisztina; Kelemen, Zsolt; Maunoury, Nicolas; Vaubert, Danièle; Kondorosi, Adam; Kondorosi, Eva

    2003-01-01

    Transcriptome analysis of Medicago truncatula nodules has led to the discovery of a gene family named NCR (nodule-specific cysteine rich) with more than 300 members. The encoded polypeptides were short (60–90 amino acids), carried a conserved signal peptide, and, except for a conserved cysteine motif, displayed otherwise extensive sequence divergence. Family members were found in pea (Pisum sativum), broad bean (Vicia faba), white clover (Trifolium repens), and Galega orientalis but not in other plants, including other legumes, suggesting that the family might be specific for galegoid legumes forming indeterminate nodules. Gene expression of all family members was restricted to nodules except for two, also expressed in mycorrhizal roots. NCR genes exhibited distinct temporal and spatial expression patterns in nodules and, thus, were coupled to different stages of development. The signal peptide targeted the polypeptides in the secretory pathway, as shown by green fluorescent protein fusions expressed in onion (Allium cepa) epidermal cells. Coregulation of certain NCR genes with genes coding for a potentially secreted calmodulin-like protein and for a signal peptide peptidase suggests a concerted action in nodule development. Potential functions of the NCR polypeptides in cell-to-cell signaling and creation of a defense system are discussed. PMID:12746522

  4. Genome-wide characterization of GRAS family genes in Medicago truncatula reveals their evolutionary dynamics and functional diversification

    PubMed Central

    Zhang, Hailing; Cao, Yingping; Shang, Chen; Li, Jikai; Wang, Jianli; Wu, Zhenying; Ma, Lichao; Qi, Tianxiong; Fu, Chunxiang; Hu, Baozhong

    2017-01-01

    The GRAS gene family is a large plant-specific family of transcription factors that are involved in diverse processes during plant development. Medicago truncatula is an ideal model plant for genetic research in legumes, and specifically for studying nodulation, which is crucial for nitrogen fixation. In this study, 59 MtGRAS genes were identified and classified into eight distinct subgroups based on phylogenetic relationships. Motifs located in the C-termini were conserved across the subgroups, while motifs in the N-termini were subfamily specific. Gene duplication was the main evolutionary force for MtGRAS expansion, especially proliferation of the LISCL subgroup. Seventeen duplicated genes showed strong effects of purifying selection and diverse expression patterns, highlighting their functional importance and diversification after duplication. Thirty MtGRAS genes, including NSP1 and NSP2, were preferentially expressed in nodules, indicating possible roles in the process of nodulation. A transcriptome study, combined with gene expression analysis under different stress conditions, suggested potential functions of MtGRAS genes in various biological pathways and stress responses. Taken together, these comprehensive analyses provide basic information for understanding the potential functions of GRAS genes, and will facilitate further discovery of MtGRAS gene functions. PMID:28945786

  5. Structures and magnetic properties of Co-Zr-B magnets studied by first-principles calculations

    DOE PAGES

    Zhao, Xin; Ke, Liqin; Nguyen, Manh Cuong; ...

    2015-06-23

    The structures and magnetic properties of Co-Zr-B alloys near the composition of Co5Zr with B at. % ≤6% were studied using adaptive genetic algorithm and first-principles calculations. The energy and magnetic moment contour maps as a function of chemical composition were constructed for the Co-Zr-B magnet alloys through extensive structure searches and calculations. We found that the Co-Zr-B system exhibits the same structure motif as the “Co11Zr2” polymorphs, and such motif plays a key role in achieving strong magnetic anisotropy. Boron atoms were found to be able to substitute cobalt atoms or occupy the “interruption” sites. First-principles calculations showed that the magnetocrystalline anisotropy energies of the boron-doped alloys are close to that of the high-temperature rhombohedral Co5Zr phase and larger than that of the low-temperature Co5.25Zr phase. As a result, our calculations provide useful guidelines for further experimental optimization of the magnetic performances of these alloys.

  6. Medicinal chemistry in drug discovery in big pharma: past, present and future.

    PubMed

    Campbell, Ian B; Macdonald, Simon J F; Procopiou, Panayiotis A

    2018-02-01

    The changes in synthetic and medicinal chemistry and related drug discovery science as practiced in big pharma over the past few decades are described. These have been predominantly driven by wider changes in society namely the computer, internet and globalisation. Thoughts about the future of medicinal chemistry are also discussed including sharing the risks and costs of drug discovery and the future of outsourcing. The continuing impact of access to substantial computing power and big data, the use of algorithms in data analysis and drug design are also presented. The next generation of medicinal chemists will communicate in ways that reflect social media and the results of constantly being connected to each other and data. Copyright © 2017. Published by Elsevier Ltd.

  7. The discovery of the causes of leprosy: A computational analysis

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Corruble, V.; Ganascia, J.G.

    1996-12-31

    The role played by inductive inference has been studied extensively in the field of Scientific Discovery. The work presented here tackles the problem of induction in medical research. The discovery of the causes of leprosy is analyzed and simulated using computational means. An inductive algorithm is proposed, which is successful in simulating some essential steps in the progress of the understanding of the disease. It also allows us to simulate the false reasoning of previous centuries through the introduction of some medical a priori inherited from archaic medicine. Corroborating previous research, this problem illustrates the importance of the social and cultural environment on the way inductive inference is performed in medicine.

  8. Computational discovery of pathway-level genetic vulnerabilities in non-small-cell lung cancer | Office of Cancer Genomics

    Cancer.gov

    Novel approaches are needed for discovery of targeted therapies for non-small-cell lung cancer (NSCLC) that are specific to certain patients. Whole genome RNAi screening of lung cancer cell lines provides an ideal source for determining candidate drug targets. Unsupervised learning algorithms uncovered patterns of differential vulnerability across lung cancer cell lines to loss of functionally related genes. Such genetic vulnerabilities represent candidate targets for therapy and are found to be involved in splicing, translation and protein folding.

  9. Knowledge discovery from structured mammography reports using inductive logic programming.

    PubMed

    Burnside, Elizabeth S; Davis, Jesse; Costa, Victor Santos; Dutra, Inês de Castro; Kahn, Charles E; Fine, Jason; Page, David

    2005-01-01

    The development of large mammography databases provides an opportunity for knowledge discovery and data mining techniques to recognize patterns not previously appreciated. Using a database from a breast imaging practice containing patient risk factors, imaging findings, and biopsy results, we tested whether inductive logic programming (ILP) could discover interesting hypotheses that could subsequently be tested and validated. The ILP algorithm discovered two hypotheses from the data that were 1) judged as interesting by a subspecialty trained mammographer and 2) validated by analysis of the data itself.

  10. Semi-automated surface mapping via unsupervised classification

    NASA Astrophysics Data System (ADS)

    D'Amore, M.; Le Scaon, R.; Helbert, J.; Maturilli, A.

    2017-09-01

    Due to the increasing volume of data returned from space missions, the human search for correlations and identification of interesting features becomes more and more unfeasible. Statistical extraction of features via machine learning methods will increase the scientific output of remote sensing missions and aid the discovery of yet unknown features hidden in datasets. Those methods exploit algorithms trained on features from multiple instruments, returning classification maps that explore intra-dataset correlation, allowing for the discovery of unknown features. We present two applications, one for Mercury and one for Vesta.

  11. A Metadata based Knowledge Discovery Methodology for Seeding Translational Research.

    PubMed

    Kothari, Cartik R; Payne, Philip R O

    2015-01-01

    In this paper, we present a semantic, metadata based knowledge discovery methodology for identifying teams of researchers from diverse backgrounds who can collaborate on interdisciplinary research projects: projects in areas that have been identified as high-impact areas at The Ohio State University. This methodology involves the semantic annotation of keywords and the postulation of semantic metrics to improve the efficiency of the path exploration algorithm as well as to rank the results. Results indicate that our methodology can discover groups of experts from diverse areas who can collaborate on translational research projects.

  12. The guitar chord-generating algorithm based on complex network

    NASA Astrophysics Data System (ADS)

    Ren, Tao; Wang, Yi-fan; Du, Dan; Liu, Miao-miao; Siddiqi, Awais

    2016-02-01

    This paper aims to generate chords for popular songs automatically based on complex network. Firstly, according to the characteristics of guitar tablature, six chord networks of popular songs by six pop singers are constructed and the properties of all networks are concluded. By analyzing the diverse chord networks, the accompaniment regulations and features are shown, with which the chords can be generated automatically. Secondly, in terms of the characteristics of popular songs, a two-tiered network containing a verse network and a chorus network is constructed. With this network, the verse and chorus can be composed respectively with the random walk algorithm. Thirdly, the musical motif is considered for generating chords, with which the bad chord progressions can be revised. This method can make the accompaniments sound more melodious. Finally, a popular song is chosen for generating chords and the new generated accompaniment sounds better than those done by the composers.
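
    The random-walk step described above can be sketched as a weighted walk over a chord network. This is a minimal illustration only: the network, its edge weights and the function name are hypothetical, and the sketch omits the paper's two-tiered verse/chorus structure and its motif-based revision of bad progressions.

```python
import random


def random_walk_chords(network, start, length, seed=None):
    """Generate a chord progression by a weighted random walk.

    `network` maps each chord to a dict of {next_chord: weight}.
    The walk stops early if the current chord has no outgoing edges.
    """
    rng = random.Random(seed)
    progression = [start]
    current = start
    for _ in range(length - 1):
        neighbours = network.get(current)
        if not neighbours:
            break
        chords = list(neighbours)
        weights = [neighbours[chord] for chord in chords]
        current = rng.choices(chords, weights=weights, k=1)[0]
        progression.append(current)
    return progression


# Hypothetical verse network; weights stand in for transition counts.
VERSE_NETWORK = {
    "C": {"G": 3, "Am": 2, "F": 1},
    "G": {"Am": 2, "C": 2},
    "Am": {"F": 3, "C": 1},
    "F": {"C": 2, "G": 3},
}
PROGRESSION = random_walk_chords(VERSE_NETWORK, "C", 8, seed=0)
print(" ".join(PROGRESSION))
```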

  13. Improving Allergen Prediction in Main Crops Using a Weighted Integrative Method.

    PubMed

    Li, Jing; Wang, Jing; Li, Jing

    2017-12-01

    As a public health problem, food allergy is frequently caused by food allergy proteins, which trigger a type-I hypersensitivity reaction in the immune system of atopic individuals. The food allergens in our daily lives are mainly from crops including rice, wheat, soybean and maize. However, allergens in these main crops are far from fully uncovered. Although some bioinformatics tools or methods predicting the potential allergenicity of proteins have been proposed, each method has its limitations. In this paper, we built a novel algorithm, PREALW, which integrates PREAL, the FAO/WHO criteria and a motif-based method by a weighted average score, to combine the advantages of the different methods. Our results illustrate that PREALW performs significantly better in predicting allergens in these crops. This integrative allergen prediction algorithm could be useful for critical food safety matters. PREALW can be accessed at http://lilab.life.sjtu.edu.cn:8080/prealw.
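
    A weighted average of several method scores, as mentioned above, might look like the following sketch. The method names, scores and weights are hypothetical placeholders; the abstract does not give the actual weights or score scales.

```python
def weighted_average_score(scores, weights):
    """Combine per-method scores into one weighted average.

    `scores` and `weights` are dicts keyed by method name; every method
    in `scores` must have a weight, and the weights must sum to > 0.
    """
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Hypothetical scores in [0, 1] from three predictors for one protein.
SCORES = {"preal": 0.8, "fao_who": 1.0, "motif": 0.0}
# Hypothetical weights; not taken from the paper.
WEIGHTS = {"preal": 2.0, "fao_who": 1.0, "motif": 1.0}

COMBINED = weighted_average_score(SCORES, WEIGHTS)
print(round(COMBINED, 3))
```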

  14. An experimental and computational evolution-based method to study a mode of co-evolution of overlapping open reading frames in the AAV2 viral genome.

    PubMed

    Kawano, Yasuhiro; Neeley, Shane; Adachi, Kei; Nakai, Hiroyuki

    2013-01-01

    Overlapping open reading frames (ORFs) in viral genomes undergo co-evolution; however, how individual amino acids coded by overlapping ORFs are structurally, functionally, and co-evolutionarily constrained remains difficult to address by conventional homologous sequence alignment approaches. We report here a new experimental and computational evolution-based methodology to address this question and report its preliminary application to elucidating a mode of co-evolution of the frame-shifted overlapping ORFs in the adeno-associated virus (AAV) serotype 2 viral genome. These ORFs encode both capsid VP protein and non-structural assembly-activating protein (AAP). To show proof of principle of the new method, we focused on the evolutionarily conserved QVKEVTQ and KSKRSRR motifs, a pair of overlapping heptapeptides in VP and AAP, respectively. In the new method, we first identified a large number of capsid-forming VP3 mutants and functionally competent AAP mutants of these motifs from mutant libraries by experimental directed evolution under no co-evolutionary constraints. We used Illumina sequencing to obtain a large dataset and then statistically assessed the viability of VP and AAP heptapeptide mutants. The obtained heptapeptide information was then integrated into an evolutionary algorithm, with which VP and AAP were co-evolved from random or native nucleotide sequences in silico. As a result, we demonstrate that these two heptapeptide motifs could exhibit high degeneracy if coded by separate nucleotide sequences, and elucidate how overlap-evoked co-evolutionary constraints play a role in making the VP and AAP heptapeptide sequences into the present shape. 
Specifically, we demonstrate that two valine (V) residues and β-strand propensity in QVKEVTQ are structurally important, the strongly negative and hydrophilic nature of KSKRSRR is functionally important, and overlap-evoked co-evolution imposes strong constraints on serine (S) residues in KSKRSRR, despite high degeneracy of the motifs in the absence of co-evolutionary constraints.

  15. CisMiner: Genome-Wide In-Silico Cis-Regulatory Module Prediction by Fuzzy Itemset Mining

    PubMed Central

    Navarro, Carmen; Lopez, Francisco J.; Cano, Carlos; Garcia-Alcalde, Fernando; Blanco, Armando

    2014-01-01

    Eukaryotic gene control regions are known to be spread throughout non-coding DNA sequences which may appear distant from the gene promoter. Transcription factors are proteins that coordinately bind to these regions at transcription factor binding sites to regulate gene expression. Several tools allow the detection of significant co-occurrences of closely located binding sites (cis-regulatory modules, CRMs). However, these tools present at least one of the following limitations: 1) their scope is limited to promoter or conserved regions of the genome; 2) they cannot identify combinations involving more than two motifs; 3) they require prior information about target motifs. In this work we present CisMiner, a novel methodology to detect putative CRMs by means of a fuzzy itemset mining approach able to operate at genome-wide scale. CisMiner allows a blind search of CRMs without any prior information about target CRMs or limitation on the number of motifs. CisMiner tackles the combinatorial complexity of genome-wide cis-regulatory module extraction using a natural representation of motif combinations as itemsets and applying the Top-Down Fuzzy Frequent-Pattern Tree algorithm to identify significant itemsets. Fuzzy technology allows CisMiner to better handle the imprecision and noise inherent to regulatory processes. Results obtained for a set of well-known binding sites in the S. cerevisiae genome show that our method yields highly reliable predictions. Furthermore, CisMiner was also applied to putative in-silico predicted transcription factor binding sites to identify significant combinations in S. cerevisiae and D. melanogaster, proving that our approach can be further applied genome-wide to more complex genomes. CisMiner is freely accessible at: http://genome2.ugr.es/cisminer. 
CisMiner can be queried for the results presented in this work and can also perform a customized cis-regulatory module prediction on a query set of transcription factor binding sites provided by the user. PMID:25268582

  16. Molecular thinking for nanoplasmonic design.

    PubMed

    Guerrero-Martínez, Andrés; Grzelczak, Marek; Liz-Marzán, Luis M

    2012-05-22

    The development of nanoplasmonics has been tremendous during the past two decades, driven in part by the improvements in colloidal synthesis of nanocrystals and manipulation of nanoparticle surface functionalities. This has granted access not only to exquisite control over the morphology of nanoparticles but also to novel multiparticle nanostructures with a variety of organizational motifs. Driven by such new possibilities, completely unforeseen plasmonic effects have been found, which let us think about applications in a variety of fields. In this Perspective, we discuss the evolution of plasmonic nanomaterials and their corresponding properties and correlations with molecular concepts that have been around for a long time. Additional thinking along these lines may lead to further expansion of nanoplasmonics and to multiple surprising discoveries in this field.

  17. [Elaboration of Pseudo-natural Products Using Artificial In Vitro Biosynthesis Systems].

    PubMed

    Goto, Yuki

    2018-01-01

    Peptidic natural products often consist of not only proteinogenic building blocks but also unique non-proteinogenic structures such as macrocyclic scaffolds and N-methylated backbones. Since such non-proteinogenic structures are important structural motifs that contribute to diverse bioactivity, we have proposed that peptides with non-proteinogenic structures should be attractive candidates as artificial bioactive peptides mimicking natural products, or so-called pseudo-natural products. We previously devised an engineered translation system for pseudo-natural peptides, referred to as the flexible in vitro translation (FIT) system. This system enabled "one-pot" synthesis of highly diverse pseudo-natural peptide libraries, which can be rapidly screened by mRNA display technology for the discovery of pseudo-natural peptides with diverse bioactivities.

  18. GIV/Girdin transmits signals from multiple receptors by triggering trimeric G protein activation.

    PubMed

    Garcia-Marcos, Mikel; Ghosh, Pradipta; Farquhar, Marilyn G

    2015-03-13

    Activation of trimeric G proteins has been traditionally viewed as the exclusive job of G protein-coupled receptors (GPCRs). This view has been challenged by the discovery of non-receptor activators of trimeric G proteins. Among them, GIV (a.k.a. Girdin) is the first for which a guanine nucleotide exchange factor (GEF) activity has been unequivocally associated with a well defined motif. Here we discuss how GIV assembles alternative signaling pathways by sensing cues from various classes of surface receptors and relaying them via G protein activation. We also describe the dysregulation of this mechanism in disease and how its targeting holds promise for novel therapeutics. © 2015 by The American Society for Biochemistry and Molecular Biology, Inc.

  19. Transcriptional and chromatin regulation during fasting – The genomic era

    PubMed Central

    Goldstein, Ido; Hager, Gordon L.

    2015-01-01

    An elaborate metabolic response to fasting is orchestrated by the liver and is heavily reliant upon transcriptional regulation. In response to hormones (glucagon, glucocorticoids), many transcription factors (TFs) are activated and regulate various genes involved in metabolic pathways aimed at restoring homeostasis: gluconeogenesis, fatty acid oxidation, ketogenesis and amino acid shuttling. We summarize the recent discoveries regarding fasting-related TFs with an emphasis on genome-wide binding patterns. Collectively, the summarized findings reveal a large degree of co-operation between TFs during fasting, which occurs at motif-rich DNA sites bound by a combination of TFs. These new findings implicate transcriptional and chromatin regulation as major determinants of the response to fasting and unravel the complex, multi-TF nature of this response. PMID:26520657

  20. [Prediction of Promoter Motifs in Virophages].

    PubMed

    Gong, Chaowen; Zhou, Xuewen; Pan, Yingjie; Wang, Yongjie

    2015-07-01

    Virophages have crucial roles in ecosystems and are the transport vectors of genetic materials. To shed light on regulation and control mechanisms in virophage-host systems as well as evolution between virophages and their hosts, the promoter motifs of virophages were predicted in the regions upstream of start codons using an analytical tool for prediction of promoter motifs: Multiple EM for Motif Elicitation. Seventeen potential promoter motifs were identified based on the E-value, location, number and length of promoters in genomes. Sputnik and zamilon motif 2 with AT-rich regions were distributed widely on genomes, suggesting that these motifs may be associated with regulation of the expression of various genes. Motifs containing the TCTA box were predicted to be late promoter motifs in mavirus; motifs containing the ATCT box were the potential late promoter motifs in the Ace Lake mavirus. AT-rich regions were identified on motif 2 in the Organic Lake virophage, motif 3 in Yellowstone Lake virophage (YSLV)1 and 2, motif 1 in YSLV3, and motifs 1 and 2 in YSLV4, respectively. AT-rich regions were distributed widely on the genomes of virophages. All of these motifs may be promoter motifs of virophages. Our results provide insights into further exploration of temporal expression of genes in virophages as well as associations between virophages and giant viruses.

  1. TRStalker: an efficient heuristic for finding fuzzy tandem repeats.

    PubMed

    Pellegrini, Marco; Renda, M Elena; Vecchio, Alessio

    2010-06-15

    Genomes in higher eukaryotic organisms contain a substantial amount of repeated sequences. Tandem Repeats (TRs) constitute a large class of repetitive sequences that are originated via phenomena such as replication slippage and are characterized by close spatial contiguity. They play an important role in several molecular regulatory mechanisms, and also in several diseases (e.g. in the group of trinucleotide repeat disorders). While for TRs with a low or medium level of divergence the current methods are rather effective, the problem of detecting TRs with higher divergence (fuzzy TRs) is still open. The detection of fuzzy TRs is propaedeutic to enriching our view of their role in regulatory mechanisms and diseases. Fuzzy TRs are also important as tools to shed light on the evolutionary history of the genome, where higher divergence correlates with more remote duplication events. We have developed an algorithm (christened TRStalker) with the aim of detecting efficiently TRs that are hard to detect because of their inherent fuzziness, due to high levels of base substitutions, insertions and deletions. To attain this goal, we developed heuristics to solve a Steiner version of the problem for which the fuzziness is measured with respect to a motif string not necessarily present in the input string. This problem is akin to the 'generalized median string' that is known to be an NP-hard problem. Experiments with both synthetic and biological sequences demonstrate that our method performs better than current state of the art for fuzzy TRs and that the fuzzy TRs of the type we detect are indeed present in important biological sequences. TRStalker will be integrated in the web-based TRs Discovery Service (TReaDS) at bioalgo.iit.cnr.it. Supplementary data are available at Bioinformatics online.

  2. PEA: an integrated R toolkit for plant epitranscriptome analysis.

    PubMed

    Zhai, Jingjing; Song, Jie; Cheng, Qian; Tang, Yunjia; Ma, Chuang

    2018-05-29

    The epitranscriptome, also known as chemical modifications of RNA (CMRs), is a newly discovered layer of gene regulation, the biological importance of which emerged through analysis of only a small fraction of CMRs detected by high-throughput sequencing technologies. Understanding of the epitranscriptome is hampered by the absence of computational tools for the systematic analysis of epitranscriptome sequencing data. In addition, no tools have yet been designed for accurate prediction of CMRs in plants, or to extend epitranscriptome analysis from a fraction of the transcriptome to its entirety. Here, we introduce PEA, an integrated R toolkit to facilitate the analysis of plant epitranscriptome data. The PEA toolkit contains a comprehensive collection of functions required for read mapping, CMR calling, motif scanning and discovery, and gene functional enrichment analysis. PEA also takes advantage of machine learning technologies for transcriptome-scale CMR prediction, with high prediction accuracy, using the Positive Samples Only Learning algorithm, which addresses the two-class classification problem by using only positive samples (CMRs), in the absence of negative samples (non-CMRs). Hence PEA is a versatile epitranscriptome analysis pipeline covering CMR calling, prediction, and annotation, and we describe its application to predict N6-methyladenosine (m6A) modifications in Arabidopsis thaliana. Experimental results demonstrate that the toolkit achieved 71.6% sensitivity and 73.7% specificity, which is superior to existing m6A predictors. PEA is potentially broadly applicable to the in-depth study of epitranscriptomics. The PEA Docker image is available at https://hub.docker.com/r/malab/pea, and the source code and user manual are available at https://github.com/cma2015/PEA. Contact: chuangma2006@gmail.com. Supplementary data are available at Bioinformatics online.
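
    For readers unfamiliar with the two figures reported above, the sketch below shows the standard definitions of sensitivity and specificity computed from confusion-matrix counts. The counts are invented for illustration and do not come from the PEA evaluation.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts.

    sensitivity = TP / (TP + FN): fraction of true sites that are recovered.
    specificity = TN / (TN + FP): fraction of non-sites correctly rejected.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative example")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical counts for a toy benchmark.
SENS, SPEC = sensitivity_specificity(tp=8, fn=2, tn=6, fp=4)
print(f"sensitivity={SENS:.2f} specificity={SPEC:.2f}")
```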

  3. Contextual Refinement of Regulatory Targets Reveals Effects on Breast Cancer Prognosis of the Regulome

    PubMed Central

    Andrews, Erik; Wang, Yue; Xia, Tian; Cheng, Wenqing; Cheng, Chao

    2017-01-01

    Gene expression regulators, such as transcription factors (TFs) and microRNAs (miRNAs), have varying regulatory targets based on the tissue and physiological state (context) within which they are expressed. While the emergence of regulator-characterizing experiments has inferred the target genes of many regulators across many contexts, methods for transferring regulator target genes across contexts are lacking. Further, regulator target gene lists frequently are not curated or have permissive inclusion criteria, impairing their use. Here, we present a method called iterative Contextual Transcriptional Activity Inference of Regulators (icTAIR) to resolve these issues. icTAIR takes a regulator’s previously-identified target gene list and combines it with gene expression data from a context, quantifying that regulator’s activity for that context. It then calculates the correlation between each listed target gene’s expression and the quantitative score of regulatory activity, removes the uncorrelated genes from the list, and iterates the process until it derives a stable list of refined target genes. To validate and demonstrate icTAIR’s power, we use it to refine the MSigDB c3 database of TF, miRNA and unclassified motif target gene lists for breast cancer. We then use its output for survival analysis with clinicopathological multivariable adjustment in 7 independent breast cancer datasets covering 3,430 patients. We uncover many novel prognostic regulators that were obscured prior to refinement, in particular NFY, and offer a detailed look at the composition and relationships among the breast cancer prognostic regulome. We anticipate icTAIR will be of general use in contextually refining regulator target genes for discoveries across many contexts. The icTAIR algorithm can be downloaded from https://github.com/icTAIR. PMID:28103241
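
    The refinement loop described in this abstract (score regulator activity, correlate each target with that score, drop uncorrelated targets, repeat until the list is stable) can be sketched as follows. This is a simplified illustration and not the icTAIR implementation: the activity score (mean z-scored expression of the current targets), the correlation cutoff `min_corr`, and the function name `refine_targets` are all assumptions made for the example.

```python
import numpy as np

def _corr(x, y):
    """Pearson correlation; returns 0.0 if either vector has zero variance."""
    x = x - x.mean()
    y = y - y.mean()
    denom = np.sqrt((x @ x) * (y @ y))
    return float(x @ y / denom) if denom > 0 else 0.0

def refine_targets(expr, genes, targets, min_corr=0.3, max_iter=100):
    """Iteratively prune a regulator's target list (hypothetical sketch).

    expr:    array-like of shape (n_genes, n_samples)
    genes:   gene names, one per row of expr
    targets: previously identified target genes of one regulator
    """
    expr = np.asarray(expr, dtype=float)
    index = {g: i for i, g in enumerate(genes)}
    # Z-score every gene across samples so genes are comparable.
    mu = expr.mean(axis=1, keepdims=True)
    sd = expr.std(axis=1, keepdims=True)
    sd[sd == 0] = 1.0
    z = (expr - mu) / sd

    current = [g for g in targets if g in index]  # ignore unmeasured genes
    for _ in range(max_iter):
        if len(current) < 2:
            break
        rows = [index[g] for g in current]
        # ASSUMPTION: activity = mean z-score of the current targets.
        activity = z[rows].mean(axis=0)
        kept = [g for g in current if _corr(z[index[g]], activity) >= min_corr]
        if kept == current:  # list is stable, stop iterating
            break
        current = kept
    return current
```

    With a toy matrix in which three targets rise together and a fourth moves in the opposite direction, the loop drops the fourth gene on the first pass and stops on the second, once the list no longer changes.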

  4. Protein asparagine deamidation prediction based on structures with machine learning methods.

    PubMed

    Jia, Lei; Sun, Yaxiong

    2017-01-01

    Chemical stability is a major concern in the development of protein therapeutics due to its impact on both efficacy and safety. Protein "hotspots" are amino acid residues that are subject to various chemical modifications, including deamidation, isomerization, glycosylation, oxidation, etc. A more accurate prediction method for potential hotspot residues would allow their elimination or reduction as early as possible in the drug discovery process. In this work, we focus on prediction models for asparagine (Asn) deamidation. The sequence-based prediction method simply flags the NG motif (the amino acid asparagine followed by a glycine) as liable to deamidation. It still dominates the deamidation evaluation process in most pharmaceutical settings due to its convenience. However, the simple sequence-based method is less accurate and often leads to over-engineering of a protein. We introduce structure-based prediction models by mining available experimental and structural data of deamidated proteins. Our training set contains 194 Asn residues from 25 proteins that all have available high-resolution crystal structures. Experimentally measured deamidation half-lives of Asn in penta-peptides as well as 3D structure-based properties, such as solvent exposure, crystallographic B-factors, local secondary structure and dihedral angles, were used to train prediction models with several machine learning algorithms. The prediction tools were cross-validated as well as tested with an external test data set. The random forest model had high enrichment in ranking deamidated residues higher than non-deamidated residues while effectively eliminating false positive predictions. It is possible that such quantitative protein structure-function relationship tools can also be applied to other protein hotspot predictions. In addition, we extensively discuss metrics used to evaluate the performance of predictions on unbalanced data sets such as the deamidation case.
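
    The sequence-based baseline mentioned in this abstract amounts to a plain pattern scan. The sketch below illustrates only that baseline rule (Asn immediately followed by Gly), not the structure-based models the authors propose; the function name `find_ng_sites` and the 1-based position convention are assumptions made for the example.

```python
import re

def find_ng_sites(sequence):
    """Return 1-based positions of Asn (N) residues directly followed by Gly (G).

    The input is a one-letter amino acid sequence; case is ignored.
    """
    # The lookahead matches the N only, so the reported position is the Asn itself.
    return [m.start() + 1 for m in re.finditer(r"N(?=G)", sequence.upper())]

sites = find_ng_sites("MANGKLNGSNNGT")
print(sites)  # → [3, 7, 11]
```

    Every NG occurrence is flagged regardless of structural context, which is consistent with the abstract's point that the rule is convenient but can over-predict.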

  5. Discovery of Deep Structure from Unlabeled Data

    DTIC Science & Technology

    2014-11-01

    GPU processors. To evaluate the unsupervised learning component of the algorithms (which has become of less importance in the era of “big data...representations to those in biological visual, auditory, and somatosensory cortex; and ran numerous control experiments investigating the impact of

  6. Materials Discovery | Photovoltaic Research | NREL

    Science.gov Websites

    and specialized analysis algorithms. The Center for Next Generation of Materials by Design (CNGMD) is , incorporating metastable materials into predictive design, and developing theory to guide materials synthesis design, accuracy and relevance, metastability, and synthesizability-to make computational materials

  7. Swarm intelligence in bioinformatics: methods and implementations for discovering patterns of multiple sequences.

    PubMed

    Cui, Zhihua; Zhang, Yi

    2014-02-01

    As a promising and innovative research field, bioinformatics has attracted increasing attention recently. Among the enormous number of open problems in this field, one fundamental issue is the need for accurate and efficient computational methodology that can deal with tremendous amounts of data. In this paper, we survey some applications of swarm intelligence to discovering patterns in multiple sequences. To provide deeper insight, the ant colony optimization, particle swarm optimization, artificial bee colony and artificial fish swarm algorithms are selected, and their applications to the multiple sequence alignment and motif detection problems are discussed.

  8. Have artificial neural networks met expectations in drug discovery as implemented in QSAR framework?

    PubMed

    Dobchev, Dimitar; Karelson, Mati

    2016-07-01

    Artificial neural networks (ANNs) are highly adaptive nonlinear optimization algorithms that have been applied in many diverse scientific endeavors, ranging from economics, engineering, physics, and chemistry to medical science. Notably, in the past two decades, ANNs have been used widely in the process of drug discovery. In this review, the authors discuss advantages and disadvantages of ANNs in drug discovery as incorporated into the quantitative structure-activity relationships (QSAR) framework. Furthermore, the authors examine the recent studies, which span over a broad area with various diseases in drug discovery. In addition, the authors attempt to answer the question about the expectations of the ANNs in drug discovery and discuss the trends in this field. The old pitfalls of overtraining and interpretability are still present with ANNs. However, despite these pitfalls, the authors believe that ANNs have likely met many of the expectations of researchers and are still considered as excellent tools for nonlinear data modeling in QSAR. It is likely that ANNs will continue to be used in drug development in the future.

  9. Accelerating Chemical Discovery with Machine Learning: Simulated Evolution of Spin Crossover Complexes with an Artificial Neural Network.

    PubMed

    Janet, Jon Paul; Chan, Lydia; Kulik, Heather J

    2018-03-01

    Machine learning (ML) has emerged as a powerful complement to simulation for materials discovery by reducing time for evaluation of energies and properties at accuracy competitive with first-principles methods. We use genetic algorithm (GA) optimization to discover unconventional spin-crossover complexes in combination with efficient scoring from an artificial neural network (ANN) that predicts spin-state splitting of inorganic complexes. We explore a compound space of over 5600 candidate materials derived from eight metal/oxidation state combinations and a 32-ligand pool. We introduce a strategy for error-aware ML-driven discovery by limiting how far the GA travels away from the nearest ANN training points while maximizing property (i.e., spin-splitting) fitness, leading to discovery of 80% of the leads from full chemical space enumeration. Over a 51-complex subset, average unsigned errors (4.5 kcal/mol) are close to the ANN's baseline 3 kcal/mol error. By obtaining leads from the trained ANN within seconds rather than days from a DFT-driven GA, this strategy demonstrates the power of ML for accelerating inorganic material discovery.

  10. Sign use and cognition in automated scientific discovery: are computers only special kinds of signs?

    NASA Astrophysics Data System (ADS)

    Giza, Piotr

    2018-04-01

    James Fetzer criticizes the computational paradigm prevailing in cognitive science by questioning what he takes to be its most elementary ingredient: that cognition is computation across representations. He argues that if cognition is taken to be a purposive, meaningful, algorithmic problem-solving activity, then computers are incapable of cognition. Instead, they appear to be signs of a special kind that can facilitate computation. He proposes the conception of minds as semiotic systems as an alternative paradigm for understanding mental phenomena, one that seems to overcome the difficulties of computationalism. I argue that, for computer systems dealing with scientific discovery, the matter is not so simple. The alleged superiority of humans using signs to stand for something other over computers being merely "physical symbol systems" or "automatic formal systems" is only easy to establish in everyday life, but becomes far from obvious when scientific discovery is at stake. In science, as opposed to everyday life, the meaning of symbols is, apart from very low-level experimental investigations, defined implicitly by the way the symbols are used in explanatory theories or experimental laws relevant to the field, and in consequence, human and machine discoverers are much more on a par. Moreover, the great practical success of the genetic programming method and recent attempts to apply it to automatic generation of cognitive theories seem to show that computer systems are capable of very efficient problem-solving activity in science, which is neither purposive nor meaningful, nor algorithmic. This, I think, undermines Fetzer's argument that computer systems are incapable of cognition because computation across representations is bound to be a purposive, meaningful, algorithmic problem-solving activity.

  11. Research of Ad Hoc Networks Access Algorithm

    NASA Astrophysics Data System (ADS)

    Xiang, Ma

    With the continuous development of mobile communication technology, the Ad Hoc access network has become a hot research topic. Ad Hoc access network nodes can be used to expand capacity of multi-hop communication range of mobile communication system, even business adjacent to the community, and improve edge data rates. When the ad hoc network serves as an access network to the Internet, the gateway discovery protocol is very important for choosing the most appropriate gateway to guarantee connectivity between the ad hoc network and IP-based fixed networks. The paper proposes a QoS gateway discovery protocol which uses time delay and route stability as the gateway selection conditions. Based on this gateway discovery protocol, it also proposes a fast handover scheme which can decrease the handover time and improve the handover efficiency.

  12. An efficient, versatile and scalable pattern growth approach to mine frequent patterns in unaligned protein sequences.

    PubMed

    Ye, Kai; Kosters, Walter A; Ijzerman, Adriaan P

    2007-03-15

    Pattern discovery in protein sequences is often based on multiple sequence alignments (MSA). The procedure can be computationally intensive and often requires manual adjustment, which may be particularly difficult for a set of deviating sequences. In contrast, two algorithms, PRATT2 (http://www.ebi.ac.uk/pratt/) and TEIRESIAS (http://cbcsrv.watson.ibm.com/), are used to directly identify frequent patterns from unaligned biological sequences without an attempt to align them. Here we propose a new algorithm with more efficiency and more functionality than both PRATT2 and TEIRESIAS, and discuss some of its applications to G protein-coupled receptors, a protein family of important drug targets. In this study, we designed and implemented six algorithms to mine three different pattern types from either one or two datasets using a pattern growth approach. We compared our approach to PRATT2 and TEIRESIAS in efficiency, completeness and the diversity of pattern types. Compared to PRATT2, our approach is faster, capable of processing large datasets and able to identify the so-called type III patterns. Our approach is comparable to TEIRESIAS in the discovery of the so-called type I patterns but has additional functionality such as mining the so-called type II and type III patterns and finding discriminating patterns between two datasets. The source code for pattern growth algorithms and their pseudo-code are available at http://www.liacs.nl/home/kosters/pg/.

  13. Binary Interval Search: a scalable algorithm for counting interval intersections.

    PubMed

    Layer, Ryan M; Skadron, Kevin; Robins, Gabriel; Hall, Ira M; Quinlan, Aaron R

    2013-01-01

    The comparison of diverse genomic datasets is fundamental to understand genome biology. Researchers must explore many large datasets of genome intervals (e.g. genes, sequence alignments) to place their experimental results in a broader context and to make new discoveries. Relationships between genomic datasets are typically measured by identifying intervals that intersect, that is, they overlap and thus share a common genome interval. Given the continued advances in DNA sequencing technologies, efficient methods for measuring statistically significant relationships between many sets of genomic features are crucial for future discovery. We introduce the Binary Interval Search (BITS) algorithm, a novel and scalable approach to interval set intersection. We demonstrate that BITS outperforms existing methods at counting interval intersections. Moreover, we show that BITS is intrinsically suited to parallel computing architectures, such as graphics processing units by illustrating its utility for efficient Monte Carlo simulations measuring the significance of relationships between sets of genomic intervals. https://github.com/arq5x/bits.
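
    Counting interval intersections without enumerating them can be illustrated with two binary searches per query: every database interval intersects the query unless it ends before the query starts or starts after the query ends. The sketch below shows this counting-by-exclusion idea under the assumption of closed intervals on a single chromosome; it is not taken from the BITS source code, and the function name `count_intersections` is hypothetical.

```python
from bisect import bisect_left, bisect_right

def count_intersections(queries, database):
    """Count pairs (q, d) of closed intervals that overlap.

    queries, database: iterables of (start, end) tuples with start <= end.
    """
    database = list(database)
    # Starts and ends are sorted independently; only counts are needed.
    starts = sorted(s for s, _ in database)
    ends = sorted(e for _, e in database)
    n = len(database)

    total = 0
    for q_start, q_end in queries:
        ends_before = bisect_left(ends, q_start)         # intervals with end < q_start
        starts_after = n - bisect_right(starts, q_end)   # intervals with start > q_end
        total += n - ends_before - starts_after
    return total

db = [(1, 5), (10, 15), (20, 25)]
print(count_intersections([(4, 12), (16, 19), (25, 30)], db))  # → 3
```

    Each query costs two O(log n) lookups after a one-time sort, and the queries are independent of one another, which fits the abstract's remark about suitability for parallel architectures.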

  14. Systems and methods for knowledge discovery in spatial data

    DOEpatents

    Obradovic, Zoran; Fiez, Timothy E.; Vucetic, Slobodan; Lazarevic, Aleksandar; Pokrajac, Dragoljub; Hoskinson, Reed L.

    2005-03-08

    Systems and methods are provided for knowledge discovery in spatial data as well as to systems and methods for optimizing recipes used in spatial environments such as may be found in precision agriculture. A spatial data analysis and modeling module is provided which allows users to interactively and flexibly analyze and mine spatial data. The spatial data analysis and modeling module applies spatial data mining algorithms through a number of steps. The data loading and generation module obtains or generates spatial data and allows for basic partitioning. The inspection module provides basic statistical analysis. The preprocessing module smoothes and cleans the data and allows for basic manipulation of the data. The partitioning module provides for more advanced data partitioning. The prediction module applies regression and classification algorithms on the spatial data. The integration module enhances prediction methods by combining and integrating models. The recommendation module provides the user with site-specific recommendations as to how to optimize a recipe for a spatial environment such as a fertilizer recipe for an agricultural field.

  15. A Knowledge Discovery Approach to Diagnosing Intracranial Hematomas on Brain CT: Recognition, Measurement and Classification

    NASA Astrophysics Data System (ADS)

    Liao, Chun-Chih; Xiao, Furen; Wong, Jau-Min; Chiang, I.-Jen

    Computed tomography (CT) of the brain is the preferred study in neurological emergencies. Physicians use CT to diagnose various types of intracranial hematomas, including epidural, subdural and intracerebral hematomas, according to their locations and shapes. We propose a novel method that can automatically diagnose intracranial hematomas by combining machine vision and knowledge discovery techniques. The skull on the CT slice is located and the depth of each intracranial pixel is labeled. After normalization of the pixel intensities by their depth, the hyperdense area of intracranial hematoma is segmented with multi-resolution thresholding and region-growing. We then apply the C4.5 algorithm to construct a decision tree using the features of the segmented hematoma and the diagnoses made by physicians. The algorithm was evaluated on 48 pathological images treated in a single institute. The two discovered rules closely resemble those used by human experts, and are able to make correct diagnoses in all cases.

  16. Discovering loose group movement patterns from animal trajectories

    USGS Publications Warehouse

    Wang, Yuwei; Luo, Ze; Xiong, Yan; Prosser, Diann J.; Newman, Scott H.; Takekawa, John Y.; Yan, Baoping

    2015-01-01

    The technical advances of positioning technologies enable us to track animal movements at finer spatial and temporal scales, and further help to discover a variety of complex interactive relationships. In this paper, considering the loose gathering characteristics of real-life group members during their movements, we propose two kinds of loose group movement patterns and corresponding discovery algorithms. Firstly, we propose the weakly consistent group movement pattern, which allows the gathering of a part of the members and the temporary departure of individuals from the whole during the movements. To tolerate the high dispersion of the group at some moments (i.e. to adapt to the discontinuity of the group's gatherings), we further devise the weakly consistent and continuous group movement pattern. The extensive experimental analysis and comparison with real and synthetic data show that the group pattern discovery algorithms proposed in this paper are similar to the real-life frequent divergences of the members during the movements, can discover more complete memberships, and have considerable performance.

  17. Mystic: Implementation of the Static Dynamic Optimal Control Algorithm for High-Fidelity, Low-Thrust Trajectory Design

    NASA Technical Reports Server (NTRS)

    Whiffen, Gregory J.

    2006-01-01

    Mystic software is designed to compute, analyze, and visualize optimal high-fidelity, low-thrust trajectories. The software can be used to analyze inter-planetary, planetocentric, and combination trajectories. Mystic also provides utilities to assist in the operation and navigation of low-thrust spacecraft. Mystic will be used to design and navigate NASA's Dawn Discovery mission to orbit the two largest asteroids. The underlying optimization algorithm used in the Mystic software is called Static/Dynamic Optimal Control (SDC). SDC is a nonlinear optimal control method designed to optimize both 'static variables' (parameters) and dynamic variables (functions of time) simultaneously. SDC is a general nonlinear optimal control algorithm based on Bellman's principle.

  18. Generalization Performance of Regularized Ranking With Multiscale Kernels.

    PubMed

    Zhou, Yicong; Chen, Hong; Lan, Rushi; Pan, Zhibin

    2016-05-01

    The regularized kernel method for the ranking problem has attracted increasing attention in machine learning. Previous regularized ranking algorithms are usually based on reproducing kernel Hilbert spaces with a single kernel. In this paper, we go beyond this framework by investigating the generalization performance of regularized ranking with multiscale kernels. A novel ranking algorithm with multiscale kernels is proposed and its representer theorem is proved. We establish the upper bound of the generalization error in terms of the complexity of hypothesis spaces. It shows that the multiscale ranking algorithm can achieve satisfactory learning rates under mild conditions. Experiments demonstrate the effectiveness of the proposed method for drug discovery and recommendation tasks.

  19. Secbase: database module to retrieve secondary structure elements with ligand binding motifs.

    PubMed

    Koch, Oliver; Cole, Jason; Block, Peter; Klebe, Gerhard

    2009-10-01

    Secbase is presented as a novel extension module of Relibase. It integrates the information about secondary structure elements into the retrieval facilities of Relibase. The data are accessible via the extended Relibase user interface, and integrated retrieval queries can be addressed using an extended version of Reliscript. The primary information about alpha-helices and beta-sheets is used as provided by the PDB. Furthermore, a uniform classification of all turn families, based on recent clustering methods, and a new helix assignment that is based on this turn classification has been included. Algorithms to analyze the geometric features of helices and beta-strands were also implemented. To demonstrate the performance of the Secbase implementation, some application examples are given. They provide new insights into the involvement of secondary structure elements in ligand binding. A survey of water molecules detected next to the N-terminus of helices is analyzed to show their involvement in ligand binding. Additionally, the parallel oriented NH groups at the alpha-helix N-termini provide special binding motifs to bind particular ligand functional groups with two adjacent oxygen atoms, e.g., as found in negatively charged carboxylate or phosphate groups, respectively. The present study also shows that the specific structure of the first turn of alpha-helices provides a suitable explanation for stabilizing charged structures. The magnitude of the overall helix macrodipole seems to have no or only a minor influence on binding. Furthermore, an overview of the involvement of secondary structure elements with the recognition of some important endogenous ligands such as cofactors shows some distinct preference for particular binding motifs and amino acids.

  20. DMINDA: an integrated web server for DNA motif identification and analyses

    PubMed Central

    Ma, Qin; Zhang, Hanyuan; Mao, Xizeng; Zhou, Chuan; Liu, Bingqiang; Chen, Xin; Xu, Ying

    2014-01-01

    DMINDA (DNA motif identification and analyses) is an integrated web server for DNA motif identification and analyses, which is accessible at http://csbl.bmb.uga.edu/DMINDA/. This web site is freely available to all users and there is no login requirement. This server provides a suite of cis-regulatory motif analysis functions on DNA sequences, which are important to elucidation of the mechanisms of transcriptional regulation: (i) de novo motif finding for a given set of promoter sequences along with statistical scores for the predicted motifs derived based on information extracted from a control set, (ii) scanning motif instances of a query motif in provided genomic sequences, (iii) motif comparison and clustering of identified motifs, and (iv) co-occurrence analyses of query motifs in given promoter sequences. The server is powered by a backend computer cluster with over 150 computing nodes, and is particularly useful for motif prediction and analyses in prokaryotic genomes. We believe that DMINDA, as a new and comprehensive web server for cis-regulatory motif finding and analyses, will benefit the genomic research community in general and prokaryotic genome researchers in particular. PMID:24753419
